Showing posts with label testing. Show all posts

Wednesday, 6 May 2026

Leveraging Data for Segment Identification and Personalised Experiences

In previous articles, I have explored the practical strategies for analysing user behaviour derived from website interactions and A/B testing methodologies. These techniques are particularly valuable when aiming to identify high-performing user segments and craft targeted digital experiences. However, this process often reveals unexpected insights, especially when initial testing does not yield favourable results.

Consider a scenario in which a newly introduced test design, despite its promising conceptual appeal, fails to deliver the desired impact. Quantitative results may be unequivocally negative, with every relevant key performance indicator (KPI) registering a decline. In such circumstances, the data suggest that the variation, such as Recipe B, is not simply underperforming; it has failed across the board. This presents two immediate paths forward: one may either accept the outcome, analyse the failure, learn from it, and iterate on future design efforts—or choose to segment the data to investigate underlying causes.

This latter approach can become a complex and treacherous endeavour, resembling a descent into a highly intricate and potentially misleading analytical rabbit hole.  Sorry, Alice!

For example:

  • An initial segmentation comparing new versus returning visitors may reveal that returning users did not react as negatively as first-time users.

  • A further segmentation might then show that returning users accessing the site via mobile devices performed slightly better than their desktop counterparts.

  • Digging even deeper, one might uncover that returning mobile users interested in higher-priced products exhibited improved engagement or conversion metrics.

While these findings may appear promising, they come with a significant trade-off. After applying successive layers of filtering, the relevant audience is reduced from an initial 50% of total traffic to just 4.3%. This raises a critical question: is it worth allocating substantial resources to tailor a unique experience for such a narrow segment?
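The shrinkage is easy to sketch. In the snippet below, only the 50% starting share comes from the example above; the other three fractions are hypothetical, picked so that the result lands near 4.3%, and the filters are assumed to be independent of one another (which real traffic rarely is).

```python
# Each filter keeps only a fraction of the audience left by the previous one.
# Only the first fraction (50% of traffic saw Recipe B) comes from the article;
# the remaining fractions are hypothetical values for illustration.
segment_filters = [
    ("Saw Recipe B", 0.50),
    ("Returning visitor", 0.40),                  # hypothetical
    ("On a mobile device", 0.55),                 # hypothetical
    ("Browsing higher-priced products", 0.39),    # hypothetical
]


def remaining_share(filters):
    """Return the share of total traffic left after each successive filter."""
    share = 1.0
    trail = []
    for label, fraction in filters:
        share *= fraction
        trail.append((label, share))
    return trail


trail = remaining_share(segment_filters)
for label, share in trail:
    print(f"{label:<35} {share:.1%} of total traffic")
```

Every extra filter multiplies the audience down again, so three or four plausible-sounding cuts are enough to leave a segment that is a few percent of total traffic.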

For certain brands, particularly high-end luxury retailers such as Rolls-Royce or Beaverbrooks, personalised attention may be both viable and beneficial. However, for businesses operating in more commoditised sectors, such as discount pet supplies, the return on investment for micro-targeting may be far less compelling.

This same dilemma applies when designing personalisation campaigns. As I have highlighted previously, two major challenges confront marketers:

  1. Acquiring and interpreting high-quality, granular data.

  2. Producing and managing sufficient content to serve these differentiated segments.

Assuming these hurdles are overcome to a satisfactory degree, one must still exercise caution in the granularity of targeting. Excessive segmentation risks over-engineering the user experience, while minimal segmentation may fail to resonate at all.

Let us consider a common digital merchandising challenge: What content should be displayed in the homepage hero banner? What recommendations should populate the "We think you'll like this..." carousel? Should your virtual storefront attempt to predict and proactively suggest specific products based on past user behaviour?

For instance, one might propose that a returning visitor be presented with a Lego set—unpromoted, without a discount—on the basis of prior browsing behaviour. This raises the question: is your targeting sophistication high enough to ensure that this suggestion will genuinely resonate?

Many practitioners point to brands such as Netflix and Amazon as exemplars in personalised recommendation systems. Statements like "Because you watched Star Trek: Deep Space Nine" provide a transparent rationale for the curated content that follows. These systems succeed because they offer breadth—providing up to 42 scrollable options—ensuring that even if the primary recommendation does not engage, several others might.

This model can be effectively emulated in retail contexts. For instance, a virtual toy store could present a curated range of 42 Lego models and invite user exploration. Such an approach is far more engaging than presenting a single product, especially when stock limitations may lead to user frustration if the highlighted item is unavailable.

A broader tactic, such as "Would you like to explore our Lego collection?" may prompt greater engagement than asserting, "We believe you want this specific model." The former shows more respect for user independence and is likely to improve interaction metrics (which are often paramount KPIs in digital strategy, right?).

Compare the following messages:

  • “Welcome to our toy shop! These are our favourite toys!”

  • “We think you’re interested in construction toys.”

The first message represents a generic push strategy, commonly found in homepage banners that prioritise brand objectives over user intent ('we' are more important than 'you' - the long-standing tension between marketing and customer needs). The second, although still broad, reflects an attempt to connect with presumed customer interests. In this context, even minimal targeting is advantageous—and likely more effective than overly narrow personalisation.

Let us assume predictive modelling yields:

  • A 71% likelihood a visitor is interested in Lego products.

  • A 23% likelihood they are seeking Lego Technic.

  • A 7% likelihood they want the Lego Technic Excavator.

Rather than presenting the most specific item, a better approach would be to guide users toward the Lego Technic category and enable autonomous navigation from that point. This ensures relevancy while allowing for user discovery and choice.  Move users forwards one step down the funnel (or the rabbit hole), not two.  Maximise your chances of being correct, and let users navigate from there.
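One way to read this rule is as a confidence threshold: walk down the product hierarchy and stop at the most specific level you are still reasonably sure about. The sketch below uses the three likelihoods from the list above; the function name `pick_target` and the 20% cut-off are hypothetical choices for illustration, not a recommended setting.

```python
# Levels run from broadest to most specific, with the predicted likelihood
# that the visitor is interested in each (figures from the example above).
levels = [
    ("Lego", 0.71),
    ("Lego Technic", 0.23),
    ("Lego Technic Excavator", 0.07),
]


def pick_target(levels, min_probability=0.20):
    """Return the most specific level whose likelihood meets the threshold.

    Falls back to the broadest level if nothing meets the threshold.
    The 0.20 default is a hypothetical cut-off, not a recommended value.
    """
    chosen = levels[0]
    for level in levels:
        if level[1] < min_probability:
            break  # deeper levels are narrower still, so stop here
        chosen = level
    return chosen


name, probability = pick_target(levels)
print(f"Send the visitor to: {name} ({probability:.0%} likelihood)")
```

With these numbers the visitor is pointed at the Lego Technic category rather than the Excavator itself; a stricter threshold falls back to the general Lego range.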

While the hypothetical website may sell any range of products, the Lego example is a globally recognisable, visually compelling use case. Naturally, any product category could be substituted in this framework. If I'd been analysing your browsing history, I might have tailored this piece to your specific interests.

Until next time!

Similar posts I've written about online testing

Friday, 24 April 2026

"Click For Free Money" on the Customer Journey

A few years ago, a colleague of mine highlighted a crucial insight: engagement on a button isn't always a reliable measure of success.  I know it's hardly an earth-shattering revelation, but it came at a time when KPIs were focusing on engagement, when he and I were looking at testing CTA wording.  Should we go for "Buy Now" or "Customise and Buy"?  Which is better - "View Details" or "Learn More"?  In short, we discovered that the best wording depended on the point in the purchase path, and we had to separate the analysis, because the button that achieved the highest click-through rate (CTR) wasn't always the one that led to the best overall conversion rate. 

In simpler terms, success isn't just about shuffling customers to the next page in our online purchase path. While persuasive wording can certainly encourage users to click and advance, we need to critically ask ourselves: does this genuinely help them progress beyond that initial stage, or are we just creating a false sense of momentum? This brings us to what I call the "Click for Free Money" fallacy. You can absolutely get people to click a button, but the subsequent page must deliver on the expectation set by that button's wording.  Click for Free Money will certainly get a lot of clicks, but there's no benefit to this approach, as people realise that the next page is actually just more information about our products, and maybe a coupon code to validate the 'free money' offer.  


The Promise and the Reality of the Click

Consider the user's mental model. When a button says "Add to Cart," the user fully expects to be taken to their cart fairly quickly. A brief, relevant detour to offer a highly personalized upsell might be acceptable, but anything more, or anything irrelevant, will create friction. Similarly, a "Customize" button should seamlessly lead to customization options, allowing the user to tailor their product without unnecessary distractions. And perhaps most critically, a "Checkout" button must initiate the checkout process. This is not the time for additional upsells, cross-sells, or any other interruptions that could derail a customer who is ready to complete their purchase. Each click builds an expectation, and failing to meet that expectation, even subtly, erodes trust and increases the likelihood of abandonment.  A lack of 'free money' will erode trust remarkably quickly, although the drop-off will probably show up on the 'free money' page, not the previous page.  Still, if you and your team are prepared to do anything to get users to move forwards from your page, you could deploy this tactic.  It depends on what your KPI actually is (and who's looking at the long-term journey).

The Detrimental Cost of Rushing Customers

Pushing customers forward too quickly isn't just inefficient; it can be detrimental to long-term customer relationships and conversion rates. If you don't continually reinforce the value proposition and reassure the user along their journey, they're highly likely to drop out at a later, more critical stage. For instance, if you enticed a user with a 10% discount while they were casually browsing your site, that discount needs to be prominently displayed and automatically applied on the cart page. Hiding it, or making them jump through hoops to redeem it, is a surefire way to lose a sale and disappoint a customer.

It's vital to shift our focus: moving customers towards a purchase isn't as important as guiding them through their journey to choose the best product for them. This is a fundamental difference between a transactional mindset and a customer-centric approach.

Think of it like a pushy sales assistant in a physical store. Imagine walking in, and immediately, an assistant shoves an item into your hands, puts an arm around your shoulder, and starts steering you towards the checkout. When you haltingly ask, "But is this my size?" they might dismissively respond, "Probably, yes." If you inquire, "Will it perform better than my previous widget?" they might curtly say, "Who cares? Cash or card?" Or if you try one more time, "But is it quieter and more efficient than the old model?" they might simply reply, "Yes. Is the shipping address the same as the billing address?"

While this sales assistant might technically "drag" the customer to the start of the checkout process, what was the true cost? A frustrated, confused, and potentially alienated customer who probably won't complete the purchase and certainly won't return (worse still, they'll tell their friends).

Preparing for the Next Step: The True Purpose of the Funnel

Often, all we accomplish by being overly aggressive or deceptive in the early stages of the funnel is simply shifting the exit point to a later stage in the user's journey. The customer still leaves, but now they're more annoyed because they've invested more time and effort. Instead, it's far more important and beneficial to leverage each step of the customer journey to authentically prepare users for what they'll encounter next. Each stage should provide value to the user, clarify information, and build confidence and trust.  In a situation where different parts of the journey belong to different developers, teams and managers, 

When a user is genuinely informed, reassured, and ready to take the next step—because they understand the value and feel confident in their choice—then and only then, should we make it effortless for them to move forward. This approach fosters trust, reduces abandonment rates, and ultimately leads to more satisfied customers and higher conversion rates in the long run.  We may not offer 'free money', but by subtly pushing users forwards in a path that they aren't fully prepared to take - by providing quicker paths or by removing content that is actually useful to users - we will merely persuade them to move forwards from our page into a situation where they're increasingly likely to leave from the next page.

But still, our next-step page flow metrics show a lovely funnel that shows we moved 10% more users to the next step.  Our analysis, then, needs to show how many users moved forwards to the step after that, and how many actually completed the full purchase journey.

Consider a simplified conversion funnel, from a landing page through a website to completing the checkout process.


Similar to the 'did you just code a distraction?' question - we see that users move forwards to the next step at a much higher rate.  950/998 = 95%, compared to 75% for control, and 65% for Recipe B.  See, you coded a winner, it just happened to be different than the one you expected!

However, when we look further down the funnel, we find that there's a massive 50% drop-off for Recipe C.  We failed to deliver free money, and people left.  Unsurprisingly, fewer visitors then reach the lower funnel stages, and Recipe B is actually the winner (with Recipe C coming behind control).
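The comparison can be sketched in a few lines. Only the 950/998 figure, the 75% and 65% next-step rates and the 50% drop-off for Recipe C come from the description above; every other count (including the 1,000 landing visitors for Control and Recipe B and all the lower-funnel numbers) is hypothetical, chosen to be consistent with the outcome described.

```python
# Visitors reaching each stage: landing page -> next step -> step after -> purchase.
# Apart from 998 -> 950 for Recipe C, the counts are hypothetical illustrations.
funnels = {
    "Control":  [1000, 750, 500, 100],
    "Recipe B": [1000, 650, 520, 120],
    "Recipe C": [998, 950, 475, 90],
}


def step_rates(counts):
    """Rate at which visitors move from each stage to the following one."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


def overall_conversion(counts):
    """Share of landing-page visitors who complete the purchase."""
    return counts[-1] / counts[0]


# The "winner" depends entirely on which step you choose to measure.
next_step_winner = max(funnels, key=lambda name: step_rates(funnels[name])[0])
purchase_winner = max(funnels, key=lambda name: overall_conversion(funnels[name]))

for name, counts in funnels.items():
    rates = ", ".join(f"{rate:.0%}" for rate in step_rates(counts))
    print(f"{name:<9} step rates: {rates} | overall: {overall_conversion(counts):.1%}")

print("Best at the next step:", next_step_winner)
print("Best at purchase:     ", purchase_winner)
```

Judged on the first step alone, Recipe C wins; judged on completed purchases, Recipe B wins and Recipe C falls behind control.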

And if you think that's far-fetched, then perhaps replace 'Free money' with something more believable... like, perhaps, "Add to Basket."  Do customers really get to add the item to basket when they click that button?  Or do they have to select a size, a colour, an upgrade, a guarantee, a warranty or something else first?

Or do you promise things with your CTAs that you aren't really delivering?  "Find out more" needs to show more product information about the product it's connected to.  Getting clicks is easy, but you need to keep an eye on the next step, not just the one you're testing.


Wednesday, 11 February 2026

Why Too Many Cooks Spoil the A/B Testing Roadmap


How to build a lean, decisive testing program without sacrificing collaboration

When Consensus Kills Velocity

A/B testing is - in theory - designed to answer questions quickly: Which experience performs better? Which idea deserves further research and investment from a budget-conscious company and a stretched development team?

The power of testing lies in short cycles, sharp hypotheses, and decisive execution. Yet in many organizations, the A/B testing roadmap becomes a negotiation table. Every team wants to weigh in, every stakeholder demands review, and the result is predictable: delays, diluted experiments, and very little learning.  The time taken to run a test (i.e. from launch to confidence) starts to pale in comparison to the amount of time it takes to decide what should be in the test.

If your testing program needs a large, multi-functional committee to approve every hypothesis, pick every variant, and sign off every launch, you don’t have a testing program— you have a meeting calendar. It's a painful paradox: a method designed to reduce risk and speed up learning turns into a process that creates risk by slowing decisions and reduces learning by launching compromised tests that fail to answer meaningful questions.

The Ubuntu Concept—and Why Sequence Matters

Ubuntu, often summarized as “I am because we are,” is a powerful philosophy of interdependence, empathy, and shared success. A proverb often quoted alongside it runs, "If you want to go far, travel together; if you want to go fast, travel alone."  There can often be an emphasis on teamwork and consensus, but the truth is, with testing, the fewer people there are, the better.  Testing is such a powerful tool, and it takes so long to develop a test, that everybody wants to get involved in deciding what goes in the test, and it has to be pixel perfect.  Tragically, testing's power becomes its downfall: everybody wants to have their input because tests take so long to build, and tests take so long to build because everybody's input has to be taken into consideration.  There needs to be a time when a leader says, "Enough!"  A time when somebody takes responsibility and authority, and actually includes fewer people - not more - in the testing process.

The better application is Ubuntu-after-evidence. Before the test, a small, empowered team should define hypotheses, success metrics, and launch criteria with minimal friction. After the test, the broader community can engage—reviewing results, sharing learnings, and co-creating improvements that benefit everyone.

In other words, we preserve Ubuntu’s spirit of shared learning and collective improvement without letting endless pre-test consensus slow down the very mechanism that generates the learning.

Why Fewer People Make Testing Better

A lean testing team is not anti-collaboration. It is pro-clarity. In my experience, the most effective testing programs are lean and decisive because velocity matters above all. A/B testing’s value decays with time. Long approval and planning cycles reduce relevance and increase opportunity cost. Short cycles mean you learn faster, pivot earlier, and deploy compounding gains sooner.

Clear ownership prevents drift. When the roadmap belongs to everyone, it effectively belongs to no one. A focused owner or small squad can prioritize tests based on impact and feasibility, not politics. Committees tend to compromise, merging ideas into multi-variable tests that answer nothing clearly. Lean teams keep hypotheses tight, variants minimal, and interpretations crisp.

Everybody has the opinion that evidence should replace opinion, except where it contradicts what they want to do. A streamlined process moves debates downstream. Instead of arguing which idea is better, we test quickly and gather data, and then prove which idea was better.

Who Should Be in the Core Team?

The lean model works because it assigns clear roles and decision rights. At the center is the Product Manager, who owns the testing backlog and makes go/no-go decisions. They ensure that tests align with business objectives and that prioritization reflects impact rather than politics.

Alongside the PM, or combined in the same role, is the Experimentation or Data Lead, who designs hypotheses, defines success metrics, and ensures statistical rigor. This person calculates sample sizes, sets stopping rules, and interprets results so that every test produces actionable insights.  This is a specialized testing role, where a detailed and nuanced understanding of testing, significance, confidence, the testing pipeline and dev capabilities is vital.  They will work closely with the Product Manager once the process of testing has been documented and established.
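As an illustration of the sample-size part of that role, here is a minimal sketch using the standard normal-approximation formula for comparing two conversion rates. The function name, the 4% and 5% rates, and the defaults of 5% significance (two-sided) and 80% power are assumptions for the example, not figures from any real test.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH recipe for a two-sided test.

    Uses the normal approximation for two independent proportions:
    n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical example: 4% baseline conversion, hoping to detect a move to 5%.
visitors_needed = sample_size_per_arm(0.04, 0.05)
print(f"Visitors needed per recipe: {visitors_needed}")
```

Halving the effect you want to detect roughly quadruples the traffic required, which is worth knowing before the test is built rather than after it has been running for a month.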

The Engineering representative plays a critical role in implementing variants, managing feature flags, and ensuring telemetry accuracy. They safeguard performance and security standards, making sure experiments do not compromise the user experience.

The Business Lead works with the team to bring forward tactical business needs, which can feed directly into testing requirements.  Do you need to sell more high-price menswear this quarter?  Are you expanding into audiobooks as well as audio streaming services?  Where, from a business perspective, should we focus our short-term efforts?

The UX or Design specialist ensures that variants are user-friendly, accessible, and on-brand within pre-agreed guidelines. Their involvement prevents usability and accessibility issues and maintains consistency without introducing unnecessary complexity.  However, it is absolutely critical that the UX and design specialists support the velocity of the testing roadmap by supplying the assets that are needed for testing without going through their own endless and unnecessary cycles of 'discovery and framing' or 'research', which will single-handedly bring any test roadmap grinding to a complete stop.

Legal and privacy advisors contribute upfront by defining evergreen guardrails for compliance and data handling. They are consulted during framework design, not as ad hoc gatekeepers for every test.

Finally, stakeholders remain informed and can submit ideas through a structured intake process, but they do not have veto power over launches. Their role is to provide context and consume learnings, not to slow down execution.  Stakeholders have visibility, not veto.

A Common Failure Pattern—and How to Fix It

Some organizations follow a familiar failure pattern. The quarterly roadmap is assembled in a multi-department forum. Hypotheses get neutered to accommodate every stakeholder’s request. Legal, Brand, Product, and Design each add gates. Engineering estimates are based on maximum-complexity variants. There are changes during development, and there are more changes during pre-launch quality checks.  Scenarios were not considered.  Tests are launched late with weak hypotheses and unclear success metrics.  Everybody wanted a slice of the testing cake, and now it's a half-baked mess. Results are inconclusive, and trust in testing erodes.

The fix is straightforward but requires discipline. Empower the core team described above with decision rights for hypothesis selection, prioritization, and launch. Set standard guardrails for brand, legal, and privacy constraints that are pre-agreed and do not require ad hoc approvals. Publish metrics definitions that are standardized and documented. Run short test sprints with limited dependencies. Finally, socialize results broadly through summaries and open Q&A sessions—Ubuntu applied post-evidence.

Conclusion: Ubuntu After Evidence

A/B testing is a learning engine. To keep it running, you must minimize friction at the point of decision and maximize inclusion at the point of reflection. Ubuntu teaches us that we rise together; testing teaches us that we rise faster when we let data lead. So don’t abandon Ubuntu—sequence it. Empower a small team to move quickly within clear guardrails. Then invite the broader community to interpret results, celebrate wins, and turn lessons into shared progress. Fewer people at the decision-making table means more learning for everyone.


Wednesday, 3 September 2025

Did You Just Code a Distraction?

In the fast-paced world of web development and digital marketing, creating new features to enhance user experience is not just a common practice, it's pretty much par for the course.  Everybody has ideas about making the website better, and it usually involves some sort of magical feature that will help users find the exact product they want in a mere matter of seconds - a shopping genie, or something involving AI.   The new feature for your website is almost always visually appealing, interactive, and the designers are confident it will boost user engagement with their favourite persona. Before rolling it out, you wisely decide to run an A/B test to measure its effectiveness.

So you code the test, and you run it for the usual length of time.  You follow all the advice on LinkedIn about statistical significance (we can all describe it and we all have our own ways of calculating it, thank you) and getting a decent sample size. The test results are in, and they’re a mixed bag. On one hand, the new feature is a hit in terms of engagement. It receives twice as many clicks compared to the other clickable elements on the page, such as those lovely banners, the promotional links and the pretty pictures. However, your  deeper dive into the data reveals a concerning trend. While the new feature attracts a lot of attention and engagement, the conversion rate for users who interact with it is only around 2.5%. In contrast, the conversion rate for users who engage with the existing content on the page is significantly higher, at around 4.1%. 

This is key.  It really is insufficient to look only at engagement data (click-through rate) as a success metric.  Yes, it is important, but it is not enough.  After all, if you want to create a banner with a high click rate, then you could simply write "Buy one get one free", or better still, "Buy one get two free - click here."  It's essential that you set expectations with your banners, calls to action and features - what's to stop you from writing "Click here for free money!"?  If your priority in testing is to generate clicks, then you'll degenerate into coding the on-site versions of clickbait, and that's a terrible waste of a potential lead.

So, what went wrong with your test? The short answer is that you coded a pretty distraction. Here’s a breakdown of why this happens and how to address it:

Misalignment with User Intent

The new feature, despite being engaging, may not align with the primary intent of your users. If it diverts their attention away from the main conversion paths, it can reduce overall effectiveness. Users might be intrigued by the new feature but not find it relevant to their immediate needs.  You misunderstood your persona's motivation; it might be time to rewrite your persona with this additional information.

Cognitive Load

Introducing a new element can increase the cognitive load on users. They have to process and understand this new feature, which can be mentally taxing. If the feature doesn’t provide immediate value or clarity, users might get distracted and abandon their original task.  They used up their time, effort and patience while interacting with your new feature, and gave up on their primary purpose (which was to buy something from you).

Disruption of User Flow

A well-designed website guides users smoothly towards conversion goals. A new feature that stands out too much can disrupt this flow, causing users to deviate from their intended path. This disruption can lead to lower conversion rates, as users get sidetracked.  How do I get back to my intended path?  This new feature has proved to be the next big shiny thing, and while it's attracting user engagement, it's confusing them and preventing them from getting to where they wanted to go.

The Solutions

To avoid coding distractions, consider the following strategies:

User-Centric Design

Not my favourite phrase, since it leads to design without A/B testing and designers designing for their favourite personas.  Ensure that any new feature is designed with the user’s needs and goals in mind. Conduct user research to understand what your audience values and how they navigate your site, and then align your new features and your development roadmap with these insights.  This will enhance, rather than disrupt, the user experience, and reduce the amount of time wasted on developing the next shiny bauble - it looks nice and impresses senior management, but is not good for users.

Incremental Testing

Instead of launching a fully-fledged feature, start with a minimal viable version and test its impact incrementally. This approach allows you to gather feedback and make necessary adjustments before a full rollout.  Use test data in conjunction with user research to gain a full picture of what you thought was going to happen, and what actually happened.

Clear Value Proposition

Make sure the new feature has a clear and compelling value proposition. Users should immediately understand its purpose and how it benefits them. This clarity can help integrate the feature seamlessly into the user journey.  If the test 'fails', then you'll learn that the value proposition you promoted was not what users wanted to read, and you can try something different.

Monitor and Iterate

Continuously monitor the performance of new features and be ready to iterate based on user feedback and data. If a feature is not performing as expected, don’t hesitate to tweak or even remove it to maintain a smooth user experience.  It's time to swallow your pride and start again.  If you change direction now, you'll have less distance to travel than if you wait six months before unimplementing the eye-catching blunder you've launched.

Conclusion

In the quest to innovate and improve user engagement, it’s crucial to strike a balance between novelty and functionality. While new features can attract attention, they must also support the primary goals of your website. By focusing on user-centric design, incremental testing, and clear value propositions, you can avoid coding distractions and create features that truly enhance the user experience.

Other Web Analytics and Testing Articles I've Written

How not to segment test data (even if your stakeholders ask you to)

Designing Personas for Design Prototypes
Web Analytics - Gathering Requirements from Stakeholders
Analysis is Easy, Interpretation Less So
Telling a Story with Web Analytics Data
Reporting, Analysing, Testing and Forecasting
Pages with Zero Traffic

Wednesday, 27 August 2025

Do You Know How Well Your Test Will Perform?

 There are various ways of running tests - or more specifically, there are various ways of generating test hypotheses.  One that I've come across over the years, and increasingly so more recently, is the 'guess how well your test is going to perform' approach.  It's not called that, but it seems to me to be the most succinct description.  Apparently, we should already have a target improvement in mind before we even start the test.



"If we change the pictures on our site from cats to dogs, then we'll see a 3.5% increase in conversion."
"If we promote construction toys ahead of action figures, then we'll see a 4% lift in revenue."

If you know that's going to happen, why don't you do it anyway?

The main underlying challenge I have is that it's almost impossible to quantify the improvement you're going to get.  How do you know?

Well, let's attempt the calculation (with hypothetical numbers all the way through).

Let's say our latest campaign landing page has a bounce rate (user lands on page, then exits without visiting any other pages) of 75%.  10% engage with site search, 10% click on the menus at the top of the page, and 5% click on the content on the page (there are a few banners and a few links).

We've identified that most users aren't scrolling past the first set of banners and links, and we therefore hypothesise that if we make the banners smaller, and reduce the amount of padding around the links, we can increase engagement with the content in the lower half of the page, and therefore improve the bounce rate.  We believe we can get 50% more links above the fold, and therefore increase the in-page engagement rate from 5% to 7.5%.  We will assume (and this is the fun bit) that this additional traffic converts at the same rate as the 5% we have so far, and therefore, we'll get a 50% lift in the revenue that comes from in-page content clicks.  This sounds like a lot, but given that the engagement rate is going up from a small number to a slightly larger number, it's unlikely to be a huge revenue lift in dollar terms (unless you're pouring in huge volumes of traffic - and watching it bounce at a rate of 75%).
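Here is that arithmetic as a short sketch, using the hypothetical rates from the example. It adds two assumptions of its own, both labelled in the code: the extra content clicks come out of the bounce rate (search and menu usage stay the same), and every engaged visitor is worth the same revenue whichever route they take.

```python
# Hypothetical landing-page rates from the example above.
bounce_rate = 0.75
search_rate = 0.10
menu_rate = 0.10
content_rate_before = 0.05
content_rate_after = 0.075   # 50% more links above the fold

# Lift in the content-click segment alone.
content_lift = content_rate_after / content_rate_before - 1

# ASSUMPTION: search and menu usage are unchanged, so the extra content
# clicks come from visitors who would otherwise have bounced.
engaged_before = search_rate + menu_rate + content_rate_before
engaged_after = search_rate + menu_rate + content_rate_after
bounce_after = 1 - engaged_after

# ASSUMPTION: every engaged visitor is worth the same revenue, whichever
# route they take, so revenue scales with the engaged share.
overall_lift = engaged_after / engaged_before - 1

print(f"Lift in content-click revenue: {content_lift:.0%}")
print(f"Bounce rate after the change:  {bounce_after:.1%}")
print(f"Lift in total page revenue:    {overall_lift:.0%}")
```

Under those assumptions, the 50% figure applies only to the content-click segment; across the whole page the lift is much smaller, and that is before anyone asks whether the assumptions hold.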

Perhaps that was an over-simplification.  But if we knew that our test would give us a 5% lift (and we'd still decided to test it), what happens when we launch the test?  Presumably, we'll stop it when it reaches the 5% lift, irrespective of the confidence level.  But what happens if it doesn't get to 5%?  What if it stubbornly sits at 4%?  Or maybe just 3%?  Did the test win, or did it lose?  In classical scientific terms, it lost, since we disproved our overly-specific hypothesis.  But from a business perspective, it still won, just not by as much as we had originally expected.  Would you go into a meeting with the marketing manager and say, "Sorry, Jim, our test only achieved a 3% revenue lift, so we've decided it was a failure."?

For me, it comes down to two arguments: 

If you can forecast your test result with a high degree of certainty, based on considerable evidence for your hypothesis, it's probably not worth testing and you should implement already.  Testing is best used for edge-cases with some degree of uncertainty. 

If, on the other hand, you have identified a customer problem with your site, and you can see that fixing it will give you a revenue lift - but you don't know how to fix it - then that's very good grounds for testing.  The hypothesis is not, "If we fix this problem, we'll get a 6% revenue lift," but, "If we fix this problem in this way then we'll get a revenue lift".  And that's where you need to encourage the website analysts and the customer feedback department (or the complaints department, or whoever advocates for customers within your company) to come together and find out where the problems are, and what they are, and how to address them.

That will undoubtedly bring good test ideas, and that's what you're looking for, even if you don't know how much revenue lift it will provide.

Other Web Analytics and Testing Articles I've Written

How not to segment test data (when your stakeholders want you to adapt your data)
Web Analytics - Gathering Requirements from Stakeholders
Analysis is Easy, Interpretation Less So (and why it's more valuable)
Telling a Story with Web Analytics Data
Reporting, Analysing, Testing and Forecasting
Pages with Zero Traffic


Sunday, 24 November 2024

Testing versus Implementing - why not just switch it on?

"Why can't we just make a change and see what happens? Why do we have to build an A/B test - it takes too long!  We have a roadmap, a pipeline and a backlog, and we haven't got time."

It's not always easy to articulate why testing is important - especially if your company is making small, iterative, data-backed changes to the site and your tests consistently win (or, worse still, go flat).  The IT team is testing carefully and cautiously, but the time taken to build the test and run it is slowing down everybody's pipelines.  You work with the IT team to build the test (which takes time), it runs (which takes even more time), you analyze the test (why?) and you show that their good idea was indeed a good idea.  Who knew?


Ask an AI what a global IT roadmap looks like...

However, if your IT team is building and deploying something to your website - a new way of identifying a user's delivery address; or a new way of helping users decide which sparkplugs or ink cartridges or running shoes they need - something new, innovative and very different, then I would strongly recommend that you test it with them, even if there is strong evidence for its effectiveness.  Yes, they have carried out user-testing and it's done well.  Yes, their panel loved it.  Even the Head of Global Synergies liked it, and she's a tough one to impress.  Their top designers have spent months in collaboration with the project manager, and their developers have gone through the agile process so many times that they're as flexible as ballet dancers.  They've barely reached the deadline for pre-Christmas implementation, and now is the time to implement it.  It is ready.  However, the Global Integration Leader has said that they must test before they launch, but that's okay as they have allocated just enough time for a pre-launch A/B test, then they'll go live as soon as the test is complete.


Sarah Harries, Head of Global Synergies

Everything hinges on the test launching on time, which it does.  Everybody in the IT team is very excited to see how users engage with the new sparkplug selection tool and - more importantly for everybody else - how much it adds to overall revenue.  (For more on this, remember that clicks aren't really KPIs). 

But the test results come back: you have to report that the test recipe is underperforming, with conversion down by 6.3%.  Engagement looks healthy at 11.7%, but those users are dragging down overall performance.  The page exit rate is lower, but fewer users are going through checkout and completing a purchase.  Even after two full weeks, the data is looking negative.

Can you really recommend implementing the new feature?  No; but that's not the end of the story.  It's your job to now unpick the data, and turn analysis into insights:  why didn't it win?!

The IT team, understandably, want to implement.  After all, they've spent months building this new selector and the pre-launch data was all positive.  The Head of Global Synergies is asking them why it isn't on the site yet.  Their timeline allowed three weeks for testing and you've spent three weeks testing.  Their unspoken assumption was that testing was a validation of the new design, not a step that might turn out to be a roadblock, and they had not anticipated any need for post-test changes.  It was challenging enough to fit in the test, and besides, the request was to test it.

It's time to interrogate the data.

Moreover, they have identified some positive data points:

*  Engagement is an impressive 11.7%.  Therefore, users love it.
*  The page exit rate is lower, so more people are moving forwards.  That's all that matters for this page:  get users to move forwards towards checkout.
*  The drop in conversion is coming from the pages in the checkout process.  That can't be related to the test, which is in the selector pages.  It must be a checkout problem.

They question the accuracy of the test data, which contradicts all their other data.

* The sample size is too small.
* The test was switched off before it had a chance to recover its 6.3% drop in conversion.

They suggest that the whole A/B testing methodology is inaccurate.

* A/B testing is outdated and unreliable.  
* The split between the two groups wasn't 50-50.  There are 2.2% more visitors in A than B.

Maybe they'll comment that the data wasn't analyzed or segmented correctly, and they make some points about this:

* The test data includes users buying other items with their sparkplugs.  These should be filtered out.
* The test data must have included users who didn't see the test experience.
* The data shows that users who browsed on mobile phones only performed at -5.8% on conversion, so they're doing better than desktop users.
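Some of these objections can be checked rather than argued about.  Take the point about the split not being 50-50: whether 2.2% more visitors in A than B is a problem depends on how much traffic the test received.  The sketch below uses a chi-square goodness-of-fit test, which is one common way to check for a mismatch (I'm not claiming it's what any particular testing tool does).  The visitor counts are invented, and `split_check` is a hypothetical helper name.

```python
import math


def split_check(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit test for the traffic split (1 degree of freedom).

    Returns the chi-square statistic and its p-value.
    """
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Both examples have 2.2% more visitors in A than in B (invented counts).
small_test = split_check(5_110, 5_000)
large_test = split_check(511_000, 500_000)

print(f"Small test: chi2={small_test[0]:.2f}, p={small_test[1]:.3f}")
print(f"Large test: chi2={large_test[0]:.2f}, p={large_test[1]:.3g}")
```

A small p-value suggests the imbalance is unlikely to be down to chance, in which case the allocation is worth investigating before anybody trusts the conversion numbers.  A large p-value suggests the imbalance is the kind of noise you would expect from random allocation.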

Remember:  none of this is personal.  You are, despite your best efforts, criticising a project that they've spent weeks or even months polishing and producing.  Nobody until this point has criticised their work, and in fact everybody has said how good it is.  It's not your fault: as a testing professional, your job is to run and analyse tests, and to present the data and provide insights based on it, not to be swayed into showing the data in a particular way.

They ran the test at the request of the Global Integration Leader, and burnt three weeks  waiting for the test to complete.  The deadline for implementing the new sparkplug selector is Tuesday, and they can't stop the whole IT roadmap (which is dependent on this first deployment) just because one test showed some negative data.  They would have preferred not to test it at all, but it remains your responsibility to share the test data with other stakeholders in the business, marketing and merchandizing teams, who have a vested interest in the site's financial performance.  It's not easy, but it's still part of your role to present the unbiased, impartial data that makes up your test analysis, along with the data-driven recommendations for improvements.

It's not your responsibility to make the go/no-go decision, but it is up to you to ensure that the relevant stakeholders and decision-makers have the full data set in front of them when they make the decision.  They may choose to implement the new feature anyway, taking into account that it will need to be fixed with follow-up changes and tweaks once it's gone live.  It's a healthy compromise, providing that they can pull two developers and a designer away from the next item on their roadmap to do retrospective fixes on the new selector.  
Alternatively, they may postpone the deployment and use your test data to address the conversion drops that you've shared.  How are the conversion drop and the engagement data connected?  Is the selector providing valid and accurate recommendations to users?  Does the data show that they enter their car colour and their driving style, but then go to the search function when they reach a question about their engine size?  Is the sequence of questions optimal?  Make sure that you can present these kinds of recommendations - it shows the value of testing, as your stakeholders would not be able to identify these insights from an immediate implementation.

So - why not just switch it on?  Here are four good reasons to share with your stakeholders:

* Test data will give you a comparison of whole-site behaviour - not just 'how many people engaged with the new feature?' but also 'what happens to those people who clicked?' and 'how do they compare with users who don't have the feature?'
* Testing will also tell you about  the financial impact of the new feature (good for return-on-investment calculations, which are tricky with seasonality and other factors to consider)
*  Testing has the key benefit that you can switch it off - at short notice, and at any time.  If the data shows that the test recipe is badly losing money then you identify this, and after a discussion with any key stakeholders, you can pull the plug within minutes.  And you can end the test at any time - you don't have to wait until the next IT deployment window to undeploy the new feature. 
* Testing will give you useful data quickly - within days you'll see how it's performing; within weeks you'll have a clear picture.




Monday, 18 November 2024

Designing Personas for Design Prototypes

Part of my job is validating (i.e. testing and confirming) new designs for the website I work on.  We A/B test the current page against a new page, and confirm (or otherwise) that the new version is indeed better than what we have now.  It's often a last-stop measure before the new design is implemented globally, although it's not always a go/no-go decision.

The new design has gone through various other testing and validation first - a team of qualified user experience designers (UX)  and user interface designers (UI) will have decided how they want to improve the current experience.  They will have undertaken various trials with their designs, and will have built prototypes that will have been shown to user researchers; one of the key parts of the design process, somewhere near the beginning, is the development of user personas.

A persona in this context is a character that forms a 'typical user', who designers and product teams can keep in mind while they're discussing their new design.  They can point to Jane Doe and say, "Jane would like this," or, "Jane would probably click on this, because Jane is an expert user."

I sometimes play Chess in a similar way, when I play solo Chess or when I'm trying to analyze a game I'm playing.  I make a move, and then decide what my opponent would play.  I did this a lot when I was a beginner, learning to play (about 40 years ago) - if I move this piece, then he'll move that piece, and I'll move this piece, and I'll checkmate him in two moves!  This was exactly the thought process I would go through - making the best moves for me, and then guessing my opponent's next move.


It rarely worked out that way, though, when I played a real game.  Instead, my actual opponent would see my plans, make a clever move of his own and capture my key piece before I got the chance to move it within range of his King.


Underestimating (or, to quote a phrase, misunderestimating) my opponent's thoughts and plans is a problem that's inherent with playing skill and strategy games like Chess.  In my head, my opponent can only play as well as I can. 

However, when I play solo, I can make as many moves as I like, but both sides can do whatever I like, and I can win because I constructed my opponent to follow the perfect sequence of moves to let me win.  And I can even fool myself into believing that I won because I had the better ideas and the best strategy.

And this is a common pitfall among Persona Designers (I've written a whole series on the pitfalls of A/B testing).  They impose too much of their own character onto their persona, and suddenly they don't have a persona, they have a puppet.

"Jane Doe is clever enough to scroll through the product specifications to find the compelling content that will answer all her questions."

"Joe Bloggs is a novice in buying jewellery for his wife, so he'll like all these pretty pictures of diamonds."

"John Doe is a novice buyer who wants a new phone and needs to read all this wonderful content that we've spent months writing and crafting."

This is something similar to the Texas Sharpshooter Fallacy (shooting bullets at the side of a barn, then painting the target around them to make the bullet holes look like they hit it).  That's all well and good, until you realize that the real customers who will spend real money purchasing items from our websites, have a very real target that's not determined by where we shoot our bullets.  We might even know the demographics of our customers, but even that doesn't mean we know what (or how) they think.  We certainly can't imbue our personas with characters and hold on to them as firmly as we do in the face of actual customer buying data that shows a different picture.  So what do we do?



"When the facts change, I change my mind. What do you do, sir?"
Paul Samuelson, Economist, 1915-2009


Wednesday, 10 July 2024

How not to Segment Test Data

 Segmenting Test Data Intelligently

Sometimes, a simple 'did it win?' will provide your testing stakeholders with the answer they need. Yes, conversion was up by 5% and we sold more products than usual, so the test recipe was clearly the winner.  However, I have noticed that this simple summary is rarely enough to draw a test analysis to a close.  There are questions about 'did more people click on the new feature?' and 'did we see better performance from people who saw the new banner?'.  There are questions about pathing ('why did more people go to the search bar instead of going to checkout?') and there are questions about these users.  Then we can also provide all the in-built data segments from the testing tool itself.  Whichever tool you use, I am confident it will have new vs return users; users by geographic region; users by traffic source; by landing page; by search term... any way of segmenting your normal website traffic data can be unleashed onto your test data and fill up those slides with pie charts and tables.

After all, segmentation is key, right?  All those out-of-the-box segments are there in the tool because they're useful and can provide insight.

Well, I would argue that while they can provide more analysis, I'm not sure about more insights (as I wrote several years ago).  And I strongly suspect that the out-of-the-box segments are there because they were easy to define and apply back when website analytics was new.  Nowadays, they're there because they've always been there,  and because managers who were there at the dawn of the World Wide Web have come to know and love them (even if they're useless.  The metrics, not the managers).

Does it really help to know that users who came to your site from Bing performed better in Recipe B versus Recipe A?  Well, it might - if the traffic profile during the test run was typical for your site.  If it was, then go ahead and target Recipe B for users who came from Bing.  And please ask your data why the traffic from Bing so clearly preferred Recipe B (don't just leave it at that).

Visitors from Bing performed better in Recipe B?  So what?

Is it useful to know that return users performed better in Recipe C compared to Recipe A?

Not if most of your users make a purchase on their first visit:  they browse the comparison sites, the expert review sites and they even look on eBay, and then they come to your site and buy on their first visit.  So what if Recipe C was better for return users?  Most of your users purchase on their first visit, and what you're seeing is a long-tail effect with a law of diminishing returns.  And don't let the argument that 'All new users become return users eventually' sway you.  Some new users just don't come back - they give up and don't try again.  In a competitive marketplace where speed, efficiency and ease-of-use are now basic requirements instead of luxuries, if your site doesn't work on the first visit, then very few users will come back - they'll find somewhere easier instead.  

And, and, and:  if return users perform better, then why?  Is it because they've had to adjust to your new and unwieldy design?  Did they give up on their first visit, but decide to persevere with it and come back for more punishment because the offer was better and worth the extra effort?  This is hardly a compelling argument for implementing Recipe C.  (Alternatively, if you operate a subscription model, and your whole website is designed and built for regular return visitors, you might be on to something).  It depends on the size of the segments.  If a tiny fraction of your traffic performed better, then that's not really helpful.  If a large section of your traffic - a consistent, steady source of traffic - performed better, then that's worth looking at.

So - how do we segment the data intelligently?

It comes back to those questions that our stakeholders ask us: "How many people clicked?" and "What happened to the people who clicked, and those who didn't?"  These are the questions that are rarely answered with out-of-the-box segments.  "Show me what happened to the people who clicked and those who didn't" leads to answers like, "We should make this feature more visible because people who clicked it converted at a 5% higher rate." You might get the answer that, "This feature gained a very high click rate, but made no impact [or had a negative effect] on conversion." This isn't a feature: it's a distraction, or worse, a roadblock.

The best result is, "People who clicked on this feature spent 10% more than those who didn't."

And - this is more challenging but also more insightful - what about people who SAW the new feature, but didn't click?  We get so hung up on measuring clicks (because clicks are the currency of online commerce) that we forget that people don't read with their mouse button.  Just because somebody didn't click on the message doesn't mean they didn't see it: they saw it and thought, "Not interesting," "not relevant" or "Okay, that's good to know but I don't need to learn more".  The message that says, "10% off with coupon code SAVETEN - Click here for more" doesn't NEED to be clicked.  And ask yourself "Why?" - why are they clicking, why aren't they?  Does your message convey sufficient information without further clicking, or is it just a headline that introduces further important content?  People will rarely click Terms and Conditions links, after all, but they will have seen the link.

We forget that people don't read with their mouse button.

So we're going to need to have a better understanding of impressions (views) - and not just at a page level, but at an element level.  Yes, we all love to have our messages, features and widgets at the top of the page, in what my high school Maths teacher called "Flashing Red Ink".  However, we also have to understand that it may have to be below the fold, and there, we will need to get a better measure of how many people actually scrolled far enough to see the message - and then determine performance for those people.  Fortunately, there's an abundance of tools that do this; unfortunately, we may have to do some extra work to get our numerators and denominators to align.  Clicks may be currency, but they don't pay the bills.
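As a rough illustration of getting the numerators and denominators to align, here is a minimal sketch that splits visitors in a recipe into three groups: those who never saw the element, those who saw it but didn't click, and those who clicked.  The field names (`seen`, `clicked`, `converted`) and the sample rows are assumptions made up for the example; your own tool's export will look different.

```python
def segment_by_exposure(visitors):
    """Group visitors by exposure to an element and report conversion per group."""
    counts = {"not_seen": [0, 0], "seen_no_click": [0, 0], "clicked": [0, 0]}
    for visitor in visitors:
        if visitor["clicked"]:
            key = "clicked"
        elif visitor["seen"]:
            key = "seen_no_click"
        else:
            key = "not_seen"
        counts[key][0] += 1                          # denominator: visitors in group
        counts[key][1] += int(visitor["converted"])  # numerator: orders in group
    return {
        key: {
            "visitors": total,
            "conversion_rate": (orders / total) if total else None,
        }
        for key, (total, orders) in counts.items()
    }


# Invented sample rows, one per visitor in the test recipe.
sample = [
    {"seen": False, "clicked": False, "converted": False},
    {"seen": False, "clicked": False, "converted": True},
    {"seen": True, "clicked": False, "converted": True},
    {"seen": True, "clicked": False, "converted": False},
    {"seen": True, "clicked": False, "converted": True},
    {"seen": True, "clicked": True, "converted": True},
]

report = segment_by_exposure(sample)
for name, stats in report.items():
    print(name, stats)
```

The point of the middle group is that it gives "saw it but didn't click" its own denominator, rather than lumping those visitors in with people who never scrolled far enough to see the element.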

So:  segmentation - yes.  Lazy segmentation - no.

Other articles I've written on Website Analytics that you may find relevant:

Web Analytics - Gathering Requirements from Stakeholders
Analysis is Easy, Interpretation Less So - when to segment, and how.
Telling a Story with Web Analytics Data - how to explain your data in a clear way
Reporting, Analysing, Testing and Forecasting - the differences, and how to do them well
Pages with Zero Traffic - identifying which pages you're wasting effort on.

Friday, 17 May 2024

Multi-Armed Bandit Testing

 I have worked in A/B testing for over 12 years, and blogged about it extensively.  I've covered how to set up a hypothesis, how to test iteratively and even summarized the basics of A/B testing.  I ran my first A/B test on my own website (long since deleted and now only in pieces on a local hard-drive) about 14 years ago.  However, it has taken me this long to actually look into other ways of running online A/B tests apart from the equal 50-50 split that we all know and love.

My recent research led me to discover multi-armed bandit testing, which sounds amazing, confusing and possibly risky (don't bandits wear black eye-masks and operate outside the law??). 

What is multi-armed bandit testing?

The term multi-armed bandit comes from a mathematical problem, which can be phrased like this:

A gambler must choose between multiple slot machines, or "one-armed bandits", each of which has a different, unknown, likelihood of winning. The aim is to find the best or most profitable outcome by a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must try each one-armed bandit to measure their payout rate, and then find a strategy to maximize winnings.


Over time, this will mean putting more money into the machine(s) which provide the best return.

Hence, the multiple one-armed bandits make this the “multi-armed bandit problem,” from which we derive multi-armed bandit testing.

The solution - to put more money into the machine which returns the best prizes most often - translates to online testing: the testing platform dynamically changes the allocation of new test visitors towards the recipes which are showing the best performance so far.  Normally, traffic is allocated randomly between the recipes in fixed proportions, but with multi-armed bandit testing traffic is skewed towards the winning recipe(s).  Instead of the normal fixed 50-50 split (or 25-25-25-25, or whichever), the traffic split is adjusted on a daily (or per-visit) basis.

We see two phases of traffic distribution while the test is running:  initially, we have the 'exploration' phase, where the platform tests and learns, measuring which recipe(s) are providing the best performance (insert your KPI here).  After a potential winner becomes apparent, the percentage of traffic to that recipe starts to increase, while the losers see less and less traffic.  Eventually, the winner will see the vast majority of traffic - although the platform will continue to send a very small proportion of traffic to the losers, to continue to validate its measurements, and this is the 'exploitation' phase.

[Graph: traffic distribution between the recipes over time, with Recipe B emerging as the winner.]
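To make the exploration and exploitation phases more concrete, here is a small simulation.  It uses the epsilon-greedy rule, which is one simple way of allocating traffic; I'm using it because it is easy to read, not because any particular testing platform works this way (many use other approaches, such as Thompson sampling).  The recipe names, the "true" conversion rates and the `run_bandit` function are all invented for the example.

```python
import random


def run_bandit(true_rates, visitors=20_000, epsilon=0.1, seed=1):
    """Simulate epsilon-greedy traffic allocation between recipes.

    true_rates: hypothetical real conversion rate per recipe (unknown to the platform).
    epsilon:    share of visitors allocated at random to keep exploring.
    """
    rng = random.Random(seed)
    names = list(true_rates)
    visits = {name: 0 for name in names}
    conversions = {name: 0 for name in names}

    def estimated_rate(name):
        # Untried recipes get top priority so that each is shown at least once.
        if visits[name] == 0:
            return float("inf")
        return conversions[name] / visits[name]

    for _ in range(visitors):
        if rng.random() < epsilon:
            recipe = rng.choice(names)              # explore: pick any recipe
        else:
            recipe = max(names, key=estimated_rate)  # exploit: best recipe so far
        visits[recipe] += 1
        if rng.random() < true_rates[recipe]:
            conversions[recipe] += 1
    return visits, conversions


visits, conversions = run_bandit({"A": 0.05, "B": 0.15})
total = sum(visits.values())
for name in visits:
    print(f"Recipe {name}: {visits[name] / total:.1%} of traffic, "
          f"{conversions[name]} conversions")
```

The small `epsilon` share is the part that keeps sending a trickle of traffic to the losing recipes, so the platform can keep validating its measurements after a winner has emerged.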

So, why do a multi-armed bandit test instead of a normal A/B test?

If you need to test, learn and implement in a short period of time, then multi-armed may be the way forwards.  For example, if marketing want to know which of two or three banners should accompany the current sales campaign (back to school; Labour Day; holiday weekend), you aren't going to have time to run the test, analyze the results and push the winner.  The campaign ended while you were tinkering with your spreadsheets.  With multi-armed bandit, the platform identifies the best recipe while the test is running, and shifts traffic towards it while the campaign is still active.  When the campaign has ended, you will have maximized your sales performance by showing the winner to most visitors while the campaign was active.

Wednesday, 10 January 2024

Statistics: Type 1 and Type 2 Errors

 In statistics (and by extension, in A/B testing), a Type I error is a false positive conclusion (we think a test recipe won when it didn't), while a Type II error is a false negative conclusion (we think the test recipe lost, when it didn't).  

Making a statistical decision always involves uncertainties, because we're sampling instead of looking at the whole population.  This means the risks of making these errors are unavoidable in hypothesis testing - we don't know everything because we can't measure everything.  However, that doesn't mean we don't know anything - it just means we need to understand what we do and don't know.


The probability of making a Type I error is the significance level, or alpha (α), while the probability of making a Type II error is beta (β).  Incidentally, the statistical power of a test is measured by 1- β.  I'll be looking at the statistical power of a test in a future blog.

These risks can be minimized through careful planning in your test design.

To reduce Type 1 errors, which mean falsely rejecting the null hypothesis - and calling a winner when the results were flat - it is crucial to choose an appropriate significance level and stick to it. Being cautious when interpreting results and also considering what the findings mean may also help mitigate Type 1 errors.  Different companies use different thresholds when testing, depending on how cautious or ambitious they want to be with their testing program.  If there are millions of dollars at risk per year, or developing a new site or design will cost months of work, then requiring a higher confidence level (90% or higher, which is a significance level of 10% or lower) may be the order of the day.  Conversely, if you're a smaller operator with less traffic, or a change that can be easily unpicked if things don't go as expected, then you could accept a lower confidence level (around 80%).

It's worth saying at this point that human beings are lousy at understanding and interpreting probabilities in general.  Confidence levels and probabilities are related but are not directly interchangeable.  The difference in confidence between 90% and 80% is not the same as between 80% and 70%.  It becomes more and more 'difficult' to increase a confidence level as you approach 100% confidence.  After all, can you really say something is 100% certain to happen when you've only taken a sample (even if it's a really large sample)?  On the other hand, it's easy to the point of inevitable that a small sample can give you a 50% confidence level.  What did you prove?  That a coin is equally likely to give you heads or tails?


Type 2 errors can be reduced by (unsurprisingly) using a larger sample size, or by relaxing the confidence level you require, although the latter increases the risk of a Type 1 error.  The sample size determines the degree of sampling error, which in turn sets the ability to detect differences in a hypothesis test. A larger sample size improves the chances of detecting a real difference, which is to say it increases the test's power.
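To show how sample size, confidence level and power pull against each other, here is a sketch using a standard normal-approximation formula for comparing two conversion rates with a two-sided test.  It's a planning approximation, not the exact calculation that any given testing tool performs, and the baseline rate and lift in the example are invented.  `sample_size_per_recipe` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_recipe(baseline_rate, relative_lift, confidence=0.90, power=0.80):
    """Approximate visitors needed per recipe to detect a given relative lift.

    confidence = 1 - alpha (alpha is the Type I error rate, two-sided test)
    power      = 1 - beta  (beta is the Type II error rate)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Invented scenario: 5% baseline conversion, hoping to detect a 10% relative lift.
n_90 = sample_size_per_recipe(0.05, 0.10, confidence=0.90)
n_80 = sample_size_per_recipe(0.05, 0.10, confidence=0.80)
n_small_lift = sample_size_per_recipe(0.05, 0.05, confidence=0.90)

print(f"90% confidence, 10% lift: {n_90:,} visitors per recipe")
print(f"80% confidence, 10% lift: {n_80:,} visitors per recipe")
print(f"90% confidence,  5% lift: {n_small_lift:,} visitors per recipe")
```

Asking for more confidence, more power or the ability to detect a smaller lift all push the required sample size up, which is why it helps to agree these numbers before the test starts.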

Practically speaking, Type 1 and Type 2 errors (false positives and false negatives) are an inherent feature of A/B testing, and the best way to minimize them is to have a pre-agreed minimum sample size, and a pre-determined confidence level that everyone (business teams, marketing, testing team) has agreed on.  Otherwise, there'll be discussions and debates afterwards about what's confident, what's significant and what's actually a winner.


Monday, 14 November 2022

How many of your tests win?

As November heads towards December, and the end of the calendar year approaches, we start the season of Annual Reviews.  It's time to identify, classify and quantify our successes and our failures (or rather, opportunities) from 2022, and to look forward to 2023.  For a testing program, this usually involves the number of tests we've run, and how many recipes were involved; how much money we made and how many of our tests were winners.

If I ask you, I don't imagine you'd tell me, but consider for a moment:  how many of your tests typically win?  How many won this year?  Was it 50%?  Was it 75%?  Was it 90%?  And how does this reflect on your team's performance?

50% or less

It's probably best to frame this as 'avoiding revenue loss'.  Your company tested a new idea, and you prevented them from implementing it, thereby saving your company from losing a (potentially quantifiable) sum of money.  You were, I guess, trying some new ideas, and hopefully pushed the envelope - in the wrong direction, but it was probably worth a try.  Or maybe this shows that your business instincts are usually correct - you're only testing the edge cases.

Around 75%

If 75% of your tests are winning, then you're in a good position and probably able to start picking and choosing the tests that are implemented by your company.  You'll have happy stakeholders who can see the clear incremental revenue that you're providing, and who can see that they're having good ideas.

90% or more

If you're in this apparently enviable position, you are quite probably running tests that you shouldn't be.  You're probably providing an insurance policy for some very solid changes to your website; you're running tests that have such strong analytical support, clear user research or customer feedback behind them that they're just straightforward changes that should be made.  Either that, or your stakeholders are very lucky, or have very good intuition about the website.  No, seriously ;-)

Your win rate will be determined by the level of risk or innovation that your company are prepared to put into their tests.  Are you testing small changes, well-backed by clear analytics?  Should you be?  Or are you testing off-the-wall, game-changing, future-state, cutting edge designs that could revolutionise the online experience? 

I've said before that your test recipes should be significantly different from the current state - different enough to be easy to distinguish from control, and to give you a meaningful delta.  That's not to say that small changes are 'bad', but if you get a winner, it will probably take longer to see it.

Another thought:  the win rate is determined by the quality of the test ideas, and how adventurous the ideas are, and therefore the win rate is a measure of the teams who are driving the test ideas.  If your testing team is focused on test ideas and has strengths in web analytics and customer experience metrics, then your team will probably have a high win rate.  Conversely, if your team is responsible for the execution of test ideas which are produced by other teams, then a measure of test quality will be on execution, test timing, and quantity of the tests you run.  You can't attribute the test win rate (high or low) to a team who develop tests; in fact, the quality of the code is a much better KPI.

What is the optimal test win rate?  I'm not sure that there is one, but it will certainly reflect the character of your test program more than its performance. 

Is there a better metric to look at?   I would suggest "learning rate":  how many of your tests taught you something? How many of them had a strong, clearly-stated hypothesis that was able to drive your analysis of your test (winner or loser) and lead you to learn something about your website, your visitors, or both?  Did you learn something that you couldn't have identified through web analytics and path analysis?  Or did you just say, "It won", or "It lost" and leave it there?  Was the test recipe so complicated, or contain so many changes, that isolating variables and learning something was almost completely impossible?

Whatever you choose, make sure (as we do with our test analysis) that the metric matches the purpose, because 'what gets measured gets done'.

Similar posts I've written about online testing

Getting an online testing program off the ground
Building Momentum in Online testing
Testing vs Implementing Directly


Thursday, 25 August 2022

Testing Towards The Future State

Once or twice in the past, I've talked about how your testing program needs to align with various departments in your company if it's going to build momentum.  For example, you need to test a design that's approved by your site design and branding teams (bright orange CTA buttons might be a big winner for you, but if your brand colour is blue, you're not going to get very far).  

Or what happens if you test a design that wins but isn't approved by the IT team? They just aren't heading towards Flash animations and video clips; they're going to start using 360-degree interactive images instead. The answer: you compiled and coded a very complicated dead end.

But what about the future state of your business model? Are you trying to work out the best way to promote your best-selling product? Are you testing whether to show discounts as £s off or % off? This kind of testing assumes that pricing is important, but take a look at the Rolls-Royce website, which doesn't have any price information on it at all. Scary, isn't it? But apparently that's what a luxury brand looks like (and for a second example, try this luxury restaurant guide).

Apart from sharing the complicated and counter-intuitive navigation of the Rolls-Royce site, the restaurant guide also shares its distinct lack of price information. Even the sorting and filtering excludes any kind of sorting by price - it's just not there.

So, if you're testing the best way of showing price information on your site while the business as a whole is moving towards a luxury status, then it's time to start rethinking your testing program and moving into line with the business.

Conversely, if you're moving your business model towards the mainstream audience in order to increase volumes, then it's time to start looking at pricing (for example) and making your site simpler, less ethereal and less vague, with content that's focused more on the actual features and benefits of the product, and less on the lifestyle. Take, for example, the luxury perfume adverts that proliferate in the run-up to Christmas. You can't convey a smell on television, or online, so instead we get these abstract adverts with people dancing on the moon, bathing in golden liquid or whatever, against a backdrop of classical music. Does it tell you the price? Does it tell you what it smells like? In some cases, does it even tell you what the product is called? Okay, it usually does, but it's a single word at the end, which they say out loud so you know how to pronounce it when you go shopping on the high street.

Compare those with, for example, toy adverts. They are simple, bright and noisy, with clear images of the product, repetition of the brand and product name, and the price (recommended retail price) shown constantly throughout and again at the end. Yes, there are legal requirements regarding toy adverts, but even so, no-one would ever think of a toy as a premium product. Yet somehow, toys sell extremely well year after year, whether cheap or expensive, new or established brand.

So, make sure your testing is in line with business goals - not just KPIs, but the wider business strategy, branding and positioning. Don't go testing price presentation if the prices are being removed from your site; don't test colours of buttons which contravene your marketing guidelines for a classy monochrome site, and so on. Business goals are not always financial, so keep in touch with marketing!