Web Optimisation, Maths and Puzzles: testing


Showing posts with label testing.

Sunday, 24 November 2024

Testing versus Implementing - why not just switch it on?

"Why can't we just make a change and see what happens? Why do we have to build an A/B test - it takes too long!  We have a roadmap, a pipeline and a backlog, and we haven't got time."

It's not always easy to articulate why testing is important - especially if your company is making small, iterative, data-backed changes to the site and your tests consistently win (or, worse still, go flat).  The IT team is testing carefully and cautiously, but the time taken to build and run each test is slowing down everybody's pipelines.  You work with the IT team to build the test (which takes time); it runs (which takes even more time); you analyze the test (why?); and you show that their good idea was indeed a good idea.  Who knew?


Ask an AI what a global IT roadmap looks like...

However, if your IT team is building and deploying something to your website - a new way of identifying a user's delivery address, or a new way of helping users decide which sparkplugs or ink cartridges or running shoes they need - something new, innovative and very different, then I would strongly recommend that you test it with them, even if there is strong evidence for its effectiveness.  Yes, they have carried out user-testing and it's done well.  Yes, their panel loved it.  Even the Head of Global Synergies liked it, and she's a tough one to impress.  Their top designers have spent months in collaboration with the project manager, and their developers have gone through the agile process so many times that they're as flexible as ballet dancers.  They've only just made the deadline for pre-Christmas implementation, and now is the time to implement it.  It is ready.  However, the Global Integration Leader has said that they must test before they launch - but that's okay: they've allocated just enough time for a pre-launch A/B test, and they'll go live as soon as the test is complete.


Sarah Harries, Head of Global Synergies

Everything hinges on the test launching on time, which it does.  Everybody in the IT team is very excited to see how users engage with the new sparkplug selection tool and - more importantly for everybody else - how much it adds to overall revenue.  (For more on this, remember that clicks aren't really KPIs). 

But the test results come back, and you have to report that the test recipe is underperforming: conversion is down by 6.3%.  Engagement looks healthy at 11.7%, but those users are dragging down overall performance.  The page exit rate is lower, yet fewer users are going through checkout and completing a purchase.  Even after two full weeks, the data is looking negative.
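Before the debates start, it's worth checking whether a drop of that size could plausibly be noise. Here is a minimal two-proportion z-test sketch in Python; the visitor and order counts are hypothetical, purely to illustrate the arithmetic.

```python
# A minimal significance check for the conversion gap between control (A)
# and the test recipe (B). Visitor and order counts are hypothetical.
from math import sqrt
from statistics import NormalDist

visitors_a, orders_a = 50_000, 2_000   # control: 4.0% conversion (hypothetical)
visitors_b, orders_b = 50_000, 1_874   # test: ~3.75% conversion, ~6.3% relative drop

p_a = orders_a / visitors_a
p_b = orders_b / visitors_b
p_pool = (orders_a + orders_b) / (visitors_a + visitors_b)

# Standard two-proportion z-test with a pooled standard error
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed

print(f"Relative change: {(p_b - p_a) / p_a:+.1%}")
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

With these made-up numbers the drop is statistically significant at the 95% level, which is exactly the situation described above: a real, measurable loser.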

Can you really recommend implementing the new feature?  No; but that's not the end of the story.  It's now your job to unpick the data and turn analysis into insights: why didn't it win?

The IT team, understandably, want to implement.  After all, they've spent months building this new selector and the pre-launch data was all positive.  The Head of Global Synergies is asking them why it isn't on the site yet.  Their timeline allowed three weeks for testing and you've spent three weeks testing.  Their unspoken assumption was that testing was a validation of the new design, not a step that might turn out to be a roadblock, and they had not anticipated any need for post-test changes.  It was challenging enough to fit in the test, and besides, the request was to test it.

It's time to interrogate the data.

Moreover, they have identified some positive data points:

*  Engagement is an impressive 11.7%.  Therefore, users love it.
*  The page exit rate is lower, so more people are moving forwards.  That's all that matters for this page:  get users to move forwards towards checkout.
*  The drop in conversion is coming from the pages in the checkout process.  That can't be related to the test, which is in the selector pages.  It must be a checkout problem.

They question the accuracy of the test data, which contradicts all their other data.

* The sample size is too small.
* The test was switched off before it had a chance to recover from its 6.3% drop in conversion.

They suggest that the whole A/B testing methodology is inaccurate.

* A/B testing is outdated and unreliable.  
* The split between the two groups wasn't 50-50.  There are 2.2% more visitors in A than B.

Maybe they'll comment that the data wasn't analyzed or segmented correctly, and make some points about this:

* The test data includes users buying other items with their sparkplugs.  These should be filtered out.
* The test data must have included users who didn't see the test experience.
* The data shows that users who browsed on mobile phones only performed at -5.8% on conversion, so they're doing better than desktop users.

Remember:  none of this is personal.  You are, despite your best efforts, criticising a project that they've spent weeks or even months polishing and producing.  Nobody until this point has criticised their work - in fact, everybody has said how good it is.  It's not your fault: your job is to present the data and to provide insights based on it.  As a testing professional, your job is to run and analyse tests, not to be swayed into showing the data in a particular way.

They ran the test at the request of the Global Integration Leader, and burnt three weeks  waiting for the test to complete.  The deadline for implementing the new sparkplug selector is Tuesday, and they can't stop the whole IT roadmap (which is dependent on this first deployment) just because one test showed some negative data.  They would have preferred not to test it at all, but it remains your responsibility to share the test data with other stakeholders in the business, marketing and merchandizing teams, who have a vested interest in the site's financial performance.  It's not easy, but it's still part of your role to present the unbiased, impartial data that makes up your test analysis, along with the data-driven recommendations for improvements.

It's not your responsibility to make the go/no-go decision, but it is up to you to ensure that the relevant stakeholders and decision-makers have the full data set in front of them when they make the decision.  They may choose to implement the new feature anyway, taking into account that it will need to be fixed with follow-up changes and tweaks once it's gone live.  It's a healthy compromise, providing that they can pull two developers and a designer away from the next item on their roadmap to do retrospective fixes on the new selector.  
Alternatively, they may postpone the deployment and use your test data to address the conversion drops that you've shared.  How are the conversion drop and the engagement data connected?  Is the selector providing valid and accurate recommendations to users?  Does the data show that they enter their car colour and their driving style, but then go to the search function when they reach a question about their engine size?  Is the sequence of questions optimal?  Make sure that you can present these kinds of recommendations - it shows the value of testing, as your stakeholders would not be able to identify these insights from an immediate implementation.

So - why not just switch it on?  Here are four good reasons to share with your stakeholders:

* Test data will give you a comparison of whole-site behaviour - not just 'how many people engaged with the new feature?' but also 'what happens to those people who clicked?' and 'how do they compare with users who don't have the feature?'
* Testing will also tell you about  the financial impact of the new feature (good for return-on-investment calculations, which are tricky with seasonality and other factors to consider)
*  Testing has the key benefit that you can switch it off - at short notice, and at any time.  If the data shows that the test recipe is losing money badly, you can identify this and, after a discussion with the key stakeholders, pull the plug within minutes - no waiting for the next IT deployment window to undeploy the new feature.
* Testing will give you useful data quickly - within days you'll see how it's performing; within weeks you'll have a clear picture.




Monday, 18 November 2024

Designing Personas for Design Prototypes

Part of my job is validating (i.e. testing and confirming) new designs for the website I work on.  We A/B test the current page against a new page, and confirm (or otherwise) that the new version is indeed better than what we have now.  It's often a last-stop measure before the new design is implemented globally, although it's not always a go/no-go decision.

The new design has gone through various other testing and validation first - a team of qualified user experience designers (UX)  and user interface designers (UI) will have decided how they want to improve the current experience.  They will have undertaken various trials with their designs, and will have built prototypes that will have been shown to user researchers; one of the key parts of the design process, somewhere near the beginning, is the development of user personas.

A persona in this context is a character that forms a 'typical user', who designers and product teams can keep in mind while they're discussing their new design.  They can point to Jane Doe and say, "Jane would like this," or, "Jane would probably click on this, because Jane is an expert user."

I sometimes play Chess in a similar way, when I play solo Chess or when I'm trying to analyze a game I'm playing.  I make a move, and then decide what my opponent would play.  I did this a lot when I was a beginner, learning to play (about 40 years ago) - if I move this piece, then he'll move that piece, and I'll move this piece, and I'll checkmate him in two moves!  This was exactly the thought process I would go through - making the best moves for me, and then guessing my opponent's next move.


It rarely worked out that way, though, when I played a real game.  Instead, my actual opponent would see my plans, make a clever move of his own and capture my key piece before I got chance to move it within range of his King.


Underestimating (or, to quote a phrase, misunderestimating) my opponent's thoughts and plans is a problem that's inherent with playing skill and strategy games like Chess.  In my head, my opponent can only play as well as I can. 

However, when I play solo, I can make as many moves as I like, but both sides can do whatever I like, and I can win because I constructed my opponent to follow the perfect sequence of moves to let me win.  And I can even fool myself into believing that I won because I had the better ideas and the best strategy.

And this is a common pitfall among Persona Designers (I've written a whole series on the pitfalls of A/B testing).  They impose too much of their own character onto their persona, and suddenly they don't have a persona, they have a puppet.

"Jane Doe is clever enough to scroll through the product specifications to find the compelling content that will answer all her questions."

"Joe Bloggs is a novice in buying jewellery for his wife, so he'll like all these pretty pictures of diamonds."

"John Doe is a novice buyer who wants a new phone and needs to read all this wonderful content that we've spent months writing and crafting."

This is similar to the Texas Sharpshooter Fallacy (shooting bullets at the side of a barn, then painting the target around them to make the bullet holes look like they hit it).  That's all well and good, until you realize that the real customers who will spend real money purchasing items from our websites have a very real target that's not determined by where we shoot our bullets.  We might even know the demographics of our customers, but that doesn't mean we know what (or how) they think.  We certainly can't imbue our personas with characters and hold on to them as firmly as we do in the face of actual customer buying data that shows a different picture.  So what do we do?



"When the facts change, I change my mind. What do you do, sir?"
Paul Samuelson, Economist,1915-2009


Wednesday, 10 July 2024

How not to Segment Test Data

 Segmenting Test Data Intelligently

Sometimes, a simple 'did it win?' will provide your testing stakeholders with the answer they need. Yes, conversion was up by 5% and we sold more products than usual, so the test recipe was clearly the winner.  However, I have noticed that this simple summary is rarely enough to draw a test analysis to a close.  There are questions about 'did more people click on the new feature?' and 'did we see better performance from people who saw the new banner?'.  There are questions about pathing ('why did more people go to the search bar instead of going to checkout?') and there are questions about these users.  Then we can also provide all the in-built data segments from the testing tool itself.  Whichever tool you use, I am confident it will have new vs return users; users by geographic region; users by traffic source; by landing page; by search term... any way of segmenting your normal website traffic data can be unleashed onto your test data and fill up those slides with pie charts and tables.

After all, segmentation is key, right?  All those out-of-the-box segments are there in the tool because they're useful and can provide insight.

Well, I would argue that while they can provide more analysis, I'm not sure about more insights (as I wrote several years ago).  And I strongly suspect that the out-of-the-box segments are there because they were easy to define and apply back when website analytics was new.  Nowadays, they're there because they've always been there,  and because managers who were there at the dawn of the World Wide Web have come to know and love them (even if they're useless.  The metrics, not the managers).

Does it really help to know that users who came to your site from Bing performed better in Recipe B versus Recipe A?  Well, it might - if the traffic profile during the test run was typical for your site.  If it is, then go ahead and target Recipe B for users who came from Bing.  And please ask your data why the traffic from Bing so clearly preferred Recipe B (don't just leave it at that).

Visitors from Bing performed better in Recipe B?  So what?

Is it useful to know that return users performed better in Recipe C compared to Recipe A?

Not if most of your users make a purchase on their first visit:  they browse the comparison sites, the expert review sites and they even look on eBay, and then they come to your site and buy on their first visit.  So what if Recipe C was better for return users?  Most of your users purchase on their first visit, and what you're seeing is a long-tail effect with a law of diminishing returns.  And don't let the argument that 'All new users become return users eventually' sway you.  Some new users just don't come back - they give up and don't try again.  In a competitive marketplace where speed, efficiency and ease-of-use are now basic requirements instead of luxuries, if your site doesn't work on the first visit, then very few users will come back - they'll find somewhere easier instead.  

And, and, and:  if return users perform better, then why?  Is it because they've had to adjust to your new and unwieldy design?  Did they give up on their first visit, but decide to persevere with it and come back for more punishment because the offer was better and worth the extra effort?  This is hardly a compelling argument for implementing Recipe C.  (Alternatively, if you operate a subscription model, and your whole website is designed and built for regular return visitors, you might be on to something).  It depends on the size of the segments.  If a tiny fraction of your traffic performed better, then that's not really helpful.  If a large section of your traffic - a consistent, steady source of traffic - performed better, then that's worth looking at.

So - how do we segment the data intelligently?

It comes back to those questions that our stakeholders ask us: "How many people clicked?" and "What happened to the people who clicked, and those who didn't?"  These are the questions that are rarely answered with out-of-the-box segments.  "Show me what happened to the people who clicked and those who didn't" leads to answers like, "We should make this feature more visible because people who clicked it converted at a 5% higher rate." You might get the answer that, "This feature gained a very high click rate, but made no impact [or had a negative effect] on conversion." This isn't a feature: it's a distraction, or worse, a roadblock.

The best result is, "People who clicked on this feature spent 10% more than those who didn't."
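If your testing tool doesn't produce this comparison out of the box, it's straightforward to build from visitor-level data. A minimal sketch, using a hypothetical set of visitor records rather than a real analytics export:

```python
# A sketch of click-based segmentation, using a hypothetical list of visitor
# records rather than a real analytics export.
from dataclasses import dataclass

@dataclass
class Visitor:
    clicked_feature: bool
    converted: bool
    revenue: float

visitors = [
    Visitor(True, True, 120.0), Visitor(True, False, 0.0),
    Visitor(False, True, 95.0), Visitor(False, False, 0.0),
    Visitor(False, True, 80.0), Visitor(True, True, 140.0),
]

def summarise(segment):
    n = len(segment)
    conv = sum(v.converted for v in segment) / n
    rpv = sum(v.revenue for v in segment) / n   # revenue per visitor
    return n, conv, rpv

for label, seg in [("clicked", [v for v in visitors if v.clicked_feature]),
                   ("did not click", [v for v in visitors if not v.clicked_feature])]:
    n, conv, rpv = summarise(seg)
    print(f"{label}: n={n}, conversion={conv:.1%}, revenue/visitor=£{rpv:.2f}")
```

The same split (clicked vs did not click) works for any success metric you care about - conversion, revenue per visitor, or progression to the next page.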

And - this is more challenging but also more insightful - what about people who SAW the new feature, but didn't click?  We get so hung up on measuring clicks (because clicks are the currency of online commerce) that we forget that people don't read with their mouse button.  Just because somebody didn't click on the message doesn't mean they didn't see it: they saw it and thought, "Not interesting," "Not relevant," or "Okay, that's good to know, but I don't need to learn more."  The message that says, "10% off with coupon code SAVETEN - Click here for more" doesn't NEED to be clicked.  And ask yourself "Why?" - why are they clicking, and why aren't they?  Does your message convey sufficient information without further clicking, or is it just a headline that introduces further important content?  People will rarely click Terms and Conditions links, after all, but they will have seen the link.

We forget that people don't read with their mouse button.

So we're going to need to have a better understanding of impressions (views) - and not just at a page level, but at an element level.  Yes, we all love to have our messages, features and widgets at the top of the page, in what my high school Maths teacher called "Flashing Red Ink".  However, we also have to understand that it may have to be below the fold, and there, we will need to get a better measure of how many people actually scrolled far enough to see the message - and then determine performance for those people.  Fortunately, there's an abundance of tools that do this; unfortunately, we may have to do some extra work to get our numerators and denominators to align.  Clicks may be currency, but they don't pay the bills.
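The key is to divide element clicks by element impressions, not by page views. A tiny sketch with hypothetical counts shows how much the denominator matters:

```python
# Aligning numerators and denominators at the element level.
# All counts here are hypothetical.
page_views      = 100_000   # everyone who loaded the page
element_viewed  = 38_000    # scrolled far enough for the element to appear
element_clicked = 4_100     # clicked the element

ctr_per_page_view  = element_clicked / page_views      # misleading denominator
ctr_per_impression = element_clicked / element_viewed  # aligned denominator

print(f"CTR per page view:          {ctr_per_page_view:.1%}")
print(f"CTR per element impression: {ctr_per_impression:.1%}")
```

Same clicks, very different story: the feature looks ignored when measured against every page view, and reasonably engaging when measured against the people who actually scrolled it into view.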

So:  segmentation - yes.  Lazy segmentation - no.

Other articles I've written on Website Analytics that you may find relevant:

Web Analytics - Gathering Requirements from Stakeholders
Analysis is Easy, Interpretation Less So - when to segment, and how.
Telling a Story with Web Analytics Data - how to explain your data in a clear way
Reporting, Analysing, Testing and Forecasting - the differences, and how to do them well
Pages with Zero Traffic - identifying which pages you're wasting effort on.

Friday, 17 May 2024

Multi-Armed Bandit Testing

 I have worked in A/B testing for over 12 years, and blogged about it extensively.  I've covered how to set up a hypothesis, how to test iteratively and even summarized the basics of A/B testing.  I ran my first A/B test on my own website (long since deleted and now only in pieces on a local hard-drive) about 14 years ago.  However, it has taken me this long to actually look into other ways of running online A/B tests apart from the equal 50-50 split that we all know and love.

My recent research led me to discover multi-armed bandit testing, which sounds amazing, confusing and possibly risky (don't bandits wear black eye-masks and operate outside the law??). 

What is multi-armed bandit testing?

The term multi-armed bandit comes from a mathematical problem, which can be phrased like this:

A gambler must choose between multiple slot machines, or "one-armed bandits", each of which has a different, unknown likelihood of winning. The aim is to find the best or most profitable outcome through a series of choices. At the beginning of the experiment, when odds and payouts are unknown, the gambler must try each one-armed bandit to measure its payout rate, and then find a strategy to maximize winnings.


Over time, this will mean putting more money into the machine(s) which provide the best return.

Hence, the multiple one-armed bandits make this the “multi-armed bandit problem,” from which we derive multi-armed bandit testing.

The solution - to put more money into the machine which returns the best prizes most often - translates to online testing: the testing platform dynamically changes the allocation of new test visitors towards the recipes which are showing the best performance so far.  Normally, traffic is allocated randomly between the recipes, but with multi-armed bandit testing traffic is skewed towards the winning recipe(s).  Instead of the normal 50-50 split (or 25-25-25-25, or whichever), the traffic split is adjusted on a daily (or visit-by-visit) basis.

We see two phases of traffic distribution while the test is running:  initially, we have the 'exploration' phase, where the platform tests and learns, measuring which recipe(s) are providing the best performance (insert your KPI here).  After a potential winner becomes apparent, the percentage of traffic to that recipe starts to increase, while the losers see less and less traffic.  Eventually, the winner will see the vast majority of traffic - although the platform will continue to send a very small proportion of traffic to the losers, to continue to validate its measurements, and this is the 'exploitation' phase.
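The platform vendors don't all use the same allocation algorithm, but epsilon-greedy is one of the simplest ways to see exploration and exploitation in action. A small simulation sketch, with invented conversion rates:

```python
# A minimal epsilon-greedy bandit simulation. The true conversion rates are
# invented for illustration; a real platform would be allocating live visitors.
import random

true_rates = {"A": 0.040, "B": 0.046, "C": 0.038}   # hypothetical recipes
epsilon = 0.10                                       # share of traffic kept exploring
shown = {r: 0 for r in true_rates}
converted = {r: 0 for r in true_rates}

def observed_rate(recipe):
    return converted[recipe] / shown[recipe] if shown[recipe] else 0.0

random.seed(42)
for visitor in range(50_000):
    if random.random() < epsilon or visitor < 300:   # exploration
        recipe = random.choice(list(true_rates))
    else:                                            # exploitation: best recipe so far
        recipe = max(true_rates, key=observed_rate)
    shown[recipe] += 1
    converted[recipe] += random.random() < true_rates[recipe]

for recipe in true_rates:
    print(recipe, shown[recipe], f"{observed_rate(recipe):.2%}")
```

Run it and the traffic counts tell the story of the two phases: a roughly even spread early on, then the bulk of visitors flowing to recipe B, with a trickle still reaching A and C to keep validating the measurement.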

The graph for the traffic distribution over time may look something like this:


...where Recipe B is the winner.

So, why do a multi-armed bandit test instead of a normal A/B test?

If you need to test, learn and implement in a short period of time, then multi-armed bandit testing may be the way forwards.  For example, if marketing want to know which of two or three banners should accompany the current sales campaign (back to school; Labour Day; holiday weekend), you aren't going to have time to run a conventional test, analyze the results and push the winner - the campaign will have ended while you were tinkering with your spreadsheets.  With a multi-armed bandit, the platform identifies the best recipe while the test is running, and promotes it while the campaign is still active.  When the campaign has ended, you will have maximized your sales performance by showing the winner while the campaign was active.

Wednesday, 10 January 2024

Statistics: Type 1 and Type 2 Errors

In statistics (and by extension, in testing), a Type I error is a false positive conclusion (we think a test recipe won when it didn't), while a Type II error is a false negative conclusion (we conclude the test recipe didn't win, when in fact it did).

Making a statistical decision always involves uncertainties, because we're sampling instead of looking at the whole population.  This means the risks of making these errors are unavoidable in hypothesis testing - we don't know everything because we can't measure everything.  However, that doesn't mean we don't know anything - it just means we need to understand what we do and don't know.


The probability of making a Type I error is the significance level, or alpha (α), while the probability of making a Type II error is beta (β).  Incidentally, the statistical power of a test is measured by 1- β.  I'll be looking at the statistical power of a test in a future blog.

These risks can be minimized through careful planning in your test design.

To reduce Type 1 errors - falsely rejecting the null hypothesis and calling a winner when the results were flat - it is crucial to choose an appropriate significance level and stick to it. Being cautious when interpreting results, and considering what the findings actually mean, will also help mitigate Type 1 errors.  Different companies use different confidence thresholds when testing, depending on how cautious or ambitious they want to be with their testing program.  If there are millions of dollars at risk per year, or a new site or design will cost months of work, then adopting a higher confidence threshold (90% or above) may be the order of the day.  Conversely, if you're a smaller operator with less traffic, or the change can be easily unpicked if things don't go as expected, then you could use a lower threshold (80%, for example).

It's worth saying at this point that human beings are generally lousy at understanding and interpreting probabilities.  Confidence levels and probabilities are related, but are not directly interchangeable.  The difference in confidence between 90% and 80% is not the same as between 80% and 70%: it becomes more and more 'difficult' to increase a confidence level as you approach 100%.  After all, can you really say something is 100% certain to happen when you've only taken a sample (even if it's a really large sample)?  On the other hand, it's easy to the point of inevitable for a small sample to give you a 50% confidence level.  What did you prove?  That a coin is equally likely to give you heads or tails?


Type 2 errors can be minimised by increasing the statistical power of the test, or (unsurprisingly) by using a larger sample size.  The sample size determines the degree of sampling error, which in turn sets the ability to detect differences in a hypothesis test. A larger sample size increases the chance of capturing real differences, and also increases the test's power.
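To make this concrete, the usual normal-approximation formula gives a required sample size per recipe for a chosen baseline rate, minimum detectable effect, alpha and power. A sketch with hypothetical inputs:

```python
# Sample size per recipe from the usual normal-approximation formula:
# n = ((z_alpha/2 + z_beta)^2 * (p1(1-p1) + p2(1-p2))) / (p1 - p2)^2
# The baseline rate and minimum detectable effect below are hypothetical.
from math import ceil
from statistics import NormalDist

def sample_size(p_baseline, relative_lift, alpha=0.05, power=0.80):
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# e.g. 4% baseline conversion, hoping to detect a 5% relative lift
print(sample_size(0.04, 0.05), "visitors per recipe")
```

The answer (well over 100,000 visitors per recipe for these inputs) is exactly why the minimum sample size needs agreeing before the test launches, not after.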

Practically speaking, Type 1 and Type 2 errors (false positives and false negatives) are an inherent feature of A/B testing, and the best way to minimize them is to have a pre-agreed minimum sample size and a pre-determined confidence level that everyone (business teams, marketing, testing team) has agreed on.  Otherwise, there'll be discussions and debates afterwards about what's confident, what's significant, and what's actually a winner.

Monday, 14 November 2022

How many of your tests win?

As November heads towards December, and the end of the calendar year approaches, we start the season of Annual Reviews.  It's time to identify, classify and quantify our successes and failures (sorry, opportunities) from 2022, and to look forward to 2023.  For a testing program, this usually means the number of tests we've run and how many recipes were involved; how much money we made and how many of our tests were winners.

If I ask you, I don't imagine you'd tell me, but consider for a moment:  how many of your tests typically win?  How many won this year?  Was it 50%?  Was it 75%?  Was it 90%?  And how does this reflect on your team's performance?

50% or less

It's probably best to frame this as 'avoiding revenue loss'.  Your company tested a new idea, and you prevented them from implementing it, thereby saving your company from losing a (potentially quantifiable) sum of money.  You were, I guess, trying some new ideas, and hopefully pushed the envelope - in the wrong direction, but it was probably worth a try.  Or maybe this shows that your business instincts are usually correct - you're only testing the edge cases.

Around 75%

If 75% of your tests are winning, then you're in a good position and probably able to start picking and choosing the tests that are implemented by your company.  You'll have happy stakeholders who can see the clear incremental revenue that you're providing, and who can see that they're having good ideas.

90% or more

If you're in this apparently enviable position, you are quite probably running tests that you shouldn't be.  You're probably providing an insurance policy for some very solid changes to your website; you're running tests that have such strong analytical support, clear user research or customer feedback behind them that they're just straightforward changes that should be made.  Either that, or your stakeholders are very lucky, or have very good intuition about the website.  No, seriously ;-)

Your win rate will be determined by the level of risk or innovation that your company are prepared to put into their tests.  Are you testing small changes, well-backed by clear analytics?  Should you be?  Or are you testing off-the-wall, game-changing, future-state, cutting edge designs that could revolutionise the online experience? 

I've said before that your test recipes should be significantly different from the current state - different enough to be easy to distinguish from control, and to give you a meaningful delta.  That's not to say that small changes are 'bad', but if you get a winner, it will probably take longer to see it.

Another thought:  the win rate is determined by the quality of the test ideas, and how adventurous those ideas are, and therefore the win rate is really a measure of the teams who are driving the test ideas.  If your testing team generates its own test ideas and has strengths in web analytics and customer experience metrics, then your team will probably have a high win rate.  Conversely, if your team is responsible for executing test ideas which are produced by other teams, then a better measure of its quality will be execution, test timing, and the quantity of tests you run.  You can't attribute the test win rate (high or low) to a team who only build and develop tests; in fact, the quality of the code is a much better KPI for them.

What is the optimal test win rate?  I'm not sure that there is one, but it will certainly reflect the character of your test program more than its performance. 

Is there a better metric to look at?   I would suggest "learning rate":  how many of your tests taught you something? How many of them had a strong, clearly-stated hypothesis that was able to drive your analysis of your test (winner or loser) and lead you to learn something about your website, your visitors, or both?  Did you learn something that you couldn't have identified through web analytics and path analysis?  Or did you just say, "It won", or "It lost" and leave it there?  Was the test recipe so complicated, or contain so many changes, that isolating variables and learning something was almost completely impossible?

Whatever you choose, make sure (as we do with our test analysis) that the metric matches the purpose, because 'what gets measured gets done'.

Similar posts I've written about online testing

Getting an online testing program off the ground
Building Momentum in Online testing
Testing vs Implementing Directly


Thursday, 25 August 2022

Testing Towards The Future State

Once or twice in the past, I've talked about how your testing program needs to align with various departments in your company if it's going to build momentum.  For example, you need to test a design that's approved by your site design and branding teams (bright orange CTA buttons might be a big winner for you, but if your brand colour is blue, you're not going to get very far).  

Or what happens if you test a design that wins but isn't approved by the IT team - they just aren't heading towards Flash animations and video clips, because they're going to start using 360-degree interactive images?  The answer:  you've compiled and coded a very complicated dead end.

But what about the future state of your business model?  Are you trying to work out the best way to promote your best-selling product?  Are you testing whether to show discounts as £s off or % off?  This kind of testing assumes that pricing is important, but take a look at the Rolls Royce website, which doesn't have any price information on it at all.  Scary, isn't it?  But apparently that's what a luxury brand looks like (and for a second example, try this luxury restaurant guide).

The restaurant guide shares not only the complicated and counter-intuitive navigation of the Rolls Royce site, but also its distinct lack of price information.  Even the sorting and filtering excludes any kind of sorting by price - it's just not there.

So, if you're testing the best way of showing price information on your site while the business as a whole is moving towards a luxury status, then it's time to start rethinking your testing program and moving into line with the business.

Conversely, if you're moving your business model towards the mainstream audience in order to increase volumes, then it's time to start looking at pricing (for example) and making your site simpler, less ethereal and less vague, with content that's focused more on the actual features and benefits of the product, and less on the lifestyle.  Take, for example, the luxury perfume adverts that proliferate in the run-up to Christmas.  You can't convey a smell on television, or online, so instead we get these abstract adverts with people dancing on the moon; bathing in golden liquid or whatever, against a backdrop of classical music.  Does it tell you the price?  Does it tell you what it smells like?  In some cases, does it even tell you what the product is called?  Okay, it usually does, but it's a single word at the end, which they say out loud so you know how to pronounce it when you go shopping on the high street.

Compare those with, for example, toy adverts.  Simple, bright, noisy, clear images of the product, repetition of the brand and product name, and the prices (recommended retail price) running constantly throughout and again at the end.  Yes, there are legal requirements regarding toy adverts, but even so, no-one would ever think of a toy as a premium product. Yet somehow, toys sell extremely well year after year, whether cheap or expensive, new or established brand.

So, make sure your testing is in line with business goals - not just KPIs, but the wider business strategy, branding and positioning. Don't go testing price presentation if the prices are being removed from your site; don't test colours of buttons which contravene your marketing guidelines for a classy monochrome site, and so on. Business goals are not always financial, so keep in touch with marketing!


Thursday, 24 June 2021

How long should I run my test for?

 A question I've been facing more frequently recently is "How long can you run this test for?", and its close neighbour "Could you have run it for longer?"

Different testing programs have different requirements:  in fact, different tests have different requirements.  The test flight of the helicopter Ingenuity on Mars lasted 39.1 seconds, straight up and down.  The Wright Brothers' first flight lasted 12 seconds, and covered 120 feet.  Which was the more informative test?  Which should have run longer?

There are various ideas around testing, but the main principle is this:  test for long enough to get enough data to prove or disprove your hypothesis.  If your hypothesis is weak, you may never get enough data.  If you're looking for a straightforward winner/loser, then make sure you understand the concept of confidence and significance.

What is enough data?  It could be 100 orders.  It could be clicks on a banner: the first test recipe to reach 100 clicks - or 1,000, or 10,000 - is the winner (assuming it has a large enough lead over the other recipes).

An important limitation to consider is this:  what happens if your test recipe is losing?  Losing money; losing leads; losing quotes; losing video views.  Can you keep running a test just to get enough data to show why it's losing?  Testing suddenly becomes an expensive business, when each extra day is costing you revenue.   One of the key advantages of testing over 'launch it and see' is the ability to switch the test off if it loses; how much of that advantage do you want to give up just to get more data on your test recipe?

Maybe your test recipe started badly.  After all, many do:  the change of experience from the normal site design to your new, all-improved, management-funded, executive-endorsed design is going to come as a shock to your loyal customers, and it's no surprise when your test recipe takes a nose-dive in performance for a few days.  Or weeks.  But how long can you give your design before you have to admit that it's not just the shock of the new design (sometimes called 'confidence sickness'), but that there are aspects of it that need to be changed before it will reach parity with your current site?  A week?  Two weeks?  A month?  Looking at data over time will help here.  How was performance in week 1?  Week 2?  Week 3?  It's possible for a test to recover, but if the initial drop was severe then the overall picture may never recover; however, if you can see that the fourth week was actually flat (for new and return visitors), then you've found the point where users adjusted to your new design.
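A simple way to do this is to compute the conversion gap week by week and watch whether it narrows. A sketch with made-up weekly figures:

```python
# Week-by-week view of the test recipe's conversion gap against control.
# The weekly figures are hypothetical.
weekly = [
    # (week, control conversion, test conversion)
    (1, 0.040, 0.034),
    (2, 0.041, 0.037),
    (3, 0.040, 0.039),
    (4, 0.041, 0.041),
]

previous_gap = None
for week, control, test in weekly:
    gap = (test - control) / control
    trend = ""
    if previous_gap is not None:
        trend = "narrowing" if gap > previous_gap else "widening or flat"
    print(f"Week {week}: {gap:+.1%} {trend}")
    previous_gap = gap
```

In this invented example the gap closes by week 4 - the 'confidence sickness' pattern. If the weekly gaps were steady or growing instead, that's the signal discussed below.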

If, however, the weekly gaps are widening, or staying the same, then it's time to pack up and call it a day.

Let's not forget that you probably have other tests in your pipeline which are waiting for the traffic that you're using on your test.  How long can they wait until launch?

So, how long should you run your test for?  Long enough to get the data you need, and maybe longer if you can - unless the test is:
- suffering from confidence sickness (keep it running)
- losing badly and consistently (stop it, unless you're prepared to pay for your test data)
- losing and holding up your testing pipeline (stop it and move on)

Similar posts I've written about online testing

Getting an online testing program off the ground
Building Momentum in Online testing
How many of your tests win?

Wright Brothers Picture:

"Released to Public: Wilber and Orville Wright with Flyer II at Huffman Prairie, 1904 (NASA GPN-2002-000126)" by pingnews.com is marked with CC PDM 1.0

Tuesday, 8 December 2020

A/B testing without a 50-50 split

Whenever people ask me what I do for a living, I [try not to] launch off into a little speech about how I improve website design and experience by running tests, where we split traffic 50-50 between test and control, and mathematically determine which is better.  Over the years, it's been refined and dare I say optimized, but that's the general theme, because that's the easiest way of describing what I do.  Simple.

There is nothing in the rules, however, that says you have to split traffic 50-50.  We typically use a 50-50 split because it gives each visitor an equal, random chance of landing in one of two groups - like tossing a coin - but that's just tradition (he says, tearing up the imaginary rule book).

Why might you want to test on a different split setting?

1.  Maybe your test recipe is so completely 'out-there' and different from control that you're worried it will affect your site's KPIs, and you want to test more cautiously.  So, why not do a 90-10?  You only risk 10% of your total traffic - and providing that 10% is large enough to produce a decent sample size, why risk a further 40%?  And if it starts winning, then maybe you increase to an 80-20 split, and move towards 50-50 eventually.

2.  Maybe your test recipe is based on a previous winner, and you want to get more of your traffic into a recipe that should be a winner as quickly as possible (while also checking that it is still a winner).  So you have the opportunity to test on a 10-90 split, with most of your traffic on the test experience and 10% held back as a control group to confirm your previous winner.

3.  Maybe you need test data quickly - you are confident you can use historic data for the control group, but you need to get data on the test page/site/experience, and for that, you'll need to funnel more traffic into the test group.  You can use a combination of historic data and control group data to measure the current state performance, and then get data on how customers interact with the new page (especially if you're measuring clicks on a new widget on the page, and how customers like or dislike it).

4.  Maybe you're running a Multi-Armed Bandit test.

Things to watch out for

If you decide to run an A/B test on uneven splits, then beware:

- You need to emphasise conversion rates, and calculate your KPIs as "per visitor" or "per impression".  I'm sure you do this already with your KPIs, but absolute numbers of orders, clicks or revenue will not be suitable here.  If you have twice as much traffic in B compared to A (a 66-33 split), then you should expect twice as many success events from an identical success rate; you'll need to divide by visit, visitor or page view (depending on your metric, and your choice) - see the sketch after this list.

- You can't do multivariate analysis on uneven splits - as I mentioned in my articles on MVT analysis, you need equal-ish numbers of visits in order to combine the data from the different recipes.
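As a quick illustration of the first point, here's a sketch comparing recipes on a hypothetical 90-10 split: the raw order counts favour A heavily, but the per-visitor rates tell the real story.

```python
# Comparing recipes on an uneven (90-10) split: raw order counts mislead,
# per-visitor rates don't. All counts are hypothetical.
recipes = {
    "A (90%)": {"visitors": 90_000, "orders": 3_600},
    "B (10%)": {"visitors": 10_000, "orders": 430},
}

for name, data in recipes.items():
    rate = data["orders"] / data["visitors"]
    print(f"{name}: {data['orders']} orders, conversion {rate:.2%}")
# A has far more orders in absolute terms, but B converts at 4.30% vs 4.00%.
```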


Monday, 18 November 2019

Web Analytics: Requirements Gathering

Everybody knows why your company has a website, and everybody tracks the site's KPIs.

Except that this is a modern retelling of the story of three blind men who tried to describe an elephant by touch alone, and everyone has a limited and specific view of your website.  Are you tracking orders? Are you tracking revenue? Are you tracking traffic? Organic? Paid? Banner? Affiliate? Or, dare I ask, are you just tracking hits?

This siloed approach can actually work, with each person - or more likely, each team - working towards a larger common goal which can be connected to one of the site's actual KPIs.  After all, more traffic should lead to more orders, in theory.  The real problem arises when people from one team start talking to another about the success of a joint project.  Suddenly, we have an unexpected culture clash and two teams, working within the same business, are speaking virtually different languages.  The words are the same, but the definitions are different, and while everybody is using the same words, they're actually discussing very different concepts.

At this stage, it becomes essential to take a step back and take time to understand what everyone means when they use phrases like "KPIs", "success metrics", or even "conversion". I mean, everyone knows there's one agreed definition of conversion, right? No?  Add to cart; complete order; complete a quote, or a lead-generation activity - I have seen and heard all of these called 'conversion'.

When it comes to testing, this situation can become amplified, as recipes are typically being proposed or supported by different teams with different aims.  One team's KPIs may be very different from another's.  As the testing lead, it's your responsibility to determine what the aims of the test are, and from them - and nothing else - what the KPIs are.  Yes, you can have more than one KPI, but you must then determine which KPI is actually the most important (or dare I say, "key"), and negotiate these with your stakeholders.

A range of my previous pieces of advice on testing become more critical here, as you'll need to ensure that your test recipes really do test your hypothesis, and that your metrics will actually measure it.  And, to avoid any doubt, make sure you define your success criteria in terms of basic metrics (visits, visitors, orders, revenue, page views, file downloads), so that everybody is on the same page (literally and metaphorically).


Keep everybody updated on your plans, and keep asking the obvious questions - assume as little as possible and make sure you gather all your stakeholders' ideas and requirements.  What do you want to test? Why? What do you want to measure? Why?

Yes, you might sound like an insistent three-year-old, but it will be worth it in the end!


Wednesday, 28 November 2018

The Hierarchy of A/B Testing

As any A/B testing program matures, it becomes important to work out not only what you should test (and why), but also to start identifying the order in which to run your tests.  

For example, let's suppose that your customer feedback team has identified a need for a customer support tool that helps customers choose which of your products best suits them.  Where should it fit on the page?  What should it look like?  What should it say?  What color should it be?  Is it beneficial to customers?  How are you going to unpick all these questions and come up with a testing strategy for this new concept?

These questions should be arranged into a sequence of tests, with the most important questions answered first; once those are answered, the rest can follow in order.
Firstly:  PRESENCE:  is this new feature beneficial to customers?
In our hypothetical example, it's great that the customer feedback team have identified a potential need for customers.  The first question to answer is: does the proposed solution meet customer needs?  And the test that follows from that is:  what happens if we put it on the page?  Not where (top versus bottom), or what it should look like (red versus blue versus green), but should it go anywhere on the page at all?

If you're feeling daring, you might even test removing existing content from the page.  It's possible that content has been added slowly and steadily over weeks, months or even longer, and hasn't been tested at any point.  You may ruffle some feathers with this approach, but if something looks out of place then it's worth asking why it was put there.  If you get an answer similar to "It seemed like a good idea at the time" then you've probably identified a test candidate.


Let's assume that your first test is a success, and it's a winner.  Customers like the new feature, and you can see this because you've looked at engagement with it - how many people click on it, hover near it, enter their search parameters and see the results, and it leads to improved conversion.

Next:  POSITION:  where should it fit on the page?

Your first test proved that it should go on the page - somewhere.  The next step is to determine the optimum placement.  Should it get pride of place at the top of the page, above the fold (yes, I still believe in 'the fold' as a concept)?  Or is it a sales support tool that is best placed somewhere below all the marketing banners and product lists?  Or does it even fit at the bottom of the page as a catch-all for customers who are really searching for your products?


Because web pages come in so many different styles...
This test will show you how engagement varies with placement for this tool - but watch out for changes in click through rates for the other elements on your page.  You can expect your new feature to get more clicks if you place it at the top of the page, but are these at the expense of clicks on more useful page content?  Naturally, the team that have been working on the new feature will have their own view on where the feature should be placed, but what's the best sequence for the page as a whole?  And what's actually best for your customer?

Next:  APPEARANCE:  what should it look like?
This question covers a range of areas that designers will love to tweak and play with.  At this point, you've answered the bigger questions around presence (yes) and position (optimum), and now you're moving on to appearance.  Should it be big and bold?  Should it fit in with the rest of the page design, or should it stand out?  Should it be red, yellow, green or blue?  There are plenty of questions to answer here, and you'll never be short of ideas to test.


Take care:
It is possible to answer multiple questions with one test that has multiple recipes, but take care to avoid addressing the later questions without first answering the earlier ones. 
If you introduce your new feature in the middle of the page (without testing) and then start testing what the headline and copy should say, then you're testing in a blind alley, without knowing whether you have the best placement already.  And if your test recipes all lose, was it because you changed the headline from "Find your ideal sprocket" to "Select the widget that suits you", or was it because the feature simply doesn't belong on the page at all?

Also take care not to become bogged down in fine detail questions when you're still answering more general questions.  It's all too easy to become tangled up in discussions about whether the feature is black with white text, or white with black text, when you haven't even tested having the feature on the page.  The cosmetic questions around placement and appearance are far more interesting and exciting than the actual necessary aspects of getting the new element onto the page and making it work.  

For example, NASA recently landed another probe on Mars.  It wasn't easy, and I don't imagine there were many people at NASA who were quibbling about the colour of the parachute or the colour of the actual space rocket.  Most people were focused on actually getting the probe onto the martian surface.  The same general rule applies in A/B testing - sometimes just getting the new element working and present on the page generates enough difficulties and challenges, especially if it's a dynamic element that involves calling APIs or other third-party services.

In those situations, yes, there are design questions to answer, but 'best guess' is a perfectly acceptable answer.  What should it look like?  Use your judgement; use your experience; maybe even use previous test data, and come back to it in a later test.  

But don't go introducing additional complexity and more variables where they're really not welcome.  What colour was the NASA parachute?  The one that was easiest to produce.

Once your first test on presence has been completed, it becomes a case of optimizing any remaining details.  CTA button wording and color; smaller elements within the new feature; the 'colour of the parachute' and so on.  You'll find there's more interest in tweaking the design of a winner than there is in actually getting it working, but that's fine... just roll with it!


Monday, 25 June 2018

Data in Context (England 6 - Panama 1)

There's no denying it, England have made a remarkable and unprecedented start to their World Cup campaign.  6-1 is their best ever score in a World Cup competition, exceeding their previous record of 3-0 against Paraguay and against Poland (both achieved in the Mexico '86 competition).  A look at a few data points emphasises the scale of the win:

*  The highest ever England win (any competition) is 13-0 against Ireland in February 1882.
*  England now share the record for most goals in the first half of a World Cup game (five, joint record with Germany, who won 7-1 against Brazil in 2014).
* The last time England scored four or more goals in a World Cup game was in the final of 1966.
*  Harry Kane joins Ron Flowers (1962) as the only players to score in England's first two games at a World Cup tournament.

However, England are not usually this prolific - they scored as many goals against Panama on Sunday as they had in their previous seven World Cup matches in total.  This makes the Panama game an outlier; an unusual result; you could even call it a freak result... Let's give the data a little more context:

- Panama are playing in their first ever World Cup, and scored their first ever World Cup goal against England.
- Panama's qualification relied on a highly dubious (and non-existent) "ghost goal"

- Panama's world ranking is 55th (just behind Jamaica) down from a peak of 38th in 2013. England's world ranking is 12th.
- Panama's total population is around 4 million people.  England's is over 50 million.  London alone has 8 million.  (Tunisia has around 11 million people).

Sometimes we do get freak results.  You probably aren't going to convince an England fan about this today, but as data analysts, we have to acknowledge that sometimes the data is just anomalous (or even erroneous).  At the very least, it's not representative.

When we don't run our A/B tests for long enough, or we don't get a large enough sample of data, or we take a specific segment which is particularly small, we leave ourselves open to the problem of getting anomalous results.  We have to remember that in A/B testing, there are some visitors who will always complete a purchase (or successfully achieve a site goal) on our website, no matter how bad the experience is.  And some people will never, ever buy from us, no matter how slick and seamless our website is.  And there are some people who will have carried out days or weeks of research on our site, before we launched the test, and shortly after we start our test, they decide to purchase a top-of-the-range product with all the add-ons, bolt-ons, upgrades and so on.  And there we have it - a large, high-value order for one of our test recipes which is entirely unrelated to our test, but which sits in Recipe B's tally and gives us an almost-immediate winner.  So, make sure you know how long to run a test for.
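To see how little it takes, here's a sketch with hypothetical order values, showing one freak order flipping a small-sample revenue-per-visitor comparison:

```python
# How one unrelated high-value order can flip a small-sample comparison.
# Order values and visitor counts are hypothetical.
recipe_a_orders = [80, 95, 110, 70, 90]        # control
recipe_b_orders = [85, 75, 100, 90, 2_500]     # test, plus one freak order

visitors_per_recipe = 2_000

for name, orders in [("A", recipe_a_orders), ("B", recipe_b_orders)]:
    rpv = sum(orders) / visitors_per_recipe
    print(f"Recipe {name}: revenue per visitor £{rpv:.2f} from {len(orders)} orders")
# Strip out the £2,500 order and Recipe B's apparent win disappears entirely.
```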

The aim of a test is to nudge people from the 'probably won't buy' category into the 'probably will buy' category, and then into the 'yes, I will buy' category.  Testing is about finding the borderline cases, working out what's stopping them from buying, and then fixing that blocker.  It's not about scoring the most wins; it's about getting accurate data and putting that data into context.


Rest assured that if Panama had put half a dozen goals past England, it would widely and immediately be regarded as a freak result (that's called bias, and that's a whole other problem).


Tuesday, 19 June 2018

When Should You Switch A Test Off? (Tunisia 1 - England 2)

Another day yields another interesting and data-rich football game from the World Cup.  In this post, I'd like to look at answering the question, "When should I switch a test off?" and use the Tunisia vs England match as the basis for the discussion.


Now, I'll admit I didn't see the whole match (but I caught a lot of it on the radio and by following online updates), but even without watching it, it's possible to get a picture of the game from looking at the data, which is very intriguing.  Let's kick off with the usual stats:



The result after 90 minutes was 1-1, but it's clear from the data that this would have been a very one-sided draw, with England having most of the possession, shots and corners.  It also appears that England squandered their chances - the Tunisian goalkeeper made no saves, yet England could only get 44% of their 18 shots on target (which rather begs the question - what about the others? - and the answer is that they were blocked by defenders).  There were three minutes of stoppage time, and that's when England got their second goal.

[This example also shows the unsuitability of the horizontal bar graph as a way of representing sports data - you can't compare shot accuracy (44% vs 20% doesn't add up to 100%) and when one team has zero (bookings or saves) the bar disappears completely.  I'll fix that next time.]

So, if the game had been stopped at 90 minutes as a 1-1 draw, it's fair to say that the data indicates that England were the better team on the night and unlucky not to have won.  They had more possession and did more with it.

Comparison to A/B testing

If this were a test result and your overall KPI was flat (i.e. no winner, as in the football game), then you could look at a range of supporting metrics and determine if one of the test recipes was actually better, or if it was flat.  If you were able to do this while the test was still running, you could also take a decision on whether or not to continue with the test.

For example, if you're testing a landing page, and you determine that overall order conversion and revenue metrics are flat - no improvement for the test recipe - then you could start to look at other metrics to determine if the test recipe really has identical performance to the control recipe.  These could include bounce rate; exit rate; click-through rate; add-to-cart performance and so on.  These kind of metrics give us an indication of what would happen if we kept the test running, by answering the question: "Given time, are there any data points that would eventually trickle through to actual improvements in financial metrics?"
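In practice this can be as simple as lining the supporting metrics up side by side for control and test. A sketch with hypothetical values:

```python
# When the headline KPI is flat, line up the supporting metrics side by side.
# The values below are hypothetical.
secondary_metrics = {
    #                  control, test recipe
    "bounce rate":      (0.42, 0.37),
    "exit rate":        (0.25, 0.24),
    "click-through":    (0.08, 0.11),
    "add-to-cart rate": (0.051, 0.058),
}

for metric, (control, test) in secondary_metrics.items():
    delta = (test - control) / control
    print(f"{metric:18s} control={control:.1%}  test={test:.1%}  delta={delta:+.0%}")
```

If most of these deltas point the same way, that's your "shots on target" - an indication of which way the headline KPI would eventually move if the test kept running.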

Let's look again at the soccer match for some comparable and relevant data points:

*  Tunisia are win-less in their last 12 World Cup matches (D4 L8).  Historic data indicates that they were unlikely to win this match.

*  England had six shots on target in the first half, their most in the opening 45 minutes of a World Cup match since the 1966 semi-final against Portugal.  In this "test", England were trending positively in micro-metrics (shots on target) from the start.

*  Tunisia scored with their only shot on target in this match, their 35th-minute penalty.  Tunisia were not going to score any more goals in this game.

*  England's Kieran Trippier created six goalscoring opportunities tonight, more than any other player has managed so far in the 2018 World Cup.  "Creating goalscoring opportunities" is typically called "assists" and isn't usually measured in soccer, but it shows a very positive result for England again.

As an interesting comparison - would the Germany versus Mexico game have been different if the referee had allowed extra time?  Recall that Mexico won 1-0 in a very surprising result, and the data shows a much less one-sided game.  While Mexico were outgunned by Germany, they put up a much better set of stats than Tunisia (compare Mexico's 13 shots with Tunisia's one - which was their penalty).  So Mexico's result, while surprising, does show that they played an attacking game and deserved at least a draw, while Tunisia were overwhelmed by England (who, like Germany, should have done even better given their number of shots).

It's true that Germany were dominating the game, but weren't able to get a decent proportion of shots on target (just 33%, compared to 40% for England) and weren't able to fully shut out Mexico and score.  Additionally, the Mexico goalkeeper was having a good game and according to the data was almost unbeatable - this wasn't going to change with a few extra minutes.


Upcoming games which could be very data-rich:  Russia vs Egypt; Portugal vs Morocco.

Other articles I've written looking at data and football

Checkout Conversion:  A Penalty Shootout
When should you switch off an A/B test?
The Importance of Being Earnest with your KPIs
Should Chelsea sack Jose Mourinho? (It was a relevant question at the time, and I looked at what the data said)
How Exciting is the English Premier League?  what does the data say about goals per game?

Monday, 14 May 2018

Online Optimisation: Testing Sequences

As your online optimisation program grows and develops, it's likely that you'll progress from changing copy or images or colours, and start testing moving content around on the page - changing the order of the products that you show; moving content from the bottom of the page to the top; testing to see if you achieve greater engagement (more clicks; lower bounce rate; lower exit rate) and make more money (conversion; revenue per visitor).  A logical next step up from 'moving things around' is to test the sequence of elements in a list or on a page.  After all, there's no new content, no real design changes, but there's a lot of potential in changing the sequence of the existing content on the page.
Sequencing tests can look very simple, but there are a number of complexities to think about - and mathematically, the numbers get very large very quickly.  


As an example, here's Ford UK's cars category page, www.ford.co.uk/cars.










[The page scrolls down; I've split it into two halves and shown them side-by-side].


Testing sequences can quickly become a very mathematical process:  if you have just three items in a list, then the number of possible sequences (recipes) is six; if you have four items, there are 24 different sequences (these are permutations of the list).  Clearly, some of these will make no sense (either logically or financially), so you can cut out some of the options, but that's still going to leave you with a large number of potential sequences.  In Ford's example here, with 20 items in the list, there are 2,432,902,008,176,640,000 different options.
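Those counts are just factorials - the number of ways of ordering the full list - and are easy to reproduce:

```python
# The sequence counts above are factorials (orderings of the full list).
from math import factorial

for n in (3, 4, 20):
    print(f"{n} items -> {factorial(n):,} possible sequences")
```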

Looking at Ford, there appears to be some form of default sorting, roughly price low-to-high (and loosely by size), with a few miscellaneous models tagged onto the end (the Ford GT, for example).  At first glance, there's very little difference between many of the cars - they look very, very similar (there's no sense of scale or of the specific features of each model).

Since there are two quintillion ways of sequencing this list, we need to look at some 'normal' approaches, and there are, of course, a number of typical ways of sorting products that customers are likely to gravitate towards:  sorting by alphabetical order; sorting by price or perceived value (i.e. starting with the lower-quality products and moving to luxury); and sorting by most popular (whatever drives the most clicks or sales).  Naturally, if your products have another obvious sorting option (such as size, width, length or whatever), then this could also be worth testing.

What are the answers?  As always:  plan your test concept in advance.  Are you going to use 'standard' sorting options, such as size or price, or are you going to do something based on other metrics (such as click-through rate, revenue or page popularity)?  What are the KPIs you're going to measure?  Are you going for clicks, or revenue?  This may lead to non-standard sequences, where there's no apparent logic to the list you produce.  However, once you've decided, the number of sequences falls from quintillions to a handful, and you can start to choose the main sequences you're going to test.


For Ford: price low to high (or size large to small), popularity (sales), or grouping by model type (hatchback, saloon, off-road/SUV, sports) may also work - and that leads on to sub-categorisation and taxonomy, which I'll probably cover in an upcoming blog.