If you've been presenting or listening to A/B test results (from online or offline tests) for a while, you'll probably have been asked to explain what 'confidence' or 'statistical significance' is.
A simple way of describing the measure of confidence is:
The probability (or likelihood) that this result (win or lose) will continue.
100% means this result is certain to continue; 50% means it's a 50-50 chance of winning or losing. Please note that this is just a SIMPLE way of describing confidence; it's not mathematically rigorous.
Statistical significance (or just 'significance') is achieved when the results reach a certain pre-agreed level, typically 75%, 80% or 90%.
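As a purely illustrative sketch of where such a percentage can come from: many testing tools derive a confidence figure from a two-proportion z-test on the conversion rates of the two recipes, reporting how likely it is that the observed difference isn't just noise. The function below assumes that approach; it's not necessarily how your tool (or I) calculate it, and the order and visitor numbers in the example are made up.

# Illustrative only: one common way to turn raw counts into a "confidence"
# percentage is a one-sided two-proportion z-test. This is an assumption for
# the sketch, not the exact calculation covered in the follow-up post.
from math import sqrt, erf

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def confidence(control_orders, control_visitors, variant_orders, variant_visitors):
    """Return a one-sided 'confidence' (in %) that the variant beats control."""
    p_c = control_orders / control_visitors
    p_v = variant_orders / variant_visitors
    # Pooled conversion rate and standard error of the difference
    p_pool = (control_orders + variant_orders) / (control_visitors + variant_visitors)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_visitors + 1 / variant_visitors))
    z = (p_v - p_c) / se
    return 100 * normal_cdf(z)

# Example with made-up numbers: control converts 200/10,000 (2.0%),
# variant converts 230/10,000 (2.3%)
print(f"{confidence(200, 10_000, 230, 10_000):.1f}% confident the variant is ahead")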
A note: noise and anomalous results in the early part of the test may lead you to see large wins with high confidence. You need to consider the volume of orders (or successes) and traffic in your results, and observe the daily results for your test, until you can see that the effects of these early anomalies have been reduced.
Online testers frequently ask how long a test should run for: what measures should we look at, and when is it safe to assume that the test is complete and the data is reliable? I would say that looking at confidence and at daily trends should give you a good idea.
It's infuriating, but there are occasions when more time means less conclusive results: a test can start with a clear winner, but after time the result starts to flatten out (i.e. the winning lift decreases and confidence falls). If you see this trend, then it's definitely time to switch the test off.
Conversely, you hope that you'll see flattish results initially, and then a clear winner begin to develop, with one recipe consistently outperforming the other(s). Feeding more time, more traffic and more orders into the test gives you an increasingly clear picture of the test winner; the lifts will start to stabilise and the confidence will also start to grow. So the question isn't "How long do I keep my test running?" but "How many days of consistent uplift do I look for, and what level of confidence do I require to call a recipe a winner?"
What level of confidence do I need to call a test a winner?
Note that you may have different criteria for calling a winner compared to calling a loser. I'm sure the mathematical purists will cry foul and say this sounds like cooking the books, or fiddling the results, but consider this: if you're looking for a winner that you're going to implement through additional coding (which may require an investment of time and money), then you'll probably want to be sure you've got a definite winner that will provide a return on that money. So perhaps the win criteria would be 85% confidence with at least five days of consistent positive trending.
On the other hand, if your test is losing, then every day that you keep it running is going to cost you money (after all, you're funnelling a fraction of your traffic through a sub-optimal experience). So perhaps you'll call a loser with just 75% confidence and five days of consistent under-performance. Here, the question becomes "How much is it going to cost me in immediate revenue to keep it running for another day?" and the answer is probably "Too much! Switch it off!!" This isn't a mathematical pursuit along the lines of "How much money do we need to lose to achieve our agreed confidence levels?"; this is real-life profit and loss.
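To make that asymmetric stopping rule concrete, here's a minimal sketch of how you might encode it, assuming you record a (lift, confidence) reading for the variant each day. The thresholds mirror the examples above (85% and five days for a winner, 75% and five days for a loser); they're assumptions to tune against your own cost of being wrong, not fixed recommendations.

# Minimal sketch of the asymmetric win/lose rule described above.
def call_test(daily_results, win_conf=85.0, lose_conf=75.0, run_days=5):
    """daily_results: list of (lift_percent, confidence_percent), oldest first.
    Returns 'winner', 'loser', or 'keep running'."""
    if len(daily_results) < run_days:
        return "keep running"
    recent = daily_results[-run_days:]
    latest_conf = daily_results[-1][1]
    # Winner: five consecutive days of positive lift and a higher confidence bar
    if all(lift > 0 for lift, _ in recent) and latest_conf >= win_conf:
        return "winner"
    # Loser: five consecutive days of negative lift, called at a lower bar
    # because every extra day costs real revenue
    if all(lift < 0 for lift, _ in recent) and latest_conf >= lose_conf:
        return "loser"
    return "keep running"

# Example: five days of positive lift with confidence climbing past 85%
readings = [(1.2, 62), (0.9, 70), (1.5, 78), (1.3, 84), (1.6, 88)]
print(call_test(readings))  # -> "winner"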
In a future blog post, I'll provide a more mathematical treatment of confidence, explaining how it's calculated from a statistical standpoint, so that you have a clear understanding of the foundations behind the final figures.