So, your test is up and running! You've identified where to test and what to test, and you are now successfully splitting traffic between your test recipes. How long do you keep the test running, and when do you call a winner? You've heard about statistical significance and confidence, but what does it actually mean?
Anil Batra has recently posted on the subject of statistical significance, and I'll be coming to his article later, but for now, I'd like to begin with an analogy.
Let us suppose that two car manufacturers, Red-Top and Blue-Bottle, have each been working on a new car design for the Formula 1 season, and each manufacturer believes that their car is the fastest at track racing. The solution to this debate seems easy enough - pit them against each other, side by side: one lap of a circuit, first one back wins. However, neither team is particularly happy with this idea - there's discussion of the optimum racing line, getting the apex of the bends right, and different acceleration profiles. It's not going to be workable.
Some bright scientist suggests a time trial: one lap, taken by each car (one after the other) and the quickest time wins. This works, up to a point. After all, the original question was, "Which car is the fastest for track racing?" and not, "Which car can go from a standing start to complete a lap quickest?" and there's a difference between the two. Eventually, everybody comes to an agreement: the cars will race and race until one of them has a clear overall lead - 10 seconds (for example), at the end of a lap. For the sake of this analogy, the cars can start at two different points on the circuit, to avoid any of the racing line issues that we mentioned before. We're also going to ignore the need to stop for fuel or new tyres, and any difference in the drivers' ability - it's just about the cars. The two cars will keep racing until there is a winner (a lead of 10 seconds) or until the adjudicators agree that neither car will accrue an advantage that large.
So, the two cars set off from their points on the circuit, and begin racing. The Red-Top car accelerates very quickly from the standing start, and soon has a 1-second lead on the Blue-Bottle. However, the Blue-Bottle has better brakes which enable it to corner better, and after 20 laps there's nothing in it. The Blue-Bottle continues to show improved performance, and after 45 laps, the Blue-Bottle has built a lead of 6.0 seconds. However, the weather changes from sunny to overcast and cloudy, and the Blue-Bottle is unable to extend its lead over the next 15 laps. The adjudicators call it a day after 60 laps total.
So, who won?
There are various ways of analysing and presenting the data, so let's take a look and work from there. The raw data for this analysis is here: Racing Car Statistical Significance Spreadsheet.
This first graph shows the lap times for each of the 60 laps:
This first graph tells the same story as the paragraphs above: laps 1-20 show no overall lead for either car; the blue car is faster from laps 20-45; then from laps 45-60 neither car gains a consistent advantage.

The second graph shows the cumulative difference between the performance of the two cars. It's not one that's often shown in online testing tools, but it's a useful way of showing which car is winning. If the red car is winning, then the time difference is negative; if the blue car is ahead, the time difference is positive, and the size of the lead is measured in seconds.
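If you wanted to build this cumulative-difference view from your own lap-by-lap (or day-by-day) data, it only takes a running total. Here's a minimal sketch in Python; the short lap-time lists are placeholder values, not the spreadsheet data:

```python
# Cumulative time difference between the two cars, lap by lap (graph 2).
# Placeholder lap times in seconds - swap in the spreadsheet data.
red_laps = [101.9, 102.3, 102.4, 102.2]
blue_laps = [102.5, 102.3, 102.1, 102.0]

cumulative_diff = []
running_diff = 0.0
for red, blue in zip(red_laps, blue_laps):
    running_diff += red - blue   # positive = blue ahead, negative = red ahead
    cumulative_diff.append(round(running_diff, 2))

print(cumulative_diff)   # -0.6, -0.6, -0.3, -0.1: red ahead early, blue catching up
```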
Graph 3, below, is a graph that you will often see (or produce) from online testing tools. It's the cumulative average report - in this case, cumulative average lap time. After each lap, the overall average lap time is calculated for all the laps completed so far. Sometimes called performance 'trend lines', these show at a glance which car has been winning, which car is winning now, and by how much. Going back to the original story, we can see how for the first 20 laps the red car is winning; at 20 laps, the red and blue lines cross (indicating a change in the lead, from red to blue); from laps 20 to 45 the gap between the two lines widens; and from laps 45 to 60 they run broadly parallel.
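The cumulative average is just as easy to produce yourself. A minimal sketch (again with placeholder lap times rather than the real data):

```python
# Cumulative average lap time (the 'trend line' in graph 3): after each
# lap, the mean of every lap completed so far. Placeholder data again.
def cumulative_average(lap_times):
    averages = []
    total = 0.0
    for lap_number, lap_time in enumerate(lap_times, start=1):
        total += lap_time
        averages.append(round(total / lap_number, 3))
    return averages

blue_laps = [102.5, 102.3, 102.1, 102.0]
print(cumulative_average(blue_laps))   # [102.5, 102.4, 102.3, 102.225]
```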
So far, so good. Graph 4, below, shows the distribution of lap times for the two cars. This is rarely seen in online testing tools, and looks better suited to the maths classroom. With this graph, it's not possible to see who was winning at any given point, but it is possible to see who was winning at the end. Importantly, this graph shows the difference in performance in a way that can be analysed mathematically to show not only which car was winning, but how confident we can be that it was a genuine win, and not a fluke. We can do this by looking at the average (mean) lap time for each car, and also at the spread of lap times. This isn't going to become a major mathematical treatment, because I'm saving that for next time :-) However, you can see here that on the whole, the blue car's lap times are faster (the blue peak is to the left, indicating a larger number of shorter lap times) but are slightly more spread out - the blue car has both the fastest and slowest times.
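If your testing tool doesn't report these figures for you, the mean and spread are straightforward to calculate. Here's a minimal sketch using Python's standard-library statistics module; the lap-time lists are placeholders standing in for the 60 laps in the spreadsheet:

```python
import statistics

# Mean and spread of lap times for each car - the numbers behind graph 4.
# Placeholder lap times in seconds; substitute the real spreadsheet data.
lap_times = {
    "Red":  [102.2, 102.4, 102.3, 102.5, 102.1],
    "Blue": [102.0, 102.4, 102.1, 102.6, 101.9],
}

for car, laps in lap_times.items():
    mean = statistics.mean(laps)
    spread = statistics.stdev(laps)   # sample standard deviation
    print(f"{car}: mean = {mean:.2f} s, standard deviation = {spread:.2f} s")
```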
The maths results are as follows:
Overall -
Red:
Average (mean) = 102.32 seconds
Standard deviation (measure of spread) = 0.21 seconds
Blue:
Average (mean) = 102.22 seconds (0.1 seconds faster per lap)
Standard deviation = 0.28 seconds (lap times are spread more widely)
Mathematically, if the average times for the cars are two or more standard deviations apart, then we can say with 99.99% confidence that the results are significant (i.e. are not due to noise, fluke or random chance). In this case, the results are only around half a standard deviation apart, so it's not possible to say that either car is really a winner.
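As a sketch of that rule of thumb in code, using the overall figures quoted above (one assumption on my part: the gap is compared against twice the larger of the two standard deviations):

```python
# Rule of thumb from the article: the gap between the two means must be at
# least two standard deviations before we call the result significant.
# Assumption: compare the gap against the larger of the two spreads.
red_mean, red_sd = 102.32, 0.21
blue_mean, blue_sd = 102.22, 0.28

gap = abs(red_mean - blue_mean)
threshold = 2 * max(red_sd, blue_sd)

if gap >= threshold:
    print(f"Significant: gap of {gap:.2f} s is at least {threshold:.2f} s")
else:
    print(f"Not significant: gap of {gap:.2f} s is below {threshold:.2f} s")
```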
But hang on, the blue car was definitely winning after 60 laps. The reason for this is its performance between laps 20 and 45, when it was consistently building a lead over the red car (before the weather changed, in our story). Let's take a look at the distribution of results for these 26 laps:
A very different story emerges. The times for both cars have a much smaller spread, and the peak for the blue distribution is much sharper (in plain English, the blue car's performance was much more consistent from lap to lap). Here are the stats for this section of the race:
Red:
Average (mean) = 102.31 seconds
Standard deviation (measure of spread) = 0.08 seconds
Blue:
Average (mean) = 102.08 seconds (0.23 seconds faster per lap)
Standard deviation = 0.11 seconds (lap times have a narrower distribution)
We can now see how the Blue car won; over the space of 26 laps, it was faster, and more consistently faster too. The difference between the two averages = 102.31 - 102.08 = 0.23 seconds, and this is over twice the standard deviation for the blue car (0.11 x 2 = 0.22). Fortunately, most online testing tools will give you a measure of the confidence in your data, so you won't have to get your spreadsheet or calculator out and start calculating standard deviations manually.
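If you do want to check the numbers yourself, the segment analysis is just a matter of slicing out laps 20-45 and re-running the same calculation. A sketch, with placeholder data standing in for the full 60 laps per car:

```python
import statistics

# Re-run the mean/spread comparison on laps 20-45 only.
# Placeholder data (60 identical lap times per car) just to show the slicing;
# the real spreadsheet data gives the 0.23 s gap vs 2 x 0.11 s quoted above.
red_laps = [102.31] * 60
blue_laps = [102.08] * 60

segment = slice(19, 45)            # laps 20-45 inclusive, zero-based indexing
red_segment = red_laps[segment]
blue_segment = blue_laps[segment]

gap = statistics.mean(red_segment) - statistics.mean(blue_segment)
blue_sd = statistics.stdev(blue_segment)
print(f"Gap = {gap:.2f} s, twice the blue standard deviation = {2 * blue_sd:.2f} s")
```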
Now, here's the question: are you prepared to call the Blue car a clear winner, based on just part of the data?
Think about this in terms of the performance of an online test between two recipes, Blue and Red. Would you have called the Red recipe a winner after 10-15 days/laps? In the same way as a car and driver need time to settle down into a race (acceleration etc), your website visitors will certainly need time to adjust to a new design (especially if you have a high proportion of repeat visitors). How long? It depends :-)
In the story, the Red car had better acceleration from the start, but the Blue car had better brakes. Maybe one of your test recipes is more appealing to first time visitors, but the other works better for repeat visitors, or another segment of your traffic. Maybe you launched the test on a Monday, and one recipe works better on weekends?
So why did the cars perform differently between laps 20-45 and laps 45-60? Laps 20-45 are 'normal' conditions, whereas after lap 45 something changed - in the racing car story, it was the weather. In the online environment, it could be a marketing campaign that you just launched, or that your competitors launched. Maybe a new product, or the start of a national holiday, school holiday, or similar? From that point onward, the performance of the Blue recipe was comparable or identical to the Red.
So, who won? The Blue car, since its performance in normal conditions was better. It took time to settle down, but in a normal environment, it's 0.23 seconds faster per lap, with 99+% confidence. Would you deploy the equivalent Blue recipe in an online environment, or do you think it's cheating to deploy a winner that is better only during normal conditions, and merely comparable to the Red recipe during campaign periods? :-)
Let's take a look at Anil Batra's post on testing and significance. It's a much briefer article than mine (I apologise for the length, and thank you for your patience), but it does explain that you shouldn't stop a test too early. The question that many people ask is: how long do you let it run for? And how do you know when you've got a winner (or whether everything has gone flat)? The short article makes a very valid point: don't stop too soon!
Next time - a closer, mathematical look at standard deviations, means and distributions, and how they can help identify a winner with confidence! In the meantime, if you're looking for a more mathematical treatment, I recommend this one from the Online Marketing Tests blog.