Friday, 17 May 2013

A/B testing - how long to test for?

So, your test is up and running!  You've identified where to test and what to test, and you are now successfully splitting traffic between your test recipes.  How long do you keep the test running, and when do you call a winner?  You've heard about statistical significance and confidence, but what does it actually mean?

Anil Batra has recently posted on the subject of statistical significance, and I'll be coming to his article later, but for now, I'd like to begin with an analogy.

Let us suppose that two car manufacturers, Red-Top and Blue-Bottle, have each been working on a new car design for the Formula 1 season, and each manufacturer believes that their car is the fastest at track racing.  The solution to this debate seems easy enough - put them against each other, side-by-side - one lap of a circuit, first one back wins.  However, neither team is particularly happy with this idea - there's discussion of optimum racing line, getting the apex of the bends right, and different acceleration profiles.  It's not going to be workable.

Some bright scientist suggests a time trial:  one lap, taken by each car (one after the other) and the quickest time wins.  This works, up to a point.  After all, the original question was, "Which car is the fastest for track racing?" and not, "Which car can go from a standing start to complete a lap quickest?" and there's a difference between the two.  Eventually, everybody comes to an agreement:  the cars will race and race until one of them has a clear overall lead - 10 seconds (for example), at the end of a lap.  For the sake of  this analogy, the cars can start at two different points on the circuit, to avoid any of the racing line issues that we mentioned before.  We're also going to ignore the need to stop for fuel or new tyres, and any difference in the drivers' ability - it's just about the cars.  The two cars will keep racing until there is a winner (a lead of 10 seconds) or until the adjudicators agree that neither car will accrue an advantage that large.

So, the two cars set off from their points on the circuit, and begin racing.  The Red-Top car accelerates very quickly from the standing start, and soon has a 1-second lead on the Blue-Bottle.  However, the Blue-Bottle has better brakes which enable it to corner better, and after 20 laps there's nothing in it.  The Blue-Bottle continues to show improved performance, and after 45 laps, the Blue-Bottle has built a lead of 6.0 seconds.  However, the weather changes from sunny to overcast and cloudy, and the Blue-Bottle is unable to extend its lead over the next 15 laps.  The adjudicators call it a day after 60 laps total.

So, who won?

There are various ways of analysing and presenting the data, but let's take a look at the data and work from there.  The raw data for this analysis is here:  Racing Car Statistical Significance Spreadsheet.

This first graph shows the lap times for each of the 60 laps:

This first graph tells the same story as the paragraphs above:  laps 1-20 show no overall lead for either car; the blue car is faster from laps 20-45, then from laps 45-60 neither car gains a consistent advantage.

This second graph shows the cumulative difference between the performance of the two cars.  It's not one that's often shown in online testing tools, but it's a useful way of showing which car is winning.  If the red car is winning, then the time difference is negative; if the blue car is ahead, the time difference is positive, and the size of the lead is measured in seconds.

Graph 3, below, is a graph that you will often see (or produce) from online testing tools.  It's the cumulative average report - in this case, cumulative average lap time.  After each lap, the overall average lap time is calculated for all the laps that have been completed so far.  Sometimes called performance 'trend lines', these show at a glance a summary of which car has been winning, which car is winning now, and by how much.  Again, to go back to the original story, we can see how for the first 20 laps, the red car is winning; at 20 laps, the red and blue lines cross (indicating a change in the lead, from red to blue); from laps 20 to 45 we see the gap between the two lines widening, and then how they are broadly parallel from laps 45 to 60.

So far, so good.  Graph 4, below, shows the distribution of lap times for the two cars.  This is rarely seen in online testing tools, and looks better suited to the maths classroom.  With this graph, it's not possible to see who was winning at any given point, but it is possible to see who was winning at the end.  This graph, importantly, shows the difference in performance in a way which can be analysed mathematically to show not only which car was winning, but how confident we can be that it was a genuine win, and not a fluke.  We can do this by looking at the average (mean) lap time for each car, and also at the spread of lap times.

This isn't going to become a major mathematical treatment, because I'm saving that for next time :-)  However, you can see here that on the whole, the blue car's lap times are faster (the blue peak is to the left, indicating a larger number of shorter lap times) but are slightly more spread out - the blue car has both the fastest and slowest times.
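The cumulative views in graphs 2 and 3 can be reproduced from per-lap times with a few lines of Python.  This is only a sketch:  the function names and the six lap times are made up for illustration and are not taken from the spreadsheet.  The sign convention follows graph 2 (positive means the blue car is ahead).

```python
from itertools import accumulate


def cumulative_lead(red_times, blue_times):
    """Blue's lead in seconds after each lap (negative when red is ahead)."""
    per_lap_gain = [r - b for r, b in zip(red_times, blue_times)]
    return list(accumulate(per_lap_gain))


def cumulative_average(times):
    """Average lap time over all the laps completed so far."""
    totals = accumulate(times)
    return [total / lap for lap, total in enumerate(totals, start=1)]


# Hypothetical lap times in seconds, not the real race data.
red = [102.0, 102.2, 102.4, 102.4, 102.3, 102.3]
blue = [102.5, 102.3, 102.2, 102.1, 102.1, 102.0]

lead = cumulative_lead(red, blue)
red_trend = cumulative_average(red)
blue_trend = cumulative_average(blue)

for lap, (gap, r, b) in enumerate(zip(lead, red_trend, blue_trend), start=1):
    print(f"lap {lap}: blue lead {gap:+.1f}s, red avg {r:.2f}, blue avg {b:.2f}")
```

In this invented data the red car leads early and the blue car overtakes it later, so the lead changes sign and the two trend lines cross, much as in the story.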

The maths results are as follows:
Overall -
Red:  average (mean) = 102.32 seconds
Standard deviation (measure of spread) = 0.21 seconds

Blue:  average (mean) = 102.22 seconds (0.1 seconds faster per lap).
Standard deviation = 0.28 seconds (lap times are spread more widely)

Mathematically, if the average times for the cars are two or more standard deviations apart, then as a rule of thumb we can say with roughly 95% confidence (assuming the lap times are normally distributed) that the results are significant (i.e. are not due to noise, fluke or random chance).  In this case, the results are only around half a standard deviation apart, so it's not possible to say that either car is really a winner.
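As a rough sketch of the comparison above, the gap between the two means can be expressed in units of one standard deviation.  The helper names are hypothetical, and dividing by a single car's spread follows this post's rule of thumb, not a formal significance test.  The second helper assumes a normal distribution and reports what share of values lies within a given number of standard deviations of the mean.

```python
import math


def gap_in_standard_deviations(mean_a, mean_b, spread):
    """Distance between two means, in units of one standard deviation."""
    return abs(mean_a - mean_b) / spread


def normal_coverage(z):
    """Share of a normal distribution within z standard deviations of its mean."""
    return math.erf(z / math.sqrt(2))


# Whole-race figures quoted above (seconds); the first pair is taken
# to be the red car's, since the blue car's are listed separately.
red_mean, red_sd = 102.32, 0.21
blue_mean, blue_sd = 102.22, 0.28

gap = gap_in_standard_deviations(red_mean, blue_mean, red_sd)
print(f"gap = {gap:.2f} standard deviations")
print(f"coverage at 2 standard deviations = {normal_coverage(2):.1%}")
```

Using the blue car's larger spread as the yardstick would make the gap look smaller still, so the conclusion for the whole race does not change.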

But hang on, the blue car was definitely winning after 60 laps.  The reason for this is its performance between laps 20 and 45, when it was consistently building a lead over the red car (before the weather changed, in our story).  Let's take a look at the distribution of results for these 26 laps:

A very different story emerges.  The times for both cars have a much smaller spread, and the peak for the blue distribution is much sharper (in English, the blue car's performance was much more consistent from lap to lap).  Here are the stats for this section of the race:

Red:  average (mean) = 102.31 seconds
Standard deviation (measure of spread) = 0.08 seconds

Blue:  average (mean) = 102.08 seconds (0.23 seconds faster per lap)
Standard deviation = 0.11 seconds (lap times have a narrower distribution)

We can now see how the Blue car won; over the space of 26 laps, it was faster, and more consistently faster too.  The difference between the two averages = 102.31 - 102.08 = 0.23 seconds, and this is over twice the standard deviation for the blue car (0.11 x 2 = 0.22).  Fortunately, most online testing tools will give you a measure of the confidence in your data, so you won't have to get your spreadsheet or calculator out and start calculating standard deviations manually.
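If you do want to check a tool's numbers by hand, Python's standard library is enough.  The sketch below computes the mean and the sample standard deviation for a window of laps.  The function name is hypothetical and the lap times are invented, so the output will not match the figures above; I'm also assuming the sample (rather than population) standard deviation is the one wanted.

```python
from statistics import mean, stdev


def window_stats(lap_times, first_lap, last_lap):
    """Lap count, mean and sample standard deviation for laps
    first_lap..last_lap (1-based, inclusive)."""
    window = lap_times[first_lap - 1:last_lap]
    return len(window), mean(window), stdev(window)


# Invented lap times (seconds) for a 10-lap run, not the race data.
blue_laps = [102.6, 102.5, 102.4, 102.1, 102.0, 102.1, 102.2, 102.1, 102.3, 102.4]

count, average, spread = window_stats(blue_laps, 4, 8)
print(f"{count} laps: mean {average:.2f}s, standard deviation {spread:.2f}s")
```

Calling `window_stats(times, 20, 45)` on a full 60-lap list would pick out the 26 laps discussed above.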

Now, here's the question:  are you prepared to call the Blue car a clear winner, based on just part of the data?

Think about this in terms of the performance of an online test between two recipes, Blue and Red.  Would you have called the Red recipe a winner after 10-15 days/laps?  In the same way as a car and driver need time to settle down into a race (acceleration etc), your website visitors will certainly need time to adjust to a new design (especially if you have a high proportion of repeat visitors).  How long?  It depends :-)

In the story, the Red car had better acceleration from the start, but the Blue car had better brakes.  Maybe one of your test recipes is more appealing to first time visitors, but the other works better for repeat visitors, or another segment of your traffic.  Maybe you launched the test on a Monday, and one recipe works better on weekends?
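One way to check for this is to break the results down by segment instead of looking only at the overall figure.  The sketch below uses invented visit records and hypothetical field names; a real testing tool will export its data in its own format.

```python
from collections import defaultdict


def conversion_by_segment(visits):
    """Conversion rate for each (recipe, segment) pair.

    Each visit is a (recipe, segment, converted) tuple.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [conversions, visits]
    for recipe, segment, converted in visits:
        counts[(recipe, segment)][0] += int(converted)
        counts[(recipe, segment)][1] += 1
    return {key: conv / total for key, (conv, total) in counts.items()}


# Invented data for illustration only.
visits = (
    [("red", "new", True)] * 30 + [("red", "new", False)] * 70
    + [("red", "repeat", True)] * 20 + [("red", "repeat", False)] * 80
    + [("blue", "new", True)] * 20 + [("blue", "new", False)] * 80
    + [("blue", "repeat", True)] * 35 + [("blue", "repeat", False)] * 65
)

rates = conversion_by_segment(visits)
for (recipe, segment), rate in sorted(rates.items()):
    print(f"{recipe:<5} {segment:<7} {rate:.0%}")
```

In this made-up data, Red does better with new visitors and Blue with repeat visitors, a pattern that the overall conversion rate would hide.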

So why did the cars perform differently between laps 20-45 and 45-60?  Laps 20-45 are 'normal' conditions, whereas after lap 45, something changed, and in the racing car story, it was due to the weather.  In the online environment, it could be a marketing campaign that you just launched, or your competitors launched.  Maybe a new product, or the start of a national holiday, school holiday, or similar?  From that point onward, the performance of the Blue recipe was comparable or identical to the Red.

So, who won?  The Blue car, since its performance in normal conditions was better.  It took time to settle down, but in a normal environment, it's 0.23 seconds faster per lap, a gap of just over two standard deviations.  Would you deploy the equivalent Blue recipe in an online environment, or do you think it's cheating to deploy a winner that is better only during normal conditions, and is just comparable to the Red recipe during campaign periods?  :-)

Let's take a look at Anil Batra's post on testing and significance.  It's a much briefer article than mine (I apologise for the length, and thank you for your patience), but it does explain that you shouldn't stop a test too early.  The question that many people ask is - how long do you let it run for?  And how do you know when you've got a winner (or whether everything is turning flat)?  The short article has a very valid point:  don't stop too soon!

Next time - a closer, mathematical look at standard deviations, means and distributions, and how they can help identify a winner with confidence!  In the meantime, if you're looking for a more mathematical treatment, I recommend this one from the Online Marketing Tests blog.

Tuesday, 14 May 2013

Web Analytics and Testing: Summary so far

It's hard to believe that it's two years since I posted my first blog post on web analytics.  I'd decided to take the step of sharing a solution I'd found to a question I'd once been asked by a senior manager:  "Show me all the pages on our site which aren't getting any traffic."  It's a good question, but not one that's easy to answer, and as it happened, it was a real puzzler for me at the time, and I couldn't come up with the answer quickly enough.  Before I could devise the answer, we were already moving on to the next project.  But I did find an answer (although we never implemented it), and thought about how to share it.

Nevertheless, I decided to blog about my solution, and my first blog post was received kindly by the online community, and so I started writing more around web analytics - sporadically, to be sure - and covering online testing, which is my real area of interest.

Here's a summary of the web analytics and online testing posts that I've written over the last two years.

Pages with Zero Traffic

Here's where it all started, back in May 2011, with the problem I outlined above.  How can you identify which pages on your site aren't getting traffic, when the only tools you have are tag-based (or server-log-based), and which only fire when they are visited?

Web Analytics - Reporting, Forecasting, Testing and Analysing
What do these different terms mean in web analytics?  What's the difference between them - aren't they just the same thing?

Web Analytics - Experimenting to Test a Hypothesis
My first post dedicated entirely to testing - my main online interest.  It's okay to test - in fact, it's a great idea - but you need to know why you're testing, and what you hope to achieve from the test.  This is an introduction to testing, discussing what the point of testing should be.

Web Analytics - Who determines an actionable insight?
The drive in analytics is for actionable insights:  "The data shows this, this and this, so we should make this change on our site to improve performance."  The insight is what the data shows; the actionable part is the "we should make this change".  If you're the analyst, you may think you decide what's actionable or not, but do you?  This is a discussion around the limitations of actionability, and a reminder to focus your analysis on things that really can be actionable.

Web Analytics - What makes testing iterative?
What does iterative testing mean?  Can't you just test anything, and implement it if it wins?  Isn't all testing iterative?  This article looks at what iteration means, and how to become more successful at testing (or at least learn more) by thinking about testing as a consecutive series, not a large number of disconnected one-off events.

A/B testing - A Beginning
The basic principles of A/B testing - since I've been talking about it for some time, here's an explanation of what it does and how it works.  A convenient place to start from when going on to the next topic...

Intro To Multi Variate Testing
...and the differences between MVT and A/B.

Multi-Variate Testing
Multi Variate Testing - MVT - is a more complicated but powerful way of optimising the online experience, by changing a multitude of variables in one go.  I use a few examples to explain how it works, and how multiple variables can be changed in one test, and still provide meaningful results.  I also discuss the range of tools available in the market at the moment, and the potential drawbacks of not doing MVT correctly.

Web Analytics:  Who holds the steering wheel?
This post was inspired by a video presentation from the Omniture (Adobe) EMEA Summit in 2011.  It showed how web analytics could power your website into the future, at high speed and with great performance, like a Formula 1 racing car.  My question in response was, "Who holds the steering wheel?" I discuss how it's possible to propose improvements to a site by looking at the data and demonstrating what the uplift could be, but how it all comes down to the driver, who provides the direction and, also importantly, has his foot on the brake.

Web Analytics:  A Medical Emergency

This post starts with a discussion about a medical emergency (based on the UK TV series 'Casualty') and looks at how we, as web analysts, provide stats and KPIs to our stakeholders and managers.  Do we provide a medical readout, where all the metrics are understood by both sides (blood pressure, temperature, pulse rate...) or are we constantly finding new and wonderful metrics which aren't clearly understood and are not actionable?  If you only had 10 seconds to provide the week's KPIs to your web manager, would you be able to do it?  Which would you select, and why?

Web Analytics:  Bounce Rate Issues
Bounce rate (the number of people who exit your site after loading just one page, divided by all the people who landed on that page) is a useful but dangerous measure of page performance.  What's the target bounce rate for a page?  Does it have one?  Does it vary by segment (where is the traffic coming from? Do you have the search term?  Is it paid search or natural?)?  Whose fault is it if the bounce rate gets worse?  Why?  It's a hotly debated topic, with marketing and web content teams pointing the finger at each other.  So, whose fault is it, and how can the situation be improved?

Why are your pages getting no traffic?

Having discussed a few months earlier how to identify which pages aren't getting any traffic, this is the follow-up - why aren't your pages getting traffic?  I look at potential reasons - on-site and off-site, and technical (did somebody forget to tag the new campaign page?).

A beginner's social media strategy

Not strictly web analytics or testing, but a one-off foray into social media strategy.  It's like testing - make sure you know what the plan is before you start, or you're unlikely to be successful!

The Emerging Role of the Analyst
A post I wrote specifically for another site - hosted on my blog, but with reciprocal links to a central site where other bloggers share their thoughts on how Web Analytics, and Web Analysts in particular, are becoming more important in e-commerce.

MVT:  A simplified explanation of complex interactions

Multi Variate Testing involves making changes to a number of parts of a page, and then testing the overall result.  Each part can have two or more different versions, and this makes the maths complicated.  An additional issue occurs when one version of one part of a page interacts with (either supports or negates) another part of the page.  Sometimes there's a positive reinforcement, where the two parts work together well, either by echoing the same sales sentiment or by both showing the same product, or whatever.  Sometimes, there's a disconnect between one part and another (e.g. a headline and a picture may not work well together).  This is called an interaction - where one variable reacts with another - and I explain this in more detail.

Too Big Data

Too big to be useful?  To be informative?  It's one thing to collect a user's name, address, blood type, inside leg measurement and eye colour, but what's the point?  It all comes back to one thing:  actionable insights.

The current online political topic:  how much information are web analysts and marketers allowed to collect and use?  I start with an offline parallel and then discuss whether we're becoming overly paranoid about online data collection.

What is Direct Traffic?

After a year of not blogging about web analytics (it was a busy year), I return with an article about a topic I have thought about for a long time.  Direct traffic is described by some people as some of the best traffic you can get, but my experiences have taught me that it can be very different from the 'success of offline or word-of-mouth marketing'.  In fact, it can totally ruin your analysis - here's my view.

Testing - Iterating or Creating?
Having mentioned iterative testing before, I write here about the difference between planned iterative testing, and planned creative testing.  I explain the potential risks and rewards of creative testing (trying something completely new) versus the smaller risks and rewards of iterative testing (improving on something you tested before).

And finally...

A/B testing - where to test
This will form part of a series - I've looked at why we test, and now this is where.  I'll also be looking at how long to test for, and what to test next!

It's been a very exciting two years... and I'm looking forward to learning and then writing more about testing and analytics in the future!