Wednesday, 19 June 2013

Why is yesterday's test winner today's loser?

This post comes out of the xChange Berlin huddle which I led on 11 June 2013.  xChange is very different from most web analytics conferences - most conferences have speakers and presentations, but xChange is focused around web analytics professionals meeting and discussing in small workshop groups.  As the xChange website describes it:
"Expressly designed for enterprise analytics managers and digital marketing and measurement practitioners, X Change brings together top professionals in the field in a no-sales, all business, peer-to-peer environment for deep-dives into cutting edge online measurement topics."

At xChange Berlin 2013, I led two huddle groups - this was the first, entitled, "Why is yesterday's test winner today's loser?".  I haven't attributed the content here to any particular participant - this is just a summary of our discussions.  I should say now that the discussion was not even close to what I'd anticipated, but was even more interesting as a result!

The discussion kicked off with a review of a test win.  
Let's suppose that you have run your A/B test, and you have a winner.  You ran it for long enough to achieve statistical significance and even achieved consistent trend lines.  But somehow, when you implemented it, your financial metrics didn't show the same level of improvement as your test results.  And now, the boss has come to your desk to ask if your test was really valid.  "What happened?  Why is yesterday's test winner today's loser?"

There are a number of reasons for this - let's take a look.

External factors
Yes, A/B tests split your traffic evenly between the test recipes, so that most external factors are accounted for.  But what happens if your test was running while you had a large-scale TV campaign, or display or PPC campaign?  Yes, that traffic would have been split between your test recipes, so the effect is - apparently - mitigated.  But what if the advertising campaign resonated with your test recipe, which went on to win.  During the non-campaign period, the control recipe would be better, or perhaps the results would have been more similar.  Consequently, the uplift that you saw during the test would not be achieved in normal conditions.

Customer Experience Changes
When we start a test, there is quite often a dip in performance for the test recipe.  It's new.  It's unfamiliar and users have to become accustomed to it.  It often takes a week or so for visitors to get used to it, and for accurate, meaningful and useful test results to develop.  In particular, frequent repeat visitors will take some time to adjust to the changes (how often repeat visitors return will depend on your site).  The same issue applies when you implement a winner - now, the whole population is seeing a new design, and it will take some time for them to adjust.

Visitor Segments
Perhaps the test recipe worked especially well with a particular visitor segment?  Maybe new visitors, or search visitors, or visitors from social media, and that was responsible for the uplift.  You have assumed (one way or another) that your population profile is fairly constant.  But if you identify that your test recipe won because one or two segments really engaged with it, then you may not see the uplift if your population profile changes.  What should you do instead?  Set up a targeting implementation: target specific visitors, based on your test results, who engaged more (or converted better) with the test recipe.  Show everybody else the same version of your site as usual, but for visitors who fit into a specific segment - show them the test recipe.  I'll discuss targeting again at a later date, but here's a post I wrote a few months ago about online personalisation.

Time lapse between test win and implementation
This varied around the members of the group - where a company has a test plan, and there's a need to get a test up and running, it may not be possible to implement straight away.  It also depends on what's being tested - can the test recipe be implemented immediately through the site team or CMS, or will it require IT roadmap work?  Most of the group would use either the testing software (for example, Test and Target, or Visual Website Optimiser) and immediately set a winning recipe to 100% traffic (or 95%) until the change could be made permanently.  Setting a winning recipe to 95% instead of 100% in effect enables the test to run for longer - you can continue to show that the test recipe is winning.  It also means that visitors who were in the control group during the test (i.e. saw "Recipe A") will continue to see that recipe until the implementation is complete - better customer experience for that group?  Something to think about!

My next post will be about the second huddle that I led, which was based on iterating vs creating.  The title came from my recent blog post on iterative testing, but the discussion went in a very different direction, and again, was better for it!

No comments:

Post a Comment