Wednesday, 16 July 2014

When to Quit Iterative Testing: Snakes and Ladders

I have blogged a few times about iterative testing, the process of using one test result to design a better test and then repeating the cycle of reviewing test data and improving the next test.  But there are instances when it's time to abandon iterative testing, and play analytical snakes and ladders instead.  Surely not?  Well, there are some situations where iterative testing is not the best tool (or not a suitable tool) to use in online optimisation, and it's time to look at other options.  I have identified three examples where iterative testing is totally unsuitable:

1.  You have optimised an area of the page so well that you're now seeing the law of diminshing returns - your online testing is showing smaller and smaller gains with each test and you're reaching the top of the ladder.
2.  The business teams have identified another part of the page or site that is a higher priority than the area you're testing on.
3.  The design teams want to test something game-changing, which is completely new and innovative.

This is no bad thing.

After all, iterative testing is not the be-all-and-end-all of online optimization.  There are other avenues that you need to explore, and I've mentioned previously the difference between iterative testing and creative testing.  I've also commented that fresh ideas from outside the testing program (typically from site managers who have sales targets to hit) are extremely valuable.  All you need to work out is how to integrate these new ideas into your overall testing strategy.  Perhaps your testing strategy is entirely focused on future-state (it's unlikely, but not impossible). Sometimes, it seems, iterative testing is less about science and hypotheses, and more like a game of snakes and ladders.

Let's take a look at the three reasons I've identified for stopping iterative testing.

1.  It's quite possible that you reach the optimal size, colour or design for a component of the page.  You've followed your analysis step by step, as you would follow a trail of clues or footsteps, and it's led you to the top of a ladder (or a dead end) and you really can't imagine any way in which the page component could be any better.  You've tested banners, and you know that a picture of a man performs better than a woman, that text should be green, the call to action button should be orange and that the best wording is "Find out more."  But perhaps you've only tested having people in your banner - you've never tried having just your product, and it's time to abandon iterative testing and leap into the unknown.  It's time to try a different ladder, even if it means sliding down a few snakes first.

2.  The business want to change focus.  They have sales performance data, or sales targets, which focus on a particular part of the catalogue:  men's running shoes; ladies' evening shoes, or high-performance digital cameras.  Business requests can change far more quickly than test strategies, and you may find yourself playing catch-up if there's a new priority for the business.  Don't forget that it's the sales team who have to maintain the site, meet the targets and maximise their performance on a daily basis, and they will be looking for you to support their team as much as plan for future state.  Where possible, transfer the lessons and general principles you've learned from previous tests to give yourself a head start in this new direction - it would be tragic if you have to slide down the snake and start right at the bottom of a new ladder.

3.  On occasions, the UX and design teams will want to try something futuristic, that exploits the capabilities of new technology (such as Scene 7 integration, AJAX, a new API, XHTML... whatever).  If the executive in charge of online sales, design or marketing has identified or sponsored a brand new online technology that will probably revolutionise your site's performance, and he or she wants to test it, then it'll probably get fast-tracked through the tesing process.  However, it's still essential to carry out due diligence in the testing process, to make sure you have a proper hypothesis and not a HIPPOthesis.  When you test the new functionality, you'll want to be able to demonstrate whether or not it's helped your website, and how and why.  You'll need to have a good hypothesis and the right KPIs in place.  Most importantly - if it doesn't do well, then everybody will want to know why, and they'll be looking to you for the answers.  If you're tracking the wrong metrics, you won't be able to answer the difficult questions.

As an example, Nike have an online sports shoe customisation option - you can choose the colour and design for your sports shoes, using an online palette and so on.  I'm guessing that it went through various forms of testing (possibly even A/B testing) and that it was approved before launch.  But which metrics would they have monitored?  Number of visitors who tried it?  Number of shoes configured?  Or possibly the most important one - how many shoes were purchased?  Is it reasonable to assume that because it's worked for Nike, that it will work for you, when you're looking to encourage users to select car trim colours, wheel style, interior material and so on?  Or are you creating something that's adding to a user's workload and making it less likely that they will actually complete the purchase?

So, be aware:  there are times when you're climbing the ladder of iterative testing that it may be more profitable to stop climbing, and try something completely different - even if it means landing on a snake!

Friday, 11 July 2014

Is Multi-Variate Testing Really That Good?

The second discussion that I led at the Digital Analytics Hub in Berlin in June was entitled, "Is Multi Variate Testing Really That Good?"  Although only a few delegates attended, it got some good participation from a range of people representing a range of analytical and digital professionals, and in this post I'll cover some of the key points.

- The number of companies using MVT is starting to increase, although it's a very slow increase and it still has only low adoption rates. It's not as widespread as perhaps the tool vendors would suggest.

- The main barriers (real or perceived) to MVT are complexity (in design and analysis) and traffic volumes (multiple recipes require large volumes of traffic in order to get meaningful results in a useful timeframe).

There is an inherent level of complexity in MVT, as I've mentioned before (and one day soon I will explain how to analyse the results) and the tool vendors seem to imply that the test design must also be complicated.  It doesn't.  I've mentioned in a previous post on MVT that sometimes the visual design of a multi-variate test does not have to be complicated, it can just involve a large number of small changes run simultaneously.   

The general view during the discussion was that MVT would have to involve a complicated design with a large number of variations per element (e.g. testing a call-to-action button in red, green, yellow, orange and blue, with five different wordings).  In my opinion, this would be complicated as an A/B/n test, so as an MVT it would be extremely complex, and, to be honest, totally unsuitable for an entry-level test.

We spent a lot of our discussion time discussing various pages and scenarios where MVT is totally unsuitable, such as site navigation.  A number of online sites have issues with large catalogues and navigation hierarchies, and it's difficult to decide how best to display the whole range of products - MVT isn't the tool to use here, we discussed card-sorting, brainstorming and visualisations instead of A/B testing.  This was one of the key lessons for me - MVT is a powerful tool, but sometimes, you don't need a powerful tool, you just need the basic one.  A power drill is never going to be good at cutting wood - a basic handsaw is the way to go.  It's all about selecting the right tool for the job.

Looking at MVT, as with all online optimisation programs, the best plan is to build up to a full MVT in stages, with initial MVT trials run as pilot experiments.  Start with something where the basic concept for testing is easy to grasp, even if the hypothesis isn't great.  The problem statement or hypothesis could be, "We believe MVT is a valuable tool and in order to use it, we're going to start with a simple pilot as a proof of concept."  And why not? :-)

Banners are a great place to start - after all, the marketing team spend a lot of money on it, and there's nothing quite as eye-catching as a screenshot of a banner in your test report documents and presentations.  They're also very easy to explain... let's try an example.  Three variables that can be considered are gender of the model (man or woman), wording of the banner text ("Buy now" vs "On Sale") and the colour of the text (black or red).

There are eight possible combinations in total; here are a few potential recipes:

Recipe A
Recipe B
Recipe C
Recipe D

Note that I've tried to keep the pictures similar - model is facing camera, smiling, with a blurred background.  This may be a multi-variate test, but I'm not planning to change everything, and I'm keeping track of what I'm changing and what's staying the same!!

Designing a test like this has considerable benefits: 
- it's easy to see what's being tested (no need to play 'spot the difference')
- you can re-use the same images for different recipes
- copywriters and merchandisers only need to come up with two lots of copy (which will be less than in an A/B/C/D test with multiple recipes).
- it's not going to take large numbers of recipes, and therefore is NOT going to require a large volume of traffic.

Some time soon, I'll explain how to analyse and understand the results from a multi-variate test, hopefully debunking the myths around how complicated it is.

Image credits: 
man  -
woman - 

Wednesday, 9 July 2014

Why Test Recipe KPIs are Vital

Imagine a straightforward A/B test, between a "red" recipe and a "yellow" recipe.  There are different nuances and aspects to the test recipes, but for the sake of simplicity the design team and the testing team just codenamed them "red" and "yellow".  The two test recipes were run against each other, and the results came back.  The data was partially analysed, and a long list of metrics was produced.  Which one is the most important?  Was it bounce rate? Exit rate? Time on page?  Does it really  matter?

Let's take a look at the data, comparing the "yellow" recipe (on the left) and the "red" recipe (on the right).


As I said, there's a large number of metrics.  And if you consider most of them, it looks like it's a fairly close-run affair.  

The yellow team on the left had
28% more shots
8.3% more shots on target
22% fewer fouls (a good result)
Similar possession (4% more, probably with moderate statistical confidence)
40% more corners
less than half the number of saves (it's debatable whether more or fewer saves is better, especially if you look at the alternative to a save)
More offsides and more yellow cards (1 vs 0).

So, by most of these metrics, the yellow team (or the yellow recipe) had a good result.  They might even have done better.

However, the main KPI for this test is not how many shots, or shots on target.  The main KPI is goals scored, and if you look at this one metric, you'll see a different picture.  The 'red' team (or recipe) achieved seven goals, compared to just one for the yellow team.

In A/B testing, it's absolutely vital to understand in advance what the KPI is.  Key Performance Indicators are exactly that:  key.  Critical.  Imperative. There should be no more than two or three KPIs and they should match closely to the test plan which in turn, should come from the original hypothesis.  If your test recipe is designed to reduce bounce rate, there is little point in measuring successful leads generated.  If you're aiming for improved conversion, why should you look at time on page?  These other metrics are not-key performance indicators for your test.

Sadly, Brazil's data on the night was not sufficient for them to win - even though many of their metrics from the game were good, they weren't the key metrics.  Maybe a different recipe is needed.