Thursday, 12 March 2015

Pitfalls of Online Optimisation and Testing 3: Discontinuous Testing

Some forms of online testing are easy to set up, easy to measure and easy to interpret.  The results from one test point clearly to the next iteration, and you know exactly what's next.  For example, if you're testing the height of a banner on a page, or the size of the text that you use on your page headlines, there's a clear continuous scale from 'small' to 'medium' to 'large' to 'very large'.  You can even quantify it, in terms of pixel dimensions.  With careful testing, you can identify the optimum size for a banner, or text, or whatever it may be.  I would describe this as continuous testing, and it lends itself perfectly to iterative testing.

Some testing - in fact, most testing - is not continuous.  You could call it discrete testing, or digital testing, but I think I would call it discontinuous testing.

For example:
colours (red vs green vs black vs blue vs orange...)
title wording ("Product information" vs "Product details" vs "Details" vs "Product specification")
imagery (man vs woman vs family vs product vs product-with-family vs product alone)

Both forms of testing are, of course, perfectly valid.  The pitfall comes when trying to iterate on discontinuous tests, or trying to present results, analysis and recommendations to management.  The two forms can become confused, and unless you have a strong, clear understanding of what you were testing in the first place - and WHY you tested it - you can get sidetracked into a testing dead-end.

For example, let's say that you're testing how to show product images on your site.  There are countless ways of doing this, but let's take televisions as an example.  On top right is an image borrowed from the Argos website; below right is one from Currys/PC World. The televisions are different, but that's not relevant here; I'm just borrowing the screenfills and highlighting them as the main variable.  In 'real life' we'd test the screenfills on the same product.

Here's the basis of a straightforward A/B test - on the left, "City at Night" and on the right, "Winter Scene".  Which wins? Let's suppose for the sake of argument that the success metric is click-through rate, and "City at Night" wins.  How would you iterate on that result, and go for an even better winner?  It's not obvious, is it?  There are too many jumps between the two recipes - it's discontinuous, with no gradual change from city to forest.

The important thing here (I would suggest) is to think beforehand about why one image is likely to do better than the other, so that when you come to analyse the results, you can go back to your original ideas and determine why one image won and the other lost.  In plain English:  if you're testing "City at Night" vs "Winter Scene", then you may propose that "Winter Scene" will win because it's a natural landscape vs an urban one.  Or perhaps "City at Night" is going to win because it showcases a wider range of colours.  Setting out an idea beforehand will at least give you some guidance on how to continue.

However, this kind of testing is inherently complex - there are a number of reasons why "City at Night" might win:
- more colours shown on screen
- a city skyline looks more high-tech than a nature scene
- stronger feeling of warmth compared to the frozen (or should that be Frozen) scene

In fact, it's starting to feel like a two-recipe multi-variate test; our training in scientific testing says, "Change one thing at a time!" and yet in two images we're changing a large number of variables.  How can we unpick this mess?

I would recommend testing at least two or three test recipes against control, to help you triangulate and narrow down the possible reasons why one recipe wins and another loses. 

Shown on the right are two possible examples for a third and fourth recipe which might start to narrow down the reasons, and increase the strength of your hypothesis.  If the hypothesis is that "City at Night" did better because it was an urban scene instead of a natural scene, then "City in Daylight" (top right) may do even better.  This has to be discontinuous testing - it's not possible to vary the level of urbanisation smoothly, so we have to test discrete steps along the way in isolation.

Alternatively, if "City at Night" did better because it showcased more colours, then perhaps "Mountain View" would do better still.  And if "Mountain View" outperforms "Winter Scene", where the main difference is the apparent temperature of the scene (warm vs cold), then that suggests warmer scenes do better, and a follow-up would be a view of a Caribbean holiday resort. And there you have it - perhaps without immediately realising, the test results are now pointing towards an iteration with further potential winners.

By selecting the test recipes carefully and thoughtfully and deliberately aiming for specific changes between them, it's possible to start to quantify areas which were previously qualitative.  Here, for example, we've decided to focus (or at least try to focus) on the type of scene (natural vs urban) and on the 'warmth' of the picture, and set out a scale from frozen to warm, and from very natural to very urban.  Here's how a sketch diagram might look:

Selecting the images and plotting them in this way gives us a sense of direction for future testing.  If the city scenes both outperform the natural views, then try another urban scene which - for example - has people walking on a busy city street.  Try another recipe set in a park area - medium population density - just to check the original theory.  Alternatively, if the city scenes both perform similarly, but the mountain view is better than the winter scene (as I mentioned earlier), then try an even warmer scene - palm trees and a tropical view.

If they all perform similarly, then it's time to try a different set of axes (temperature and population density don't seem to be important here, so it's time to start brainstorming... perhaps pictures of people and racing cars are worth testing?).
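The continuum idea can be sketched in a few lines of Python.  This is only an illustration: the axis scores (0 = very natural or frozen, 10 = very urban or warm) are my own subjective placements, the click-through rates are invented placeholder numbers, and `axis_trend` is a hypothetical helper rather than part of any testing tool.

```python
# Sketch: place each recipe on two subjective axes and see which axis,
# if any, lines up with the success metric.
# All scores and click-through rates below are made-up placeholders.

recipes = {
    # name: (urban score 0-10, warmth score 0-10, click-through rate)
    "Winter Scene":     (1, 1, 0.040),
    "Mountain View":    (2, 6, 0.047),
    "City at Night":    (9, 5, 0.045),
    "City in Daylight": (8, 7, 0.049),
}

def axis_trend(recipes, axis_index):
    """Return 'up', 'down' or 'mixed' for the click-through rate as we
    move along one axis (0 = urban, 1 = warmth).
    Assumes every recipe has a distinct score on that axis."""
    ordered = sorted(recipes.values(), key=lambda r: r[axis_index])
    ctrs = [r[2] for r in ordered]
    pairs = list(zip(ctrs, ctrs[1:]))
    if all(later >= earlier for earlier, later in pairs):
        return "up"
    if all(later <= earlier for earlier, later in pairs):
        return "down"
    return "mixed"

print("urban axis: ", axis_trend(recipes, 0))
print("warmth axis:", axis_trend(recipes, 1))
```

With these placeholder numbers the warmth axis trends upwards while the urban axis is mixed, which would point towards trying an even warmer scene next.  If both axes came back mixed, that would be the cue to look for a different pair of axes.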

Let's take another example:  on-page text.  How much text is too much text, and what should you say? How should you greet users, and what headlines should you use?  Should you have lengthy paragraphs discussing your product's features, or should you keep it short and concise - bullet points with the product's main specifications?

Which is better, A or B?  And (most importantly) - why?  (Blurb borrowed and adapted from Jewson Tools)


Recipe A:

Cordless drills give you complete flexibility without compromising on power or performance.  We have a fantastic range, from leading brands such as AEG, DeWalt, Bosch, Milwaukee and Makita.  This extensive selection includes tools with various features including adjustable torque, variable speeds and impact and hammer settings. We also sell high quality cordless sets that include a variety of tools such as drills, circular saws, jigsaws and impact drivers. Our trained staff in our branches nationwide can offer expert technical advice on choosing the right cordless drill or cordless set for you.

Recipe B:

* Cordless drills give you complete flexibility without compromising on power or performance.
* We stock AEG, DeWalt, Bosch, Milwaukee and Makita
* Selection includes drills with adjustable torque, variable speeds, impact and hammer settings
* We also stock drills, circular saws, jigsaws and impact drivers
* Trained staff in all our stores, nationwide

If A were to win, would it be because of its readability?  Is B too short and abrupt?  Let's add a recipe C and triangulate again:

Recipe C:

* Cordless drills - complete flexibility
* Uncompromised performance with power
* We stock AEG, DeWalt, Bosch, Milwaukee and Makita
* Features include adjustable torque, variable speed, impact and hammer settings
* We stock a full range of power tools
* Nationwide branches with trained staff

C is now extremely short - reduced to sub-sentence bullet points.  By isolating one variable (the length of the text) we can hope to identify which is best - and why.  If C wins, then it's time to start reducing the length of your copy.  Alternatively, if A, B and C perform equally well, then it's time to take a different direction.  Each recipe here has the same content and the same tone of voice (it just says less in B and C); so perhaps it's time to add content and start to test less versus more.

Recipe D:

* Cordless drills - complete flexibility with great value
* Uncompromised performance with power
* We stock AEG, DeWalt, Bosch, Milwaukee and Makita
* Features include adjustable torque, variable speed, impact and hammer settings
* We stock a full range of power tools to suit every budget
* Nationwide branches with trained and qualified staff to help you choose the best product
* Full 30-day warranty
* Free in-store training workshop

Recipe E:

* Cordless drills provide complete flexibility
* Uncompromised performance
* We stock all makes
* Wide range of features
* Nationwide branches with trained staff

In Recipe D, the copy has been extended to include 'great value', 'suit every budget', training and warranty information - the hypothesis would be that more is more, and that customers want this kind of after-sales support.  Maybe they don't - maybe your customers are complete experts in power tools, in which case you'll see flat or negative performance.  In Recipe E, the copy has been cut to the minimum - are readers engaging with your text, or is it just there to provide context to the product imagery?  Do they already know what cordless drills are and what they do, and are they just buying another one for their team?
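If 'less versus more' is the axis, it's worth measuring where each recipe actually sits on it rather than judging by eye.  Here's a minimal Python sketch that counts the words in the C, D and E copy.  Using word count as the measure of 'amount of copy' is my assumption for illustration, and `word_count` is a hypothetical helper.

```python
import re

# Copy for three of the recipes, one string per bullet point.
recipes = {
    "C": [
        "Cordless drills - complete flexibility",
        "Uncompromised performance with power",
        "We stock AEG, DeWalt, Bosch, Milwaukee and Makita",
        "Features include adjustable torque, variable speed, impact and hammer settings",
        "We stock a full range of power tools",
        "Nationwide branches with trained staff",
    ],
    "D": [
        "Cordless drills - complete flexibility with great value",
        "Uncompromised performance with power",
        "We stock AEG, DeWalt, Bosch, Milwaukee and Makita",
        "Features include adjustable torque, variable speed, impact and hammer settings",
        "We stock a full range of power tools to suit every budget",
        "Nationwide branches with trained and qualified staff to help you choose the best product",
        "Full 30-day warranty",
        "Free in-store training workshop",
    ],
    "E": [
        "Cordless drills provide complete flexibility",
        "Uncompromised performance",
        "We stock all makes",
        "Wide range of features",
        "Nationwide branches with trained staff",
    ],
}

def word_count(bullets):
    """Count words across all bullets.

    A 'word' here is any run of letters or digits, so punctuation is
    ignored and hyphenated terms such as '30-day' count as two words."""
    return sum(len(re.findall(r"\w+", bullet)) for bullet in bullets)

counts = {name: word_count(bullets) for name, bullets in recipes.items()}

# List the recipes from most copy to least.
for name in sorted(counts, key=counts.get, reverse=True):
    print(name, counts[name])
```

If two recipes turn out to be closer in length than you intended, they aren't really testing different points on the axis, and it may be worth rewriting one of them before the test goes live.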

So, to sum up:  it's possible to apply scientific and logical thinking to discontinuous testing - the grey areas of optimisation.  I'll go for a Recipe C/E approach to my suggestions:

* Plan ahead - identify variables (list them all)
* Isolate variables as much as possible and test one or two
* Identify the differences between recipes
* Draw up a continuum on one or two axes, and plot your recipes on it
* Think about why a recipe might win, and add another recipe to test this theory (look at the continuum)
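The first three steps can be sketched in Python: describe each recipe as a set of named variables, then list exactly which variables differ between any two recipes.  The variable names and levels below are my own illustrative labels for the television screenfills (including my guess that the nature scenes are daytime shots), and `differences` is a hypothetical helper.

```python
import itertools

# Each recipe described by the variables we think might matter.
# These labels are illustrative guesses, not measured data.
recipes = {
    "Winter Scene":     {"scene": "natural", "warmth": "cold", "time": "day"},
    "Mountain View":    {"scene": "natural", "warmth": "warm", "time": "day"},
    "City at Night":    {"scene": "urban",   "warmth": "warm", "time": "night"},
    "City in Daylight": {"scene": "urban",   "warmth": "warm", "time": "day"},
}

def differences(a, b):
    """Return {variable: (level in a, level in b)} for every variable
    whose level differs between recipes a and b."""
    return {
        var: (recipes[a][var], recipes[b][var])
        for var in recipes[a]
        if recipes[a][var] != recipes[b][var]
    }

# Compare every pair and flag the ones that change only one thing.
for a, b in itertools.combinations(recipes, 2):
    diff = differences(a, b)
    note = "one variable changed" if len(diff) == 1 else "several variables changed"
    print(f"{a} vs {b}: {sorted(diff)} ({note})")
```

Pairs that differ in only one variable give the cleanest read on why a recipe won.  Pairs that differ in several, such as "City at Night" vs "Winter Scene", are the ones that need an extra recipe to triangulate.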

The previous articles in the Pitfalls of Online Optimisation and Testing series:
Article 2: So your results really are flat - why?  
Article 1:  Are your results really flat?
