Wednesday, 10 September 2014

How to set up and analyse a multi-variate test

I've written at length about multi-variate tests.  I've discussed barriers, complexity and design, and each time, I've concluded by saying that I would write an article about how to analyse the results from a multi-variate test.  This is that article.

I'm going to use the example I set up last time:  testing the components of a banner to optimise its effectiveness.  The success metric has been decided and it's click-through rate (for the sake of argument).

There are three components that are going to be tested:
- should the picture in the banner be a man or a woman?
- should the text in the banner say "On Sale!" or "Buy now!"?
- should the text be black or red?

Here are a few example recipes from my previous post on MVT.


Recipe 1
Recipe 2
Recipe 3
Recipe 4

Recipe selection and test plan

When there are three components with two options for each, the total number of possible recipes is 2^3 = 8 recipes.  However, by using MVT, we can run just four recipes and through analysis identify which of the combinations is the best (whether it was one of the original four we tested, or one that we didn't test), and we do this by looking at the effect each component has.  The effect of each component is often called the element contribution.


In order to run the multi-variate test with four recipes (instead of an A/B/n test with all eight recipes) we need to carefully select the recipes we run - we can't just pick four at random.  We need to make sure that the four recipes cover each variation of each element.  For example, the set of four shown above (Recipes 1-4) does not have a version with a red 'On Sale!' element, so we can't compare red against black.  It is possible to run a multi-variate test to cover 2^3 combinations with just four recipes, but we'll need to be slightly more selective.  Using mathematical language, the set of recipes that we need to use has to be orthogonal (i.e. they "point" in different directions - in geometry, 90 degrees apart - so have almost nothing in common). In IT circles, it would be called orthogonal array testing (warning: the Wikipedia entry is full of technical vocabulary).

Many tools will identify the set of recipes to test - Adobe's Test and Target does this, for example; alternatively, I'm sure that your account manager with your tool provider will be able to work with you to identify the set you need.


Here, then, is the full set of eight recipes that I could have for my MVT, and the four recipes that I would need to run on my site:

The full set of eight recipes

Recipe  Gender  Colour  Wording
S *     Man     Red     Sale
T       Man     Red     Buy
U       Man     Black   Sale
V *     Man     Black   Buy
W       Woman   Red     Sale
X *     Woman   Red     Buy
Y *     Woman   Black   Sale
Z       Woman   Black   Buy

The recipes marked with an asterisk (S, V, X and Y) represent one possible set of four recipes that would form a successful MVT set.  There are others (for example, the unmarked recipes T, U, W and Z are a complete set too).

An example set of four recipes that could be tested successfully

Recipe  Gender  Colour  Wording
A       Man     Red     Sale
B       Man     Black   Buy
C       Woman   Red     Buy
D       Woman   Black   Sale

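Notice that in the full set of eight recipes, each variation (man or woman, red or black, sale or buy) appears four times.  In the subset of four recipes to be tested, each variation appears twice, and each pairing of variations from two different elements (man with red, woman with buy, and so on) appears exactly once.  Together, these checks confirm that the subset is suitable for testing.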
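The check described above can be sketched in a few lines of Python.  This is only an illustration, not what any testing tool does internally: the 0/1 coding of each variation and the "even parity" rule for picking the subset are my own assumptions, chosen because they happen to reproduce recipes A-D from the table.

```python
from collections import Counter
from itertools import combinations, product

# Variations for each element; the 0/1 index order is an arbitrary assumption.
ELEMENTS = {
    "Gender": ("Man", "Woman"),
    "Colour": ("Red", "Black"),
    "Wording": ("Sale", "Buy"),
}


def full_factorial():
    """All 2^3 = 8 recipes, each as a tuple of 0/1 level indices."""
    return list(product((0, 1), repeat=len(ELEMENTS)))


def half_fraction(parity):
    """The four recipes whose level indices sum to an even (0) or odd (1) number."""
    return [r for r in full_factorial() if sum(r) % 2 == parity]


def is_balanced(recipes):
    """True if every variation, and every pair of variations, appears equally often."""
    n = len(ELEMENTS)
    for i in range(n):
        counts = Counter(r[i] for r in recipes)
        if len(counts) != 2 or len(set(counts.values())) != 1:
            return False
    for i, j in combinations(range(n), 2):
        counts = Counter((r[i], r[j]) for r in recipes)
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True


def describe(recipe):
    """Translate level indices back into words, e.g. (0, 1, 1) -> Man, Black, Buy."""
    return tuple(levels[i] for levels, i in zip(ELEMENTS.values(), recipe))


tested = half_fraction(0)
for recipe in tested:
    print(describe(recipe))
print("balanced:", is_balanced(tested))
```

With this coding, the even-parity subset is exactly recipes A-D, and the odd-parity subset is the other valid set of four.  Picking, say, the four "man" recipes fails the check straight away.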

The visuals for the four approved test recipes are:


Recipe A
Recipe B
Recipe C
Recipe D

And we can see by inspection that the four recipes do indeed have two with the man, two with the woman; two with red text and two with black; two with "Buy Now!" and two with "On Sale!"


The next step is to run the test as if it were an A/B/C/D test - with one difference:  it's quite possible that one or more of the four test recipes may do very badly (or very well) compared to the others.  However, it's highly recommended (but not essential) that you run all four recipes for the same length of time, and allow them to receive equal volumes of traffic.  In an MVT test run, it's important to have a large enough population of visitors for each recipe - it's not just about running until one of the four is significantly better (or worse) than the others and calling a winner.

Analysis

Let's assume that we've run the test, and obtained the following data (these are the same four combinations as in the set above, although the letters A-D are assigned to them in a different order):

Recipe              A        B        C        D
Gender              Man      Woman    Woman    Man
Wording             Buy Now  Buy Now  On Sale  On Sale
Colour              Black    Red      Black    Red
Impressions         1010     1014     1072     1051
Clicks              341      380      421      291
Click-through rate  34%      37%      39%      28%

It looks from these results as if the winner is Recipe C; the picture of the woman, with black text saying, "On Sale!".  However, there are four other recipes that we didn't test, but we can infer their relative performance by doing some judicious arithmetic with the data we have.

To begin with, we can identify which colour is better, black or red, by comparing the two recipes which have black text against the two recipes which have red text.


This might seem dangerous or confusing, but let's think about it.  The two recipes which have black text are A and C.  For recipe A, we have a man with "Buy Now!" and for recipe C, we have a woman with "On Sale!".  The net result of combining recipe A and C is to isolate everybody who saw black text, with the other elements being reduced to noise (no net contribution from either element).  This  works logically when we compare A and C with the combination of B and D.  B and D both have red text, but half have a man and half have a woman; half have "On Sale!" and half have "Buy Now!".  The consequence of this is that we can isolate the effect of black text against red text - the other factors are reduced to noise.


We could think of this mathematically, using simple expressions:

A+C = (Man + Buy Now + Black) + (Woman + On Sale + Black)
A+C = Man + Woman + Buy Now + On Sale + 2xBlack


B+D = (Woman + Buy Now + Red) + (Man + On Sale + Red)
B+D = Man + Woman + Buy Now + On Sale + 2xRed

Subtracting one from the other, and cancelling like terms...
(A+C) - (B+D) = 2xBlack - 2xRed

When we compare A+C and B+D, we get this:

Recipe             A+C (black)  B+D (red)
Total impressions  2082         2065
Total clicks       762          671
CTR                36.6%        32.5%

So we can see that A+C (black) is better than B+D (red) - and we can attribute an element contribution of +12.63% to the colour black.
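For anyone who wants to check the arithmetic, here is a minimal sketch of the colour comparison.  The figures are the impressions and clicks from the results table; the names `results` and `pooled_ctr` are just ones I've made up for the illustration.

```python
# Results from the table above: recipe -> (colour, impressions, clicks)
results = {
    "A": ("Black", 1010, 341),
    "B": ("Red", 1014, 380),
    "C": ("Black", 1072, 421),
    "D": ("Red", 1051, 291),
}


def pooled_ctr(colour):
    """Click-through rate for all recipes of one colour, pooled together."""
    impressions = sum(i for c, i, _ in results.values() if c == colour)
    clicks = sum(k for c, _, k in results.values() if c == colour)
    return clicks / impressions


black = pooled_ctr("Black")
red = pooled_ctr("Red")
lift = black / red - 1  # relative improvement of black over red

print(f"Black {black:.1%}, Red {red:.1%}, lift {lift:.2%}")
# prints: Black 36.6%, Red 32.5%, lift 12.63%
```

Note that the element contribution is a relative figure (36.6% is 12.63% higher than 32.5%), not the difference in percentage points.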

We can also do the maths to obtain the best gender and wording:

Gender:  A+D = man, B+C = woman

Recipe             A+D (man)  B+C (woman)
Total impressions  2061       2086
Total clicks       632        801
CTR                30.7%      38.4%

Result:  woman is 25.2% better than man (on CTR in this test ;-) )

Wording: A+B = Buy Now, C+D = On Sale

Recipe             A+B (Buy Now)  C+D (On Sale)
Total impressions  2024           2123
Total clicks       721            712
CTR                35.6%          33.5%

Result:  Buy Now is 6.22% better than On Sale


Summarising our results:

Result:  black is 12.63% better than red
Result:  woman is 25.2% better than man
Result:  Buy Now is 6.22% better than On Sale

The winner!
The winning combination is the woman with black text saying "Buy Now!", which is one that we didn't actually include in our test recipes.  The recommended follow-up is to test the winning recipe from the four that we did test against the proposed winner from the analysis we've just done.  Where that isn't possible, for whatever reason, you could test your existing control design against the proposed winner.  Alternatively, you could just go ahead and implement the theoretical winner without testing - it's up to you.
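The whole analysis can be run in one go.  The sketch below uses the data from the results table; the `Recipe` class and the function name are my own inventions for illustration, and the approach (pool the recipes that share a variation, then compare the pooled click-through rates) is simply the arithmetic from this article, not a full statistical treatment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    gender: str
    wording: str
    colour: str
    impressions: int
    clicks: int


# Data from the results table above.
RESULTS = {
    "A": Recipe("Man", "Buy Now", "Black", 1010, 341),
    "B": Recipe("Woman", "Buy Now", "Red", 1014, 380),
    "C": Recipe("Woman", "On Sale", "Black", 1072, 421),
    "D": Recipe("Man", "On Sale", "Red", 1051, 291),
}


def element_contributions(results):
    """For each element, return (better variation, relative lift over the other)."""
    contributions = {}
    for element in ("gender", "wording", "colour"):
        totals = {}
        for recipe in results.values():
            level = getattr(recipe, element)
            impressions, clicks = totals.get(level, (0, 0))
            totals[level] = (impressions + recipe.impressions, clicks + recipe.clicks)
        ctr = {level: c / i for level, (i, c) in totals.items()}
        best, worst = sorted(ctr, key=ctr.get, reverse=True)
        contributions[element] = (best, ctr[best] / ctr[worst] - 1)
    return contributions


contributions = element_contributions(RESULTS)
for element, (best, lift) in contributions.items():
    print(f"{element}: {best} is {lift:.2%} better")

# The predicted winner takes the better variation of every element.
winner = {element: best for element, (best, _) in contributions.items()}
was_tested = any(
    all(getattr(recipe, element) == level for element, level in winner.items())
    for recipe in RESULTS.values()
)
print("Predicted winner:", winner, "- tested already?", was_tested)
```

The prediction assumes that the three elements act independently of each other (no interactions between, say, gender and colour), which is one reason the follow-up test is worth running.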

A brief note on the analysis:  this shows the importance of keeping all test recipes running for an equal length of time, so that they receive approximately equal volumes of traffic.  Here, recipes A, B, C and D all received around 1000 impressions, but if one of them had significantly fewer (because it was switched off early because it "wasn't performing well") then that recipe would not have an equal weighting in the calculations where we compared the pairs of recipes, and the elements it contains would appear to perform better than they actually do.
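To see the size of the effect, here is a small sketch.  The figures for recipes B and D are from the results table; the "stopped early" figures for D are entirely hypothetical (roughly a quarter of the traffic at about the same click-through rate), invented purely to illustrate the point.

```python
def pooled_ctr(recipes):
    """Combined click-through rate for a list of (impressions, clicks) pairs."""
    impressions = sum(i for i, _ in recipes)
    clicks = sum(c for _, c in recipes)
    return clicks / impressions


recipe_b = (1014, 380)       # red, from the results table
recipe_d_full = (1051, 291)  # red, from the results table
recipe_d_short = (263, 73)   # HYPOTHETICAL: D switched off after about a quarter of its traffic

balanced = pooled_ctr([recipe_b, recipe_d_full])
skewed = pooled_ctr([recipe_b, recipe_d_short])

print(f"Red, equal traffic:   {balanced:.1%}")
print(f"Red, D stopped early: {skewed:.1%}")
```

In this made-up scenario the pooled figure for red rises from about 32.5% to about 35.5%, simply because the weaker red recipe carries less weight, and most of black's apparent advantage disappears.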



I hope I've been able to show in this article (and the previous one) how it's possible to set up and analyse a multi-variate test, starting with the principles of identifying the variables you want to test, then establishing which recipes are required, and then showing how to analyse the results you obtain.


---

Image credits: 
man  - http://www.findresumetemplates.com/job-interview
woman - http://www.sheknows.com/living 

Thursday, 28 August 2014

Telling a Story with Web Analytics Data

Management demands actionable insights - not just numbers, but KPIs, words, sentences and recommendations.  It's therefore essential that we, as web analysts and optimisers, are able to transform data into words - and better still, stories.  Consider a report with too much data and too little information - it reads like a science report, not a business readout:

Consider a report concerning four main characters:
Character A: female, aged 7 years old.  Approximately 1.3 metres tall.
Character B:  male, aged 5 years old.
Character C: female, aged 4 years old.
Character D:  male, aged 1 year old.

The main items in the report are a small cottage, a 1.2 kW electric cooker, 4 pints of water, 200 grams of dried cereal and a number of assorted iron and copper vessels, weighing 50-60 grams each.

After carrying out a combination of most of the water and dried cereal, and in conjunction with the largest of the copper vessels, Character B prepared a mixture which reached around 70 degrees Celsius.  He dispensed this unevenly into three of the smaller vessels in order to enable thermal equilibrium to be formed between the mixture and its surroundings.  Characters B, C and D then walked 1.25 miles in 30 minutes, averaging just over 4 km/h.  In the interim, Character A took some empirical measurements of the chemical mixture, finding Vessel 1 to still be at a temperature close to 60 degrees Celsius, Vessel 2 to be at 70 degrees Fahrenheit and Vessel 3 to be at 315 Kelvin, which she declares to be optimal.

The report continues with Character A consuming all of the mixture in Vessel 3, then single-handedly testing (in some cases destruction testing) much of the furniture in the small cottage.

The problem is:  there's too much data and not enough information. 

The information is presented in various formats - lists, sentences and narrative.


Some of the data is completely irrelevant (the height of Character A, for example);
Some of it is misleading (the ages of the other characters lack context);
Some of it is presented in a mish-mash of units (temperatures are stated four times, with three different units).
The calculation of the speed of the walking characters is not clear - the distance is given in miles; the time is given in minutes; and the speed in kilometres per hour (if you are familiar with the abbreviation km/h).
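To make the point about units concrete, here is a minimal sketch that puts the figures from the report onto a common footing.  The conversion factors are the standard ones; the function and variable names are just mine for the example.

```python
MILES_TO_KM = 1.609344  # one international mile in kilometres


def fahrenheit_to_celsius(degrees_f):
    return (degrees_f - 32) * 5 / 9


def kelvin_to_celsius(kelvin):
    return kelvin - 273.15


# The walk: 1.25 miles in 30 minutes, expressed in km/h.
speed_kmh = (1.25 * MILES_TO_KM) / (30 / 60)

# The three vessels, all in degrees Celsius.
vessels_celsius = {
    "Vessel 1": 60.0,
    "Vessel 2": fahrenheit_to_celsius(70),
    "Vessel 3": kelvin_to_celsius(315),
}

print(f"Walking speed: {speed_kmh:.2f} km/h")
for name, temperature in vessels_celsius.items():
    print(f"{name}: {temperature:.1f} degrees Celsius")
```

Once everything is in the same units, the story is obvious: one vessel is too hot (60), one is too cold (about 21) and one is just right (about 42).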

Of course, this is an exaggeration, and as web analytics professionals, we wouldn't do this kind of thing in our reporting. 

Visitors are called visitors, and we consistently refer to them as visitors (and we ensure that this definition is understood among our readers)
Conversion rates are based on visitors, even though this may require extra calculation since our tools provide figures based on visits (or sessions)
Percentage of traffic coming from search is quoted as visitors (not called users), and not visits (whether you use visitors or visits is up to you, but be consistent)
Would you include number of users who use search?  And the conversion rate for users of search?
And when you say 'Conversion', are you consistently talking about 'user added an item to cart', or 'user completed a purchase and saw the thank-you page'?
Are you talking about the most important metrics?
 
Too much data, not enough information?
So - make sure, for starters, that your units and data and KPIs are consistent, contextual, or at least make sense. And then:  add the words to the numbers.  It's only the start to say that: "We attracted 500 visitors with paid search, at a total cost of £1,200."  Go on to talk about the cost per visitor, break it down into key details by talking about the most expensive keywords, and the ones that drove the most traffic.  But then tell the story:  there's a sequence of events between user seeing your search term, clicking on your ad, visiting your site, and [hopefully] converting.  Break it down into chronological steps and tell the story!

There are various ways to ensure that you're telling the story; my favourites are to answer these types of questions:
"You say that metric X has increased by 5%.  Is that a lot?  Is that good?"
 "WHY has this metric gone up?"
"What happened to our key site performance indicators (profit, revenue, conversion) as a result?"
and my favourite
"What should we do about it?"

Character A
There are, of course, various ways to hide the story, or disguise results that are not good (i.e. do not meet sales or revenue targets) - I did this in my anecdote at the start. However, management tend to start by looking at incomplete data, or data that's obscure or irrelevant, and then go on to ask about the data that's "missing"... the truth will out, so it's better to show the data, tell the whole story, and highlight why things are below par.

It's our role to highlight when performance is down - we should be presenting the issues (nobody else has the tools to do so) and then going on to explain what needs to be done - this is where actionable insights become invaluable.  In the end, we present the results and the recommendations and then let the management make the decision - I blogged about this some time ago - web analytics: who holds the steering wheel?

In the case of Characters A, B, C and D, I suggest that Characters B and C buy a microwave oven, and improve their security to prevent Character A from breaking into their house and stealing their breakfast.  In the case of your site, you'll need to use the data to tell the story.

Thursday, 14 August 2014

I am a power-tool A/B skeptic

I have recently enjoyed reading Peter W Szabo's article entitled, "I am an A/B testing skeptical."  Sure, it's a controversial title (especially when he shared it in the LinkedIn Group for web analysts and optimisers), but it's thought-provoking nonetheless.

And reading it has made me realise:  I am a power-drill skeptic. I've often wondered what the benefit of having the latest Black and Decker power tool might actually be.  After all, there are plenty of hand drills out there that are capable of drilling holes in wood, brick (if you're careful) and even metal sheet. The way I see it, there are five key reasons why power drills are not as good as hand-drilling (and I'm not going to discuss the safety risks of holding a high-powered electrical device in your hand, or the risks of flying dust and debris).


5.  There's no consistency in the size of hole I drill.

I can use a hand drill and by watching how hard I press and how quickly I turn the handle, I can monitor the depth and width of the hole I'm drilling.  Not so with a power drill - sometimes it flies off by itself, sometimes it drills too slowly.  I have read about this online, and I've watched some YouTube videos.  I have seen some experienced users (or professionals, or gurus, or power users) drill a hole which is 0.25 ins diameter and 3 ins deep, but when I try to use the same equipment at home, I find that my hole is much wider (especially at the end) and often deeper.  Perhaps I'm drilling into wood and they're drilling into brick? Perhaps I'm not using the same metal bits in my power drill?  Who knows?



4.  Power drill bits wear out faster.

Again, in my experience, the drill bits I use wear out more quickly with a power drill.  Perhaps leaving them on the side isn't the best place for them, especially in a damp environment.  I have found that my hand drill works fine because I keep it in my toolbox and take care of it, but having several drill bits for my power tool means I don't have space or time to keep track of them all; what happens is that I often try to drill with a power-drill bit that's worn out and a little bit rusty, and the results aren't as good as when the drill bits were new.  The drill bits I buy at Easter are always worn out and rusty by Christmas.

The professionals always seem to be using shiny new tools and bits, but not me.  But, as I said, this hasn't been a problem previously because having one hand-drill with only a small selection of bits has made it easier to keep track of them.  That's a key reason why power tools aren't for me. 


3.  Most power drills are a waste of time.

Power drills are expensive, especially when compared to the hand tool version.  They cost a lot of money, and what's the most you can do with them?  Drill holes.  Or, with careful use, screw in screws.  No, they can't measure how deep the hole should be, or how wide.  Some models claim to be able to tell you how deep your hole is while you're drilling it, but that's still pretty limited.  When I want to put up a shelf, I end up with a load of holes in a wall that I don't want, but that's possibly because I didn't think about the size of the shelf, the height I wanted it or what size of plugs I need to put into the wall to get my shelf to stay up (and remain horizontal).  Maybe I should have measured the wall better first, or something.

Measure twice, drill once.

2.  I always need more holes

As I mentioned with power drills being a waste of time, I often find that compared to the professionals I have to drill a lot more holes than usual.  They seem to have this uncanny ability to drill the holes in exactly the right places (how do they do that?) and then put their bookshelves up perfectly.  They seem to understand the tools they're using - the drill, the bits, the screws, the plugs, the wall - and yet when I try to do this with one of their new-fangled power-drills, I end up with too many holes.  I keep missing what I'm aiming for; perhaps I need more practice.  As it is, when I've finished one hole, I can often see how I could make it better and what I need to do, and get closer and closer with each of the subsequent holes I drill.  Perhaps the drill is just defective?

1.  Power drills will give you holes, but they won't necessarily be the right size

 
This pretty much sums up power drills for me, and the largest flaw that's totally inherent in power tools.  I've already said that they're only useful for drilling holes, and that the holes are often too wide, too short and in the wrong place. In some cases, when one of my team has identified that the holes are in the wrong place, they've been able to quickly suggest a better location - only to then find that that's also incorrect, and then have two wrong holes and still no way of completing my job.  It seems to me that drilling holes and putting up bookshelves (or display shelving, worse still) is something that's just best left to the professionals.  They can afford the best drill bits and the most expensive drills, and they also have the money available to make so many mistakes - it's clear to me that they must have some form of Jedi mind power, or foreknowledge of the kinds of holes they want to drill and where to drill them. 


In conclusion:

Okay, you got me, perhaps I am being a little unkind.  There are a lot of web analytics and A/B professionals out there, but there is also a large number of amateurs who want to try their hand at online testing and who get upset or confused when it doesn't work out.  Like any skilled profession, if you want to do analytics and optimisation properly, you can be sure it's harder than it looks (especially if it looks easy).  It takes time and thought to run a good test (let alone build up a testing program) and to make sure that you're hitting the target you're aiming for.  Why are you running this test?  It takes more than just the ability to split traffic between two or more designs to run a test.

Yes, I've parodied Peter W Szabo's original article, but that's because it seemed to me the easiest way to highlight some of the misconceptions that he's identified, and which exist in the wider online optimisation community - especially the ideas that 'tests will teach you useful things', and the underlying misconception that 'testing is quick and easy'.  I will briefly mention that you need a reason to run a test (just as you need a reason to drill a hole) and you need to do some analytical thinking (using other tools, not just testing tools) in the same way as you would use a spirit level, a pencil and a ruler when drilling a hole.

Drilling the hole in the wall is only one step in the process of putting up a bookshelf; splitting traffic in a test should be just one step in the optimisation process, and should be preceded by some serious thought and design work, and followed up with careful review and analysis.  Otherwise, you'll never put your shelf up straight, and your tests will never tell you anything.

Wednesday, 16 July 2014

When to Quit Iterative Testing: Snakes and Ladders



I have blogged a few times about iterative testing, the process of using one test result to design a better test and then repeating the cycle of reviewing test data and improving the next test.  But there are instances when it's time to abandon iterative testing, and play analytical snakes and ladders instead.  Surely not?  Well, there are some situations where iterative testing is not the best tool (or not a suitable tool) to use in online optimisation, and it's time to look at other options.  I have identified three examples where iterative testing is totally unsuitable:

1.  You have optimised an area of the page so well that you're now seeing the law of diminishing returns - your online testing is showing smaller and smaller gains with each test and you're reaching the top of the ladder.
2.  The business teams have identified another part of the page or site that is a higher priority than the area you're testing on.
3.  The design teams want to test something game-changing, which is completely new and innovative.

This is no bad thing.

After all, iterative testing is not the be-all-and-end-all of online optimisation.  There are other avenues that you need to explore, and I've mentioned previously the difference between iterative testing and creative testing.  I've also commented that fresh ideas from outside the testing program (typically from site managers who have sales targets to hit) are extremely valuable.  All you need to work out is how to integrate these new ideas into your overall testing strategy.  Perhaps your testing strategy is entirely focused on future-state (it's unlikely, but not impossible). Sometimes, it seems, iterative testing is less about science and hypotheses, and more like a game of snakes and ladders.

Let's take a look at the three reasons I've identified for stopping iterative testing.

1.  It's quite possible that you reach the optimal size, colour or design for a component of the page.  You've followed your analysis step by step, as you would follow a trail of clues or footsteps, and it's led you to the top of a ladder (or a dead end) and you really can't imagine any way in which the page component could be any better.  You've tested banners, and you know that a picture of a man performs better than a woman, that text should be green, the call to action button should be orange and that the best wording is "Find out more."  But perhaps you've only tested having people in your banner - you've never tried having just your product, and it's time to abandon iterative testing and leap into the unknown.  It's time to try a different ladder, even if it means sliding down a few snakes first.

2.  The business want to change focus.  They have sales performance data, or sales targets, which focus on a particular part of the catalogue:  men's running shoes; ladies' evening shoes, or high-performance digital cameras.  Business requests can change far more quickly than test strategies, and you may find yourself playing catch-up if there's a new priority for the business.  Don't forget that it's the sales team who have to maintain the site, meet the targets and maximise their performance on a daily basis, and they will be looking for you to support their team as much as plan for future state.  Where possible, transfer the lessons and general principles you've learned from previous tests to give yourself a head start in this new direction - it would be tragic if you have to slide down the snake and start right at the bottom of a new ladder.

3.  On occasions, the UX and design teams will want to try something futuristic, that exploits the capabilities of new technology (such as Scene 7 integration, AJAX, a new API, XHTML... whatever).  If the executive in charge of online sales, design or marketing has identified or sponsored a brand new online technology that will probably revolutionise your site's performance, and he or she wants to test it, then it'll probably get fast-tracked through the testing process.  However, it's still essential to carry out due diligence in the testing process, to make sure you have a proper hypothesis and not a HIPPOthesis.  When you test the new functionality, you'll want to be able to demonstrate whether or not it's helped your website, and how and why.  You'll need to have a good hypothesis and the right KPIs in place.  Most importantly - if it doesn't do well, then everybody will want to know why, and they'll be looking to you for the answers.  If you're tracking the wrong metrics, you won't be able to answer the difficult questions.

As an example, Nike have an online sports shoe customisation option - you can choose the colour and design for your sports shoes, using an online palette and so on.  I'm guessing that it went through various forms of testing (possibly even A/B testing) and that it was approved before launch.  But which metrics would they have monitored?  Number of visitors who tried it?  Number of shoes configured?  Or possibly the most important one - how many shoes were purchased?  Is it reasonable to assume that because it's worked for Nike, that it will work for you, when you're looking to encourage users to select car trim colours, wheel style, interior material and so on?  Or are you creating something that's adding to a user's workload and making it less likely that they will actually complete the purchase?

So, be aware:  there are times when you're climbing the ladder of iterative testing that it may be more profitable to stop climbing, and try something completely different - even if it means landing on a snake!

Friday, 11 July 2014

Is Multi-Variate Testing Really That Good?

The second discussion that I led at the Digital Analytics Hub in Berlin in June was entitled, "Is Multi Variate Testing Really That Good?"  Although only a few delegates attended, there was good participation from people across a range of analytical and digital professions, and in this post I'll cover some of the key points.

- The number of companies using MVT is starting to increase, although it's a very slow increase and it still has only low adoption rates. It's not as widespread as perhaps the tool vendors would suggest.

- The main barriers (real or perceived) to MVT are complexity (in design and analysis) and traffic volumes (multiple recipes require large volumes of traffic in order to get meaningful results in a useful timeframe).

There is an inherent level of complexity in MVT, as I've mentioned before (and one day soon I will explain how to analyse the results) and the tool vendors seem to imply that the test design must also be complicated.  It needn't be.  I've mentioned in a previous post on MVT that the visual design of a multi-variate test does not have to be complicated; it can just involve a large number of small changes run simultaneously.

The general view during the discussion was that MVT would have to involve a complicated design with a large number of variations per element (e.g. testing a call-to-action button in red, green, yellow, orange and blue, with five different wordings).  In my opinion, this would be complicated as an A/B/n test, so as an MVT it would be extremely complex, and, to be honest, totally unsuitable for an entry-level test.
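A quick count shows why.  This is just a sketch: the five colours are the ones mentioned above, but the five wordings are placeholders, since none were specified.

```python
from itertools import product

colours = ["red", "green", "yellow", "orange", "blue"]
wordings = [f"wording {n}" for n in range(1, 6)]  # placeholder names, not real copy

# Every colour paired with every wording: the full set of possible recipes.
recipes = list(product(colours, wordings))

print(len(recipes), "recipes")
print(f"Share of traffic per recipe with an even split: {1 / len(recipes):.0%}")
```

That's 25 recipes, each receiving just 4% of the traffic if they were all run side by side with an even split.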

We spent a lot of our discussion time discussing various pages and scenarios where MVT is totally unsuitable, such as site navigation.  A number of online sites have issues with large catalogues and navigation hierarchies, and it's difficult to decide how best to display the whole range of products - MVT isn't the tool to use here; we discussed card-sorting, brainstorming and visualisations instead of A/B testing.  This was one of the key lessons for me - MVT is a powerful tool, but sometimes, you don't need a powerful tool, you just need the basic one.  A power drill is never going to be good at cutting wood - a basic handsaw is the way to go.  It's all about selecting the right tool for the job.

Looking at MVT, as with all online optimisation programs, the best plan is to build up to a full MVT in stages, with initial MVT trials run as pilot experiments.  Start with something where the basic concept for testing is easy to grasp, even if the hypothesis isn't great.  The problem statement or hypothesis could be, "We believe MVT is a valuable tool and in order to use it, we're going to start with a simple pilot as a proof of concept."  And why not? :-)

Banners are a great place to start - after all, the marketing team spend a lot of money on them, and there's nothing quite as eye-catching as a screenshot of a banner in your test report documents and presentations.  They're also very easy to explain... let's try an example.  Three variables that can be considered are gender of the model (man or woman), wording of the banner text ("Buy now" vs "On Sale") and the colour of the text (black or red).

There are eight possible combinations in total; here are a few potential recipes:


Recipe A
Recipe B
Recipe C
Recipe D

Note that I've tried to keep the pictures similar - model is facing camera, smiling, with a blurred background.  This may be a multi-variate test, but I'm not planning to change everything, and I'm keeping track of what I'm changing and what's staying the same!!

Designing a test like this has considerable benefits: 
- it's easy to see what's being tested (no need to play 'spot the difference')
- you can re-use the same images for different recipes
- copywriters and merchandisers only need to come up with two lots of copy (which will be less than in an A/B/C/D test with multiple recipes).
- it's not going to take large numbers of recipes, and therefore is NOT going to require a large volume of traffic.

Some time soon, I'll explain how to analyse and understand the results from a multi-variate test, hopefully debunking the myths around how complicated it is.


Image credits: 
man  - http://www.findresumetemplates.com/job-interview
woman - http://www.sheknows.com/living 


Wednesday, 9 July 2014

Why Test Recipe KPIs are Vital

Imagine a straightforward A/B test, between a "red" recipe and a "yellow" recipe.  There are different nuances and aspects to the test recipes, but for the sake of simplicity the design team and the testing team just codenamed them "red" and "yellow".  The two test recipes were run against each other, and the results came back.  The data was partially analysed, and a long list of metrics was produced.  Which one is the most important?  Was it bounce rate? Exit rate? Time on page?  Does it really  matter?

Let's take a look at the data, comparing the "yellow" recipe (on the left) and the "red" recipe (on the right).

  

As I said, there's a large number of metrics.  And if you consider most of them, it looks like it's a fairly close-run affair.  

The yellow team on the left had
28% more shots
8.3% more shots on target
22% fewer fouls (a good result)
Similar possession (4% more, probably with moderate statistical confidence)
40% more corners
less than half the number of saves (it's debatable whether more or fewer saves is better, especially if you look at the alternative to a save)
More offsides and more yellow cards (1 vs 0).

So, by most of these metrics, the yellow team (or the yellow recipe) had a good result.  They might even have done better.

However, the main KPI for this test is not how many shots, or shots on target.  The main KPI is goals scored, and if you look at this one metric, you'll see a different picture.  The 'red' team (or recipe) achieved seven goals, compared to just one for the yellow team.

In A/B testing, it's absolutely vital to understand in advance what the KPI is.  Key Performance Indicators are exactly that:  key.  Critical.  Imperative. There should be no more than two or three KPIs and they should closely match the test plan which, in turn, should come from the original hypothesis.  If your test recipe is designed to reduce bounce rate, there is little point in measuring successful leads generated.  If you're aiming for improved conversion, why should you look at time on page?  These other metrics are not-key performance indicators for your test.
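
As a rough sketch of what "decide the KPI in advance" looks like in practice, the snippet below fixes the primary KPI before looking at any results and then picks the winner on that metric alone. The only figures used are the ones quoted above (goals and yellow cards); the function and variable names are illustrative assumptions.

```python
# Chosen before the test starts, and not changed once results arrive.
PRIMARY_KPI = "goals"

# Only the figures quoted in the text; every other metric is deliberately left out.
results = {
    "red": {"goals": 7, "yellow_cards": 0},
    "yellow": {"goals": 1, "yellow_cards": 1},
}

def winner(results, kpi, higher_is_better=True):
    """Return the recipe that performs best on the single named KPI."""
    pick = max if higher_is_better else min
    return pick(results, key=lambda recipe: results[recipe][kpi])

print(winner(results, PRIMARY_KPI))
```

Secondary metrics can still be reported, but they don't get a vote on which recipe wins.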

Sadly, Brazil's data on the night was not sufficient for them to win - even though many of their metrics from the game were good, they weren't the key metrics.  Maybe a different recipe is needed.

Tuesday, 24 June 2014

Why Does Average Order Value Change in Checkout Tests?

The first discussion huddle I led at the Digital Analytics Hub in 2014 looked at why average order value changes in checkout tests, and was an interesting discussion.  With such a specific title, it was not surprising that we wandered around the wider topics of checkout testing and online optimisation, and we covered a range of issues, tips, troubles and pitfalls of online testing.

But first:  the original question - why does average order value (AOV) change during a checkout test?  After all, users have completed their purchase selection, they've added all their desired items to the cart, and they're now going through the process of paying for their order.  Assuming we aren't offering upsells at this late stage, and we aren't encouraging users to continue shopping, or offering discounts, then we are only looking at whether users complete their purchase or not.  Surely any effect on order value should be just noise?

For example, if we change the wording for a call to action from 'Continue' to 'Proceed' or 'Go to payment details', then would we really expect average order value to go up or down?  Perhaps not.  But, in the light of checkout test results that show AOV differences, we need to revisit our assumptions.

After all, it's an oversimplification to say that all users are affected equally, irrespective of how much they're intending to spend.  More analysis is needed to look at conversion by basket value (cart value) to see how our test recipe has affected different users based on their cart value.  If conversion is affected equally across all price bands, then we won't see a change in AOV.  However, how likely is that?
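
Here is a minimal sketch of that conversion-by-basket-value analysis. The band thresholds, the session records and the recipe names are all made up for illustration; in practice the records would come from your analytics export.

```python
from collections import defaultdict

# Hypothetical upper limits for each price band (currency units).
BAND_LIMITS = [(25, "low"), (75, "medium"), (150, "high")]

def price_band(cart_value):
    """Map a cart value to a price band; anything above the last limit is 'very high'."""
    for upper, name in BAND_LIMITS:
        if cart_value < upper:
            return name
    return "very high"

def conversion_by_band(sessions):
    """sessions: iterable of (recipe, cart_value, purchased) tuples."""
    counts = defaultdict(lambda: [0, 0])  # (recipe, band) -> [orders, sessions]
    for recipe, cart_value, purchased in sessions:
        key = (recipe, price_band(cart_value))
        counts[key][1] += 1
        if purchased:
            counts[key][0] += 1
    return {key: orders / total for key, (orders, total) in counts.items()}

# Made-up checkout sessions, far too few for a real test.
sessions = [
    ("control", 18.0, True), ("control", 22.0, False),
    ("control", 180.0, True), ("control", 210.0, True),
    ("test", 15.0, True), ("test", 20.0, True),
    ("test", 190.0, True), ("test", 220.0, False),
]

rates = conversion_by_band(sessions)
for (recipe, band), rate in sorted(rates.items()):
    print(f"{recipe:8s} {band:10s} {rate:.0%}")
```

In this toy data the test recipe converts better in the low band and worse in the very high band, which is exactly the kind of split that moves AOV even when overall conversion barely changes.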

Another alternative:  perhaps there's no real pattern in the conversion changes, and low-price-band, mid-price-band, high-price-band and ultra-high-price-band users show a mix of increases and decreases.  In that case, any overall AOV change is just noise, and the statistical significance of the change is low.

But let's suppose that the higher price-band users don't like the test recipe, and for whatever reason, they decide to abandon.  The AOV for the test recipe will go down - the spread of orders for the test recipe is skewed to the lower price bands.  Why could this be?  We discussed various test scenarios:

- maybe the test recipe missed a security logo?  Maybe the security logo was moved to make way for a new design addition, such as a call to action for online chat - a small change but one that has had significant consequences.

- maybe the test recipe was too pushy, and users with high ticket items felt unnecessarily pressured or rushed?  Maybe we made the checkout process feel like express checkout, and we inadvertently moved users to the final page too quickly.  For low-ticket items, this isn't a problem - users want to move through with minimum fuss and feel as if they're making rapid progress.  Conversely, users who are spending a larger amount may want to feel reassured by a steady checkout process that lets them take time on each page without feeling rushed.

- sometimes we deliberately look to influence average order value - to get users to spend more, add another item to their order (perhaps it's batteries, or a bag, or the matching ear-rings, or a warranty).  No surprises there then, that average order value is influenced; sometimes it may go down, because users felt we were being too pushy.

Here's how those changes might look as conversion rates per price band, with four different scenarios:

Scenario 1:  Conversion (vertical axis) is improved uniformly across all price bands (low - very high), so we see a conversion lift and average order value is unchanged.

Scenario 2:  Conversion is decreased uniformly across all price bands; we see a conversion drop with no change in order value.

Scenario 3:  Conversion is decreased for low and medium price bands, but improved for high and very-high price bands.  Assuming equal order volumes in the baseline, this means that conversion is flat (the average is unchanged) but average order value goes up.

Scenario 4:  Conversion is improved selectively for the lowest price band, but decreases for the higher price bands.  Again, assuming there are similar order volumes (in the baseline) for each price band, this means that conversion is flat, but that average order value goes down.

There are various combinations that show conversion up/down with AOV up/down, but this is the mathematical and logical reason for the change.
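
The arithmetic behind the four scenarios can be sketched in a few lines. All the numbers here are assumptions chosen to make the sums easy: a typical order value per band, equal traffic in each band, and a 10% baseline conversion rate everywhere (which gives the equal baseline order volumes assumed above).

```python
# Hypothetical typical order value for each price band (illustrative only).
ORDER_VALUE = {"low": 20, "medium": 50, "high": 100, "very high": 200}
VISITORS_PER_BAND = 1000  # assumed equal traffic in every band

def summarise(conversion_by_band):
    """Return (overall conversion rate, average order value)."""
    orders = {band: VISITORS_PER_BAND * rate for band, rate in conversion_by_band.items()}
    total_orders = sum(orders.values())
    revenue = sum(orders[band] * ORDER_VALUE[band] for band in orders)
    total_visitors = VISITORS_PER_BAND * len(orders)
    return total_orders / total_visitors, revenue / total_orders

# Conversion rate per price band for the baseline and each scenario.
SCENARIOS = {
    "baseline":   {"low": 0.10, "medium": 0.10, "high": 0.10, "very high": 0.10},
    "scenario 1": {"low": 0.12, "medium": 0.12, "high": 0.12, "very high": 0.12},
    "scenario 2": {"low": 0.08, "medium": 0.08, "high": 0.08, "very high": 0.08},
    "scenario 3": {"low": 0.08, "medium": 0.08, "high": 0.12, "very high": 0.12},
    "scenario 4": {"low": 0.16, "medium": 0.08, "high": 0.08, "very high": 0.08},
}

for name, rates in SCENARIOS.items():
    conversion, aov = summarise(rates)
    print(f"{name}: conversion {conversion:.1%}, AOV {aov:.2f}")
```

With these made-up figures, scenarios 1 and 2 move conversion to 12% and 8% while AOV stays at 92.50.  Scenarios 3 and 4 both leave conversion at 10%, but AOV moves to 104.00 and 78.00 respectively, purely because the mix of orders has shifted between price bands.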

Explaining why this has happened, on the other hand, is a whole different story! :-)