Wednesday, 10 September 2014

How to set up and analyse a multi-variate test

I've written at length about multi-variate tests. I've discussed barriers, complexity and design, and each time I've concluded by promising an article about how to analyse the results from a multi-variate test. This is that article.

I'm going to use the example I set up last time:  testing the components of a banner to optimise its effectiveness.  The success metric has been decided and it's click-through rate (for the sake of argument).

There are three components that are going to be tested:
- should the picture in the banner be a man or a woman?
- should the text in the banner say "On Sale!" or "Buy Now!"?
- should the text be black or red?

Here are a few example recipes from my previous post on MVT.

Recipe 1
Recipe 2
Recipe 3
Recipe 4

Recipe selection and test plan

When there are three components with two options for each, the total number of possible recipes is 2^3 = 8. However, by using MVT, we can run just four recipes and, through analysis, identify which of the eight combinations is the best (whether it is one of the four we tested or one that we didn't). We do this by looking at the effect each component has, which is often called the element contribution.
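As a quick illustration of where the eight recipes come from, here is a minimal Python sketch that enumerates every combination of the three banner elements. The names ELEMENTS and all_recipes are my own, chosen for this example, and are not part of any testing tool.

```python
from itertools import product

# The three elements under test and their two options each (banner example).
ELEMENTS = {
    "gender": ("Man", "Woman"),
    "colour": ("Red", "Black"),
    "wording": ("Sale", "Buy"),
}

def all_recipes(elements):
    """Return every combination of options as a list of dicts."""
    names = list(elements)
    return [dict(zip(names, combo)) for combo in product(*elements.values())]

recipes = all_recipes(ELEMENTS)
print(len(recipes))  # 8
for recipe in recipes:
    print(recipe)
```

Adding a fourth two-option element to ELEMENTS would double the count to 16, which is why running every recipe quickly becomes impractical.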

In order to run the multi-variate test with four recipes (instead of an A/B/n test with all eight recipes) we need to carefully select the recipes we run - we can't just pick four at random. We need to make sure that the four recipes cover each variation of each element. For example, the set of four shown above (Recipes 1-4) does not have a version with a red 'On Sale!' element, so we can't compare red against black. It is possible to run a multi-variate test to cover 2^3 combinations with just four recipes, but we'll need to be slightly more selective. Using mathematical language, the set of recipes that we use has to be orthogonal (i.e. they "point" in different directions - in geometry, 90 degrees difference - so have almost nothing in common). In IT circles, it would be called orthogonal array testing (warning: the Wikipedia entry is full of technical vocabulary).

Many tools will identify the set of recipes to test - Adobe's Test and Target does this, for example; alternatively, I'm sure that your account manager with your tool provider will be able to work with you to identify the set you need.

Here, then, is the full set of eight recipes that I could have for my MVT, followed by the four recipes that I would need to run on my site:

The full set of eight recipes
Recipe  Gender  Colour  Wording
S *     Man     Red     Sale
T       Man     Red     Buy
U       Man     Black   Sale
V *     Man     Black   Buy
W       Woman   Red     Sale
X *     Woman   Red     Buy
Y *     Woman   Black   Sale
Z       Woman   Black   Buy

The recipes marked with an asterisk (S, V, X and Y) represent one possible set of four recipes that would form a successful MVT set. There are others (for example, the four unmarked recipes form a complete set too).

An example set of four recipes that could be tested

Recipe  Gender  Colour  Wording
A       Man     Red     Sale
B       Man     Black   Buy
C       Woman   Red     Buy
D       Woman   Black   Sale

Notice that in the full set of eight recipes, each variation (man or woman, red or black, sale or buy) appears four times. In the subset of four recipes to be tested, each variation appears twice, and every pairing of variations from two different elements (for example, man with red) appears exactly once. Together, these confirm that the subset is suitable for testing.
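The suitability check described above can be sketched in a few lines of Python. This is only an illustration under my own naming (is_balanced, SUBSET and OPTIONS are hypothetical). As well as counting how often each variation appears, it checks that every pairing of variations from two different elements appears equally often, which is the stricter condition that an orthogonal set has to meet.

```python
from collections import Counter
from itertools import combinations

# Options for each element, in the order (gender, colour, wording).
OPTIONS = [("Man", "Woman"), ("Red", "Black"), ("Sale", "Buy")]

# The four recipes from the table above.
SUBSET = [
    ("Man", "Red", "Sale"),
    ("Man", "Black", "Buy"),
    ("Woman", "Red", "Buy"),
    ("Woman", "Black", "Sale"),
]

def is_balanced(recipes, options):
    """True if each option, and each pairing of options from two
    different elements, appears equally often in the recipes."""
    n = len(recipes)
    # Each option of each element must appear n / (number of options) times.
    for i, opts in enumerate(options):
        counts = Counter(r[i] for r in recipes)
        if any(counts[o] * len(opts) != n for o in opts):
            return False
    # Each pairing across two elements must appear equally often too.
    for i, j in combinations(range(len(options)), 2):
        pairs = Counter((r[i], r[j]) for r in recipes)
        cells = len(options[i]) * len(options[j])
        if any(pairs[(a, b)] * cells != n
               for a in options[i] for b in options[j]):
            return False
    return True

print(is_balanced(SUBSET, OPTIONS))  # True

# A set where each option still appears twice, but gender and colour
# always change together, so their effects cannot be separated.
CONFOUNDED = [
    ("Man", "Red", "Sale"),
    ("Man", "Red", "Buy"),
    ("Woman", "Black", "Sale"),
    ("Woman", "Black", "Buy"),
]
print(is_balanced(CONFOUNDED, OPTIONS))  # False
```

The second example is there to show why counting each variation on its own is not quite enough when choosing the four recipes.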

The visuals for the four approved test recipes are:

Recipe A
Recipe C
Recipe D

And we can see by inspection that the four recipes do indeed have two with the man, two with the woman; two with red text and two with black; two with "Buy Now!" and two with "On Sale!"

The next step is to run the test as if it were an A/B/C/D test, with one difference. It's quite possible that one or more of the four test recipes will do very badly (or very well) compared to the others, but it's highly recommended (though not essential) that you run all four recipes for the same length of time and allow them to receive equal volumes of traffic. In an MVT test run, it's important to have a large enough population of visitors for each recipe - it's not just about running until one of the four is significantly better (or worse) than the others and calling a winner.


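Let's assume that we've run the test and obtained the following data. (Note that the recipe letters here are assigned differently from the table of four recipes above: it's the same set of four combinations, but the analysis below uses the letters as they appear in this table.)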

Recipe              A        B        C        D
Gender              Man      Woman    Woman    Man
Wording             Buy Now  Buy Now  On Sale  On Sale
Colour              Black    Red      Black    Red
Impressions         1010     1014     1072     1051
Clicks              341      380      421      291
Click-through rate  34%      37%      39%      28%

It looks from these results as if the winner is Recipe C: the picture of the woman, with black text saying "On Sale!". However, there are four other recipes that we didn't test, and we can infer their relative performance by doing some judicious arithmetic with the data we have.
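For anyone who wants to reproduce the click-through rates, here is a small Python sketch using the impressions and clicks from the results table. The names RESULTS and ctr are my own, for illustration only.

```python
# (impressions, clicks) per recipe, copied from the results table.
RESULTS = {
    "A": (1010, 341),
    "B": (1014, 380),
    "C": (1072, 421),
    "D": (1051, 291),
}

def ctr(impressions, clicks):
    """Click-through rate as a fraction of impressions."""
    return clicks / impressions

rates = {name: ctr(*counts) for name, counts in RESULTS.items()}
best = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"Recipe {name}: {rate:.1%}")
print("Best tested recipe:", best)
```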

To begin with, we can identify which colour is better, black or red, by comparing the two recipes which have black text against the two recipes which have red text.

This might seem dangerous or confusing, but let's think about it. The two recipes which have black text are A and C. For recipe A, we have a man with "Buy Now!" and for recipe C, we have a woman with "On Sale!". The net result of combining recipes A and C is to isolate everybody who saw black text, with the other elements being reduced to noise (no net contribution from either element). This works logically when we compare A and C with the combination of B and D. B and D both have red text, but half have a man and half have a woman; half have "On Sale!" and half have "Buy Now!". The consequence of this is that we can isolate the effect of black text against red text - the other factors are reduced to noise.

We could think of this mathematically, using simple expressions:

A+C = (Man + Buy Now + Black) + (Woman + On Sale + Black)
A+C = Man + Woman + Buy Now + On Sale + 2xBlack

B+D = (Woman + Buy Now + Red) + (Man + On Sale + Red)
B+D = Man + Woman + Buy Now + On Sale + 2xRed

Subtracting one from the other, and cancelling like terms...
(A+C) - (B+D) = 2xBlack - 2xRed

When we compare A+C and B+D, we get this:

Recipe             A+C (black)  B+D (red)
Total impressions  2082         2065
Total clicks       762          671
CTR                36.6%        32.5%

So we can see that A+C (black) is better than B+D (red) - and we can attribute an element contribution of +12.63% to the colour black (that is, the black CTR is 12.63% higher than the red CTR in relative terms).
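The colour comparison can be reproduced with a short Python sketch. It pools impressions and clicks for the two black recipes (A and C) and the two red recipes (B and D), then expresses the difference as a relative uplift. The function name pooled_ctr is my own, for illustration.

```python
# (impressions, clicks) per recipe, copied from the results table.
RESULTS = {
    "A": (1010, 341),
    "B": (1014, 380),
    "C": (1072, 421),
    "D": (1051, 291),
}

def pooled_ctr(recipes):
    """CTR of several recipes combined: total clicks / total impressions."""
    impressions = sum(RESULTS[r][0] for r in recipes)
    clicks = sum(RESULTS[r][1] for r in recipes)
    return clicks / impressions

black = pooled_ctr(["A", "C"])  # both recipes with black text
red = pooled_ctr(["B", "D"])    # both recipes with red text
lift = black / red - 1          # relative uplift of black over red

print(f"black {black:.1%}, red {red:.1%}, uplift {lift:+.2%}")
```

The uplift is calculated from the unrounded rates, which is why it comes out at 12.63% rather than the slightly lower figure you would get by dividing 36.6 by 32.5.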

We can also do the maths to obtain the best gender and wording:

Gender: A+D = man, B+C = woman
Recipe             A+D    B+C
Total impressions  2061   2086
Total clicks       632    801
CTR                30.7%  38.4%
Result: woman is 25.2% better than man (on CTR in this test ;-) )

Wording: A+B = Buy Now, C+D = On Sale
Recipe             A+B    C+D
Total impressions  2024   2123
Total clicks       721    712
CTR                35.6%  33.5%
Result: Buy Now is 6.22% better than On Sale

Summarising our results:

Result: black is 12.63% better than red
Result: woman is 25.2% better than man
Result: Buy Now is 6.22% better than On Sale

The winner!
The winning combination is the woman with black "Buy Now!" text, which is one that we didn't actually include in our test recipes. The recommended follow-up is to test the best of the four recipes that we did test (Recipe C) against the proposed winner from the analysis we've just done. Where that isn't possible, for whatever reason, you could test your existing control design against the proposed winner. Alternatively, you could just go ahead and implement the theoretical winner without testing - it's up to you.
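Putting the three comparisons together, here is a minimal Python sketch that works out the better option for each element and checks whether the resulting combination was one of the four recipes tested. The structure and names (RECIPES, pooled_ctr_by_option, winner) are my own assumptions for illustration, not the output of any MVT tool.

```python
# The four tested recipes and their results, from the results table.
RECIPES = {
    "A": {"gender": "Man", "wording": "Buy Now", "colour": "Black",
          "impressions": 1010, "clicks": 341},
    "B": {"gender": "Woman", "wording": "Buy Now", "colour": "Red",
          "impressions": 1014, "clicks": 380},
    "C": {"gender": "Woman", "wording": "On Sale", "colour": "Black",
          "impressions": 1072, "clicks": 421},
    "D": {"gender": "Man", "wording": "On Sale", "colour": "Red",
          "impressions": 1051, "clicks": 291},
}
ELEMENTS = ("gender", "wording", "colour")

def pooled_ctr_by_option(element):
    """Pool impressions and clicks by option of one element; return CTRs."""
    totals = {}
    for recipe in RECIPES.values():
        imp, clk = totals.get(recipe[element], (0, 0))
        totals[recipe[element]] = (imp + recipe["impressions"],
                                   clk + recipe["clicks"])
    return {option: clk / imp for option, (imp, clk) in totals.items()}

winner = {}
for element in ELEMENTS:
    rates = pooled_ctr_by_option(element)
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    winner[element] = best
    uplift = rates[best] / rates[worst] - 1
    print(f"{element}: {best} is {uplift:.2%} better than {worst}")

# Was the predicted winning combination one of the recipes we ran?
tested = any(all(recipe[e] == winner[e] for e in ELEMENTS)
             for recipe in RECIPES.values())
print("Predicted winner:", winner)
print("Included in the test:", tested)
```

This only picks the better option for each element independently; it assumes the elements don't interact with each other, which is exactly why the follow-up test against the predicted winner is worth running.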

A brief note on the analysis: this shows the importance of keeping all test recipes running for an equal length of time, so that they receive approximately equal volumes of traffic. Here, recipes A, B, C and D all received around 1000 impressions. If one of them had received significantly fewer (because it was switched off early because it "wasn't performing well"), then that recipe would not have an equal weighting in the calculations where we compared the pairs of recipes, and the options it contains would appear to perform better than they actually do.
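To see the weighting effect in numbers, here is a small Python sketch. Recipe B and the full-length recipe D are taken from the results table, while D_EARLY is a made-up figure (not test data) representing recipe D being switched off after 250 impressions at roughly the same click-through rate.

```python
# (impressions, clicks). B and D_FULL come from the results table;
# D_EARLY is hypothetical: D stopped early at a similar CTR.
B = (1014, 380)
D_FULL = (1051, 291)
D_EARLY = (250, 69)  # hypothetical numbers, for illustration only

def pooled_ctr(*recipes):
    """Total clicks divided by total impressions across the recipes."""
    impressions = sum(r[0] for r in recipes)
    clicks = sum(r[1] for r in recipes)
    return clicks / impressions

red_full = pooled_ctr(B, D_FULL)    # both red recipes, equal traffic
red_early = pooled_ctr(B, D_EARLY)  # D under-weighted in the pool

print(f"red with equal traffic:    {red_full:.1%}")
print(f"red with D stopped early:  {red_early:.1%}")
```

With the weaker recipe under-represented, the pooled figure for red drifts towards recipe B's rate, so red looks stronger than it did when both recipes carried equal weight.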

I hope I've been able to show in this article (and the previous one) how it's possible to set up and analyse a multi-variate test, starting with the principles of identifying the variables you want to test, then establishing which recipes are required, and then showing how to analyse the results you obtain.


Image credits: 
man  -
woman -