Tuesday, 11 October 2011

MVT: A Simplified Explanation of Complex Interactions

MVT WITH FRIDGE MAGNETS


My young daughter has developed a definite liking for Innocent Fruit Smoothies, which is great for the rest of us because she's guaranteed to get at least one of her five-a-day with every carton she drinks.  She and I also like the sets of magnets that come with special promotional packs; the current promotion is pictures of letters, but previously, it's been pictures of parts of different characters - heads, torsos and legs.  Looking at these yesterday, it occurred to me that mixing and matching the body parts was similar to optimising content in a multi-variate test, and also a good description of the difference between A/B and MVT.


In the same way as various parts of a web page can be changed, there are three parts of the characters that can be changed - the head, the torso and the legs, and there are a large number of different versions of each that can be used in the different areas.  


Here's the full collection that we currently have in our kitchen...


[Image: the five assembled characters, numbered 1 to 5]
Now, consider building a web page with three different components - in a similar way to building a body with the three different magnets.  If we A/B/n test each of the five combinations above, then we might get the following results for each of the different recipes.


Recipe 1:  350 points
Recipe 2:  475 points
Recipe 3:  420 points
Recipe 4:  430 points
Recipe 5:  320 points


And based on these scores, the winning recipe (or version, or whatever you'd like to call it) from our A/B tests is Recipe 2.  But then we'd go on to do separate A/B tests on the head, then the torso, and then the legs.  These show that the best performing combination is Bigfoot Head, Scarecrow Torso, Astronaut Legs:






However, this only takes the results of the separate A/B tests in isolation.  Looking at the different options we have available, we can see just by looking that there's a better combination, which is this one:


This is the difference between MVT and A/B testing:  our A/B tests would not have realised that this combination would be a winning combination because they were only looking at each test by itself.  True MVT is not a series of simultaneous A/B tests, looking to improve each page component individually.  From a mathematical and scientific standpoint, the large number of combinations or recipes that are possible all need to be tested, making sure that each possible combination is included in the test.  However, this method of testing, called "full factorial", is really not feasible, and would take a very long time before the results could be confirmed, as the performance of each and every combination has to be tested.  Instead, there are various ways of testing a smaller group of the recipes, which will enable us to obtain results for each component, and to identify the best performer - even if we don't test it.  So, we'll be able to improve our testing method from simultaneous A/B testing (which has many flaws), to something which is approaching multi-variate testing.


As an example, here are some fictitious results of an MVT test series I've run, using the fridge magnets as my examples.  I've simplified the different options from the wide range I started with (just to keep things readable and understandable).  I've got the three positions - Head, Body and Legs - and I've got three different options.  


For the head, there's Egyptian, Bigfoot and Wrestler.





For the body or torso, there's Bigfoot, Scarecrow and Wrestler.



And for the feet, there's Wrestler, Robot and Astronaut.


 

Now, three different positions with three different options for each position gives us a total of 3^3 recipes, which is 27, and this would be a "full factorial" test, with the full range of recipes being tested.  However, by carrying out some MVT, it's possible to cut this down to just six tests, and here they are, with their corresponding "scores".



Test  Head      Torso      Feet       "Score"
1     Egyptian  Bigfoot    Wrestler   355
2     Wrestler  Bigfoot    Robot      379
3     Bigfoot   Wrestler   Wrestler   498
4     Wrestler  Wrestler   Astronaut  448
5     Egyptian  Scarecrow  Astronaut  305
6     Bigfoot   Scarecrow  Robot      420


Note that each option for head, body and feet appears twice in its column, and that test 1 is the control version.  Without having to test all the versions, we can see from our results that Wrestler is clearly the best body - it featured in both of the highest scoring recipes.  Egyptian is also the weakest Head - it featured in the two lowest performing recipes.
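
For anyone who wants to check this for themselves, here is a minimal Python sketch of the same reasoning.  It counts the full factorial recipes, confirms that each option appears twice among the six tests, and averages the scores for each option.  The data is copied from the table above; the function and variable names are my own, chosen for illustration, and a simple average per option is only a rough stand-in for whatever a real MVT engine would calculate.

```python
from collections import Counter
from itertools import product

HEADS = ["Egyptian", "Bigfoot", "Wrestler"]
TORSOS = ["Bigfoot", "Scarecrow", "Wrestler"]
FEET = ["Wrestler", "Robot", "Astronaut"]

# The six tested recipes: (head, torso, feet, score), copied from the table.
TESTS = [
    ("Egyptian", "Bigfoot", "Wrestler", 355),
    ("Wrestler", "Bigfoot", "Robot", 379),
    ("Bigfoot", "Wrestler", "Wrestler", 498),
    ("Wrestler", "Wrestler", "Astronaut", 448),
    ("Egyptian", "Scarecrow", "Astronaut", 305),
    ("Bigfoot", "Scarecrow", "Robot", 420),
]


def full_factorial_size():
    """Number of recipes if every head/torso/feet combination were tested."""
    return len(list(product(HEADS, TORSOS, FEET)))


def level_counts(position):
    """How often each option appears at a position (0=head, 1=torso, 2=feet)."""
    return Counter(test[position] for test in TESTS)


def mean_scores(position):
    """Average score of the tested recipes containing each option."""
    totals, counts = Counter(), Counter()
    for test in TESTS:
        totals[test[position]] += test[3]
        counts[test[position]] += 1
    return {option: totals[option] / counts[option] for option in totals}


if __name__ == "__main__":
    print("Full factorial recipes:", full_factorial_size())  # 27
    for name, position in [("Head", 0), ("Torso", 1), ("Feet", 2)]:
        print(name, dict(level_counts(position)), mean_scores(position))
```

On these figures the Wrestler torso has the highest average of the three torsos and the Egyptian head the lowest average of the three heads, which matches what we read off the table by eye.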


A good MVT software system will be able to determine how many tests are required to cover enough recipes and measure the effectiveness of each of its tests, and attribute these to the components of the recipe, so that it can provide the winning recipe.  Some MVT software providers, including Autonomy's Optimost software, provide an element contribution report after carrying out a round of MVT, which shows how each element affects the performance of any recipe it's included in.


For those who are interested, I used the following points system in producing my results - this is my approximate 'element contribution report'.
Head            Torso            Legs
Egyptian   75   Bigfoot    138   Wrestler   141
Bigfoot   150   Scarecrow  148   Robot      122
Wrestler  100   Wrestler   105   Astronaut  180


I deliberately adjusted the totals after summing, to highlight the effect of interactions; this was to promote the scores for an all-Wrestler version (in other words, I artificially scored any interactions higher - which would not necessarily happen without testing).  The total for each recipe was adjusted by a random setting, + or - up to 5%, to provide a small air of authenticity.  The problem with an element contribution report, however, is that it ignores any possible enhancements or interactions that we might get from specific combinations of elements - I had to adjust this manually afterwards.  By testing more actual recipes, it might be possible to start to uncover some of the interactions between the variables in the test.  It may not identify them, and the system may not attribute them correctly in its results; this would mean that it may not account for them fully when it determines the 'winning' recipe.  However, it's better than the isolated A/B tests that we were carrying out at the start of this post.  
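
To make the limitation concrete, here is a small Python sketch of a purely additive model built from the points table above.  It simply sums the three element values for a recipe and picks the highest total.  The names are my own, and the raw sums are the values before any of the manual adjustments I described, so they won't match the scores in the results table exactly.

```python
from itertools import product

# Element values copied from the points table above.
CONTRIBUTIONS = {
    "head": {"Egyptian": 75, "Bigfoot": 150, "Wrestler": 100},
    "torso": {"Bigfoot": 138, "Scarecrow": 148, "Wrestler": 105},
    "legs": {"Wrestler": 141, "Robot": 122, "Astronaut": 180},
}


def additive_score(head, torso, legs):
    """Sum of the three element values, with no allowance for interactions."""
    return (
        CONTRIBUTIONS["head"][head]
        + CONTRIBUTIONS["torso"][torso]
        + CONTRIBUTIONS["legs"][legs]
    )


def best_additive_recipe():
    """The recipe with the highest additive total across all 27 combinations."""
    recipes = product(
        CONTRIBUTIONS["head"], CONTRIBUTIONS["torso"], CONTRIBUTIONS["legs"]
    )
    return max(recipes, key=lambda recipe: additive_score(*recipe))


if __name__ == "__main__":
    winner = best_additive_recipe()
    print(winner, additive_score(*winner))
    print(additive_score("Wrestler", "Wrestler", "Wrestler"))
```

The additive winner is Bigfoot Head, Scarecrow Torso, Astronaut Legs, the same combination the separate A/B tests pointed to earlier.  An all-Wrestler recipe gets a much lower additive total, because a model like this has no way of knowing that matching parts earn a bonus.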

Here are a few more fictitious results, comparing the 'actual' test results for two further recipes with the scores predicted from the few recipes we tested above.


Test  Head      Torso     Legs       Predicted "Score"  Actual "Score"
7     Wrestler  Wrestler  Wrestler   449                550
8     Bigfoot   Bigfoot   Astronaut  -                  502


Interestingly, the test results indicate that we should work on developing a test version of legs for Bigfoot.  Look at the results for tests 4 and 8.
Test  Head      Torso     Legs       Score
4     Wrestler  Wrestler  Astronaut  498
8     Bigfoot   Bigfoot   Astronaut  502


Tests 4 and 8 both have matching heads and torsos, each with the astronaut legs. In test 7, when we had a complete Wrestler, we obtained the maximum positive interaction, and achieved a bonus of 100 points.  Based on tests 4 and 8, where the scores were similar for a two-thirds body, it seems reasonable to assume that a complete Bigfoot will have a similar value to a complete Wrestler.  However, we don’t know the value of Bigfoot legs.  And worse still - or more importantly - we don’t even know what Bigfoot legs look like, which is the tricky part.  So now we really begin iterative testing.  You didn’t really think that just because we’ve moved from A/B testing to MVT, that we’d completely optimise the page with just one round of MVT, did you?  8-)


And before you ask, yes, this is very over-simplified, and yes the figures are contrived.  As I've said before, MVT is not going to fix a website by itself - it will always require some thinking time to actually look at the results and analyse them, and then proceed through the analysis - recommendation - action - test cycle.


There are various "engines" available for building and then serving the MVT recipes I've shown above (I devised the recipes and then built my table of results long-hand, which was just about manageable for a 3x3x3 test).  One of the most popular, used by a number of MVT service providers, is the Taguchi method of testing.


The Taguchi method was designed in the 1940s and 1950s by a Japanese scientist and engineer called Genichi Taguchi.  He devised a radical new way of improving manufacturing quality, which was refined and perfected in a wide range of manufacturing applications, including the Japanese car and telecom industries.  This technique, the Taguchi Method of Process Improvement, can be applied to online testing, but it doesn't work quite as effectively as it does for manufacturing.  The online environment in the 21st century is very different from the manufacturing industry.  In particular, the Taguchi method doesn't properly consider the dependencies or synergies between the different areas – the 'interactions' – and assumes that each variable can be optimised independently from the others.  


A simple definition of the interactions between variables in a test like this is that the performance of one or more parts of the test depends on what else is being shown in the other parts, so that they can’t be optimised independently from each other.  I briefly mentioned this in my previous post, where I looked at the interaction between an image and the caption that went with it – but I'm hoping that this example with the fridge magnets is a little clearer.  


Another way of putting it is by saying, "Yes, A will beat B, unless we use D instead of C.  It depends."  If the success of A over B depends on using D or C, then there's an interaction there. 
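
The "it depends" idea can be written as a quick check in Python.  This is only a sketch with made-up numbers: the scores below are hypothetical, not test results, and the function name is my own.  It compares how much A beats B when C is shown with how much A beats B when D is shown; if the two gaps differ, there's an interaction.

```python
def interaction_effect(scores):
    """Difference between A's lead over B with C and A's lead over B with D.

    `scores` maps (first element, second element) pairs to a score.
    A result of zero means A's advantage does not depend on C or D.
    """
    lead_with_c = scores[("A", "C")] - scores[("B", "C")]
    lead_with_d = scores[("A", "D")] - scores[("B", "D")]
    return lead_with_c - lead_with_d


# Hypothetical scores: A beats B alongside C, but loses to B alongside D.
WITH_INTERACTION = {("A", "C"): 120, ("B", "C"): 100, ("A", "D"): 90, ("B", "D"): 110}

# Hypothetical scores: A beats B by the same margin whichever partner is shown.
NO_INTERACTION = {("A", "C"): 120, ("B", "C"): 100, ("A", "D"): 110, ("B", "D"): 90}

if __name__ == "__main__":
    print(interaction_effect(WITH_INTERACTION))  # 40
    print(interaction_effect(NO_INTERACTION))    # 0
```

In the first set of numbers, testing A against B on its own would give a different winner depending on whether C or D happened to be on the page at the time.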


Some so-called MVT service providers don't really carry out true multi-variate testing; instead, they just carry out a range of simultaneous A/B tests and don't look at the interactions between the different page components, which leads to a sub-optimal solution.  Please don't misunderstand me: this will probably be an improvement on an untested version (unless the original had a strong positive interaction), but it's highly likely that it's not the best solution.


Other, more sophisticated providers have their own custom-built MVT engines which claim to be able to produce test recipes which will cover the full range of combinations (without having to test them all) and still be able to take interactions into consideration.  I can't comment on how effective they actually are (I've not used them, I've just read their whitepapers and their sales blurbs) but the key players, from what I've read, are:


Vertster – followers and proponents of the Taguchi method of testing


Autonomy Optimost – mentioned above – do not use Taguchi, due to its limitations


Site Tuners – aware of the various methods for testing, and cover them all, have a strong awareness of the issues of interactions (and I borrowed their images in my previous post on interactions).


In conclusion, I think it’s reasonable to say that any testing is better than none, and considered, thoughtful testing is better than just testing.  It’s not just about the tools, it’s more about the brains and the process.  By doing any form of testing – and I should say that A/B testing is not the poor relation – you are on the right path to improving your website’s performance.
