Wednesday, 16 July 2014

When to Quit Iterative Testing: Snakes and Ladders



I have blogged a few times about iterative testing, the process of using one test result to design a better test and then repeating the cycle of reviewing test data and improving the next test.  But there are instances when it's time to abandon iterative testing, and play analytical snakes and ladders instead.  Surely not?  Well, there are some situations where iterative testing is not the best tool (or not a suitable tool) to use in online optimisation, and it's time to look at other options.  I have identified three examples where iterative testing is totally unsuitable:

1.  You have optimised an area of the page so well that you're now seeing the law of diminshing returns - your online testing is showing smaller and smaller gains with each test and you're reaching the top of the ladder.
2.  The business teams have identified another part of the page or site that is a higher priority than the area you're testing on.
3.  The design teams want to test something game-changing, which is completely new and innovative.

This is no bad thing.

After all, iterative testing is not the be-all-and-end-all of online optimization.  There are other avenues that you need to explore, and I've mentioned previously the difference between iterative testing and creative testing.  I've also commented that fresh ideas from outside the testing program (typically from site managers who have sales targets to hit) are extremely valuable.  All you need to work out is how to integrate these new ideas into your overall testing strategy.  Perhaps your testing strategy is entirely focused on future-state (it's unlikely, but not impossible). Sometimes, it seems, iterative testing is less about science and hypotheses, and more like a game of snakes and ladders.

Let's take a look at the three reasons I've identified for stopping iterative testing.

1.  It's quite possible that you reach the optimal size, colour or design for a component of the page.  You've followed your analysis step by step, as you would follow a trail of clues or footsteps, and it's led you to the top of a ladder (or a dead end) and you really can't imagine any way in which the page component could be any better.  You've tested banners, and you know that a picture of a man performs better than a woman, that text should be green, the call to action button should be orange and that the best wording is "Find out more."  But perhaps you've only tested having people in your banner - you've never tried having just your product, and it's time to abandon iterative testing and leap into the unknown.  It's time to try a different ladder, even if it means sliding down a few snakes first.

2.  The business want to change focus.  They have sales performance data, or sales targets, which focus on a particular part of the catalogue:  men's running shoes; ladies' evening shoes, or high-performance digital cameras.  Business requests can change far more quickly than test strategies, and you may find yourself playing catch-up if there's a new priority for the business.  Don't forget that it's the sales team who have to maintain the site, meet the targets and maximise their performance on a daily basis, and they will be looking for you to support their team as much as plan for future state.  Where possible, transfer the lessons and general principles you've learned from previous tests to give yourself a head start in this new direction - it would be tragic if you have to slide down the snake and start right at the bottom of a new ladder.

3.  On occasions, the UX and design teams will want to try something futuristic, that exploits the capabilities of new technology (such as Scene 7 integration, AJAX, a new API, XHTML... whatever).  If the executive in charge of online sales, design or marketing has identified or sponsored a brand new online technology that will probably revolutionise your site's performance, and he or she wants to test it, then it'll probably get fast-tracked through the tesing process.  However, it's still essential to carry out due diligence in the testing process, to make sure you have a proper hypothesis and not a HIPPOthesis.  When you test the new functionality, you'll want to be able to demonstrate whether or not it's helped your website, and how and why.  You'll need to have a good hypothesis and the right KPIs in place.  Most importantly - if it doesn't do well, then everybody will want to know why, and they'll be looking to you for the answers.  If you're tracking the wrong metrics, you won't be able to answer the difficult questions.

As an example, Nike have an online sports shoe customisation option - you can choose the colour and design for your sports shoes, using an online palette and so on.  I'm guessing that it went through various forms of testing (possibly even A/B testing) and that it was approved before launch.  But which metrics would they have monitored?  Number of visitors who tried it?  Number of shoes configured?  Or possibly the most important one - how many shoes were purchased?  Is it reasonable to assume that because it's worked for Nike, that it will work for you, when you're looking to encourage users to select car trim colours, wheel style, interior material and so on?  Or are you creating something that's adding to a user's workload and making it less likely that they will actually complete the purchase?

So, be aware:  there are times when you're climbing the ladder of iterative testing that it may be more profitable to stop climbing, and try something completely different - even if it means landing on a snake!

Friday, 11 July 2014

Is Multi-Variate Testing Really That Good?

The second discussion that I led at the Digital Analytics Hub in Berlin in June was entitled, "Is Multi Variate Testing Really That Good?"  Although only a few delegates attended, it got some good participation from a range of people representing a range of analytical and digital professionals, and in this post I'll cover some of the key points.

- The number of companies using MVT is starting to increase, although it's a very slow increase and it still has only low adoption rates. It's not as widespread as perhaps the tool vendors would suggest.

- The main barriers (real or perceived) to MVT are complexity (in design and analysis) and traffic volumes (multiple recipes require large volumes of traffic in order to get meaningful results in a useful timeframe).

There is an inherent level of complexity in MVT, as I've mentioned before (and one day soon I will explain how to analyse the results) and the tool vendors seem to imply that the test design must also be complicated.  It doesn't.  I've mentioned in a previous post on MVT that sometimes the visual design of a multi-variate test does not have to be complicated, it can just involve a large number of small changes run simultaneously.   

The general view during the discussion was that MVT would have to involve a complicated design with a large number of variations per element (e.g. testing a call-to-action button in red, green, yellow, orange and blue, with five different wordings).  In my opinion, this would be complicated as an A/B/n test, so as an MVT it would be extremely complex, and, to be honest, totally unsuitable for an entry-level test.

We spent a lot of our discussion time discussing various pages and scenarios where MVT is totally unsuitable, such as site navigation.  A number of online sites have issues with large catalogues and navigation hierarchies, and it's difficult to decide how best to display the whole range of products - MVT isn't the tool to use here, we discussed card-sorting, brainstorming and visualisations instead of A/B testing.  This was one of the key lessons for me - MVT is a powerful tool, but sometimes, you don't need a powerful tool, you just need the basic one.  A power drill is never going to be good at cutting wood - a basic handsaw is the way to go.  It's all about selecting the right tool for the job.

Looking at MVT, as with all online optimisation programs, the best plan is to build up to a full MVT in stages, with initial MVT trials run as pilot experiments.  Start with something where the basic concept for testing is easy to grasp, even if the hypothesis isn't great.  The problem statement or hypothesis could be, "We believe MVT is a valuable tool and in order to use it, we're going to start with a simple pilot as a proof of concept."  And why not? :-)

Banners are a great place to start - after all, the marketing team spend a lot of money on it, and there's nothing quite as eye-catching as a screenshot of a banner in your test report documents and presentations.  They're also very easy to explain... let's try an example.  Three variables that can be considered are gender of the model (man or woman), wording of the banner text ("Buy now" vs "On Sale") and the colour of the text (black or red).

There are eight possible combinations in total; here are a few potential recipes:


Recipe A
Recipe B
Recipe C
Recipe D

Note that I've tried to keep the pictures similar - model is facing camera, smiling, with a blurred background.  This may be a multi-variate test, but I'm not planning to change everything, and I'm keeping track of what I'm changing and what's staying the same!!

Designing a test like this has considerable benefits: 
- it's easy to see what's being tested (no need to play 'spot the difference')
- you can re-use the same images for different recipes
- copywriters and merchandisers only need to come up with two lots of copy (which will be less than in an A/B/C/D test with multiple recipes).
- it's not going to take large numbers of recipes, and therefore is NOT going to require a large volume of traffic.

Some time soon, I'll explain how to analyse and understand the results from a multi-variate test, hopefully debunking the myths around how complicated it is.


Image credits: 
man  - http://www.findresumetemplates.com/job-interview
woman - http://www.sheknows.com/living 


Wednesday, 9 July 2014

Why Test Recipe KPIs are Vital

Imagine a straightforward A/B test, between a "red" recipe and a "yellow" recipe.  There are different nuances and aspects to the test recipes, but for the sake of simplicity the design team and the testing team just codenamed them "red" and "yellow".  The two test recipes were run against each other, and the results came back.  The data was partially analysed, and a long list of metrics was produced.  Which one is the most important?  Was it bounce rate? Exit rate? Time on page?  Does it really  matter?

Let's take a look at the data, comparing the "yellow" recipe (on the left) and the "red" recipe (on the right).

  

As I said, there's a large number of metrics.  And if you consider most of them, it looks like it's a fairly close-run affair.  

The yellow team on the left had
28% more shots
8.3% more shots on target
22% fewer fouls (a good result)
Similar possession (4% more, probably with moderate statistical confidence)
40% more corners
less than half the number of saves (it's debatable whether more or fewer saves is better, especially if you look at the alternative to a save)
More offsides and more yellow cards (1 vs 0).

So, by most of these metrics, the yellow team (or the yellow recipe) had a good result.  They might even have done better.

However, the main KPI for this test is not how many shots, or shots on target.  The main KPI is goals scored, and if you look at this one metric, you'll see a different picture.  The 'red' team (or recipe) achieved seven goals, compared to just one for the yellow team.

In A/B testing, it's absolutely vital to understand in advance what the KPI is.  Key Performance Indicators are exactly that:  key.  Critical.  Imperative. There should be no more than two or three KPIs and they should match closely to the test plan which in turn, should come from the original hypothesis.  If your test recipe is designed to reduce bounce rate, there is little point in measuring successful leads generated.  If you're aiming for improved conversion, why should you look at time on page?  These other metrics are not-key performance indicators for your test.

Sadly, Brazil's data on the night was not sufficient for them to win - even though many of their metrics from the game were good, they weren't the key metrics.  Maybe a different recipe is needed.

Tuesday, 24 June 2014

Why Does Average Order Value Change in Checkout Tests?

The first discussion huddle I led at the Digital Analytics Hub in 2014 looked at why average order value changes in checkout tests, and was an interesting discussion.  With such a specific title, it was not surprising that we wandered around the wider topics of checkout testing and online optimisation, and we covered a range of issues, tips, troubles and pitfalls of online testing.

But first:  the original question - why does average order value (AOV) change during a checkout test?  After all, users have completed their purchase selection, they've added all their desired items to the cart, and they're now going through the process of paying for their order.  Assuming we aren't offering upsells at this late stage, and we aren't encouraging users to continue shopping, or offering discounts, then we are only looking at whether users complete their purchase or not.  Surely any effect on order value should be just noise?

For example, if we change the wording for a call to action from 'Continue' to 'Proceed' or 'Go to payment details', then would we really expect average order value to go up or down?  Perhaps not.  But, in the light of checkout test results that show AOV differences, we need to revisit our assumptions.

After all, it's an oversimplification to say that all users are affected equally, irrespective of how much they're intending to spend.  More analysis is needed to look at conversion by basket value (cart value) to see how our test recipe has affected different users based on their cart value.  If conversion is affected equally across all price bands, then we won't see a change in AOV.  However, how likely is that?

Other alternatives:  perhaps there's no real pattern in conversion changes:  low-price-band, mid-price-band, high-price-band and ultra-high-price-band users show a mix of increases and decreases.  Any overall AOV change is just noise, and the statistical significance of the change is low.

But let's suppose that the higher price-band users don't like the test recipe, and for whatever reason, they decide to abandon.  The AOV for the test recipe will go down - the spread of orders for the test recipe is skewed to the lower price bands.  Why could this be?  We discussed various test scenarios:

- maybe the test recipe missed a security logo?  Maybe the security logo was moved to make way for a new design addition - a call to action, or a CTA for online chat - a small change but one that has had significant consequences.

- maybe the test recipe was too pushy, and users with high ticket items felt unnecessarily pressured or rushed?  Maybe we made the checkout process feel like express checkout, and we inadvertantly moved users to the final page too quickly.  For low-ticket items, this isn't a problem - users want to move through with minimum fuss and feel as if they're making rapid progress.  Conversely, users who are spending a larger amount want to feel reassured by a steady checkout process which allows the user to take time on each page without feeling rushed?

- sometimes we deliberately look to influence average order value - to get users to spend more, add another item to their order (perhaps it's batteries, or a bag, or the matching ear-rings, or a warranty).  No surprises there then, that average order value is influenced; sometimes it may go down, because users felt we were being too pushy.

Here's how those changes might look as conversion rates per price band, with four different scenarios:

Scenario 1:  Conversion (vertical axis) is improved uniformly across all price bands (low - very high), so we see a conversion lift and average order value is unchanged.

Scenario 2:  Conversion is decreased uniformly across all price bands; we see a conversion drop with no change in order value.

Scenario 3:  Conversion is decreased for low and medium price bands, but improved for high and very-high price bands.  Assuming equal order volumes in the baseline, this means that conversion is flat (the average is unchanged) but average order value goes up.

Scenario 4:  Conversion is improved selectively for the lowest price band, but decreases for the higher price bands.  Again, assuming there are similar order volumes (in the baseline) for each price band, this means that conversion is flat, but that average order value goes down.

There are various combinations that show conversion up/down with AOV up/down, but this is the mathematical and logical reason for the change.

Explaining why this has happened, on the other hand, is a whole different story! :-)

Friday, 30 May 2014

Digital Analytics Hub 2014 - Preview

Next week, I'll be returning to the Digital Analytics Hub in Berlin, to once again lead discussion groups on online testing.  Last year, I led discussions on "Why does yesterday's winner become today's loser?" and "Risk and Reward: Iterative vs Creative Testing."  I've been invited to return this year, and I'll be discussing "Why does Average Order Value change in checkout tests?" and "Is MVT really all that great?" - the second one based on my recent blog post asking if multi-variate testing is an online panacea. I'm looking forward to catching up with some of the friends I made last year, and to meeting some new friends in the world of online analytics.

An extra bonus for me is that the Berlin InterContinental Hotel, where the conference is held, has a Chess theme for their conference centre.  The merging of Chess with online testing and analytics? Something not to be missed.

The colour scheme for the rooms is very black-and-white; the rooms have names like King, Rook and Knight; there are Chess sets in every room, and each room has at least one enlarged photo print of a Chess-themed portrait. You didn't think a Chess-themed portrait was possible?  Here's a sample of the pictures from last year (it's unlikely that they've been changed).  From left to right, top to bottom:  white bishop, white king; white king; black queen; white knight; white knight with black rook (I think). 


Thursday, 29 May 2014

Changing Subject

I have never written politically before.  I don't really hold strong political views, and while I vote almost every time there's an election, I don't consider myself strongly affiliated with any political party.

However, the recent statement by the Secretary of State for Eduction, Rt Hon Michael Gove, has really irritated me.  He has said that he wants to reduce the range of books and literature that is studied in high schools, so that pupils will only study British authors - Shakespeare and Austen, for example.  Academics and teachers have drawn particular attention to the axing of "To Kill A Mockingbird" (which I haven't studied) since it was written by an American author.

Current Secretary of State for Education, Michael Gove
Image credit
This has particular resonance for me since I'm married to an English teacher, and she was annoyed by this decision. I'm not particularly interested in English Literature - I passed my exam when I was 16, and that's it.  Yes, I read - fiction and non-fiction alike - but only because I enjoy occasional reading, not because I studied literature in depth.

Quoting from the Independent newspaper's website, where they quote the Department for Education:

"In the past, English literature GCSEs were not rigorous enough and their content was often far too narrow. We published the new subject content for English literature in December."


Does anybody else find it ironic that they think reducing the scope of the literature to be studied will prevent the GCSE from becoming too narrow?

Aside from that, it occurred to me - what if this is the thin end of the wedge?  What if this British-centredness is to continue throughout all the other subjects?  What might they look like?  As I said, I have no personal specific interest in English Literature, but I wonder if Mr Gove has plans for the rest of the syllabus.  Could you imagine the way the DfE would share his latest ideas?  Highlighting how strange his decision on English Literature is, here is a view of how other subjects could be affected.



The New 'British' GCSE Syllabus


Chemistry

Only the chemical elements that have been discovered by British scientists will be studied.  Oxygen, hydrogen, barium, tungsten and chlorine are all out, having been discovered by the Swede Carl Wilhelm Scheele, even though other scientists published their findings first.  Scottish scientist William Ramsay discovered the noble gases, so they can stay in the syllabus, and so can most of the Group 1 and 2 metals, which were isolated by Sir Humphry Davy.  Lead, iron, gold and silver are all out, since they were discovered before British scientists were able to identify and isolate them.  And this brings me to the next subject:


History

Only historical events pertaining to the UK are to be included in the new syllabus.  The American Civil War is to be removed.  The First World War I is to be reduced to a chapter, and the Second World War to a paragraph, with much more emphasis to be given to the Home Front. 

Biology

Only plants and animals which are native to the UK are to be studied, because previously, science "GCSEs were not rigorous enough and their content was often far too narrow." All medicine which can be attributed to Hippocrates is out.  Penicillin (Alexander Fleming) to stay in.

Maths

Fibonacci - out.  da Vinci - out.  Most geometry (Pythagoras, Euclid) - out.  Calculus to focus exclusively on Newton, and all mention of Liebniz is to be removed.  In order to aid integration with Europe, emphasis must shared between British imperial measurements and the more modern metrics which our European colleagues use.

Physics

Astronomy to be taught with the earth-centric model, since the heliocentric view of the Earth going around the Sun was devised by an Italian, Galilei Galileo.  The Moon landing (American) is out.  The Higgs Boson can stay, although its discovery in Switzerland is a border-line case.  Gravity, having been explained by Isaac Newton, can stay in.


Foreign Languages

By their very nature, foreign languages are not British, and their study will probably not be rigorous enough, with content that's far too narrow.  However, in order to aid integration with our European business colleagues and government, foreign languages are to be kept.  However, this is to be limited to relevant business and economic vocabulary, and more time is to be spent learning the history of the English language instead.  Preferably by rote.

Economics

In a move which follows Mr Gove's moves towards a 1940s syllabus, economics will now focus on pounds, shilling and pence. Extra maths lessons will be given to explain how the pre-decimalised system works.  The modern pounds and pence system is to be studied, but only to enable pupils to understand how European exchange rates work. 

Changes are not planned for 'easier' GCSEs like Media Studies; Leisure and Tourism; Hospitality or Health and Social Care, since they're being axed anyway.



So, having made a few minor tweaks to the syllabus, we now have one which Mr Gove would approve of, and which would probably be viewed by the DfE as more rigorous and less narrow.  Frightening, isn't it?

Wednesday, 14 May 2014

Testing - which recipe got 197% uplift in conversion?

We've all seen them.  Analytics agencies and testing software providers alike use them:  the headline that says, 'our customer achieved 197% conversion lift with our product'And with good reason.  After all, if your product can give a triple-digit lift in conversion, revenue or sales, then it's something to shout about and is a great place to start a marketing campaign.

Here are a just a few quick examples:

Hyundai achieve a 62% lift in conversions by using multi-variate testing with Visual Website Optimizer.

Maxymiser show how a client achieved a 23% increase in orders

100 case studies, all showing great performance uplift




It's great.  Yes, A/B testing can revolutionise your online performance and you can see amazing results.  There are only really two questions left to ask:  why and how?

Why did recipe B achieve a 197% lift in conversions compared to recipe A?  How much effort, thought and planning went into the test? How did you achieve the uplift?  Why did you measure that particular metric?  Why did you test on this page?  How did you choose which part of the page to test?  How many hours went into the planning for the test?

There is no denying that the final results make for great headlines, and we all like to read the case studies and play spot-the-difference between the winning recipe and the defeated control recipe, but it really isn't all about the new design.  It's about the behind-the-scenes work that went into the test.  Which page should be tested?  It's about how the design was put together; why the elements of the page were selected and why the decision that was taken to run the test.  There are hours of planning; analysing data and writing a hypothesis that sit behind the good tests.  Or perhaps the testing team just got lucky?  

How much of this amazing uplift was down to the tool, and how much of it was due to the planning that went into using the tool?  If your testing program isn't doing well, and your tests aren't showing positive results, then probably the last thing you need to look at is the tool you're using.  There are a number of other things to look at first (quality of hypothesis and quality of analysis come to mind as starting points).

Let me share a story from a different situation which has some interesting parallels.  There was considerable controversy around the Team GB Olympic Cycling team's performance in 2012.  The GB cyclists achieved remarkable success in 2012, winning medals in almost all the events they entered.  This led to some questions around the equipment they were using - the British press commented that other teams thought they were using 'magic' wheels.  Dave Brailsford, the GB cycling coach during the Olympics, once joked that some of the competitors were complaining about the British team's wheels being more round

Image: BBC

However, Dave Brailsford previously mentioned (in reviewing the team's performance in the 2008 Olympics, four years earlier) that the team's successful performances there were due to the "aggregation of marginal gains"in the design of the bikes and equipment, which is perhaps the most concise description of the role of the online testing manager.  To quote again from the Team Sky website:


The skinsuit did not win Cooke the gold medal. The tyres did not win her the gold medal. Nor did her cautious negotiation of the final corner. But taken together, alongside her training and racing programme, the support from her team-mates, and her attention to many other small details, it all added up to a significant advantage - a winning advantage.
Read more at http://www.teamsky.com/article/0,27290,17547_5792058,00.html#zuO6XzKr1Q3hu87X.99
The skinsuit did not win Cooke the gold medal. The tyres did not win her the gold medal. Nor did her cautious negotiation of the final corner. But taken together, alongside her training and racing programme, the support from her team-mates, and her attention to many other small details, it all added up to a significant advantage - a winning advantage.
Read more at http://www.teamsky.com/article/0,27290,17547_5792058,00.html#zuO6XzKr1Q3hu87X.99
"The skinsuit did not win Cooke [GB cyclist] the gold medal. The tyres did not win her the gold medal. Nor did her cautious negotiation of the final corner. But taken together, alongside her training and racing programme, the support from her team-mates, and her attention to many other small details, it all added up to a significant advantage - a winning advantage."

It's not about wild new designs that are going to single-handedly produce 197% uplifts in performance, it's about the steady, methodical work in improving performance step by step by step, understanding what's working and what isn't, and then going on to build on those lessons.  As an aside, was the original design really that bad, that it could be improved by 197% - and who approved it in the first place?

It's certainly not about the testing tool that you're using, whether it's Maxymiser, Adobe's Test and Target, or Visual Website Optimizer, or even your own in-house solution.  I would be very wary of changing to a new tool just because the marketing blurb says that you should start to see 197% lift in conversion just by using it.

In conclusion, I can only point to this cartoon as a summary of what I've been saying.