Monday, 24 June 2013

Iterating, Creating, Risk and Reward - Discussion

My second huddle at XChange 2013 Berlin looked at what to test, how to set up a testing program and how to get management buy-in.  We talked about the best way to get a test program set up, how to achieve critical mass and how to build momentum for an online testing program.  

I was intending to revisit some of the topics from my earlier post on creative versus iterative testing, but the discussion (as with my first huddle on yesterday's winner, today's loser) very quickly went off on a tangent and never looked back!

There are a number of issues in either starting or building a testing program - here are a few that we discussed:

Lack of management buy-in
Selling web analytics and reporting is not always easy, especially if you're working in (or with) a company that's largely focused on high-street bricks-and-mortar presence, or if the company is historically telephone or catalogue.  Trying to sell the idea of online testing can be very tricky indeed.  "Why should we test - we know what's best anyway!" is a common response, but the truth is that intuition is rarely right 100% of the time; here are few counter-arguments that you may (or may not) want to try:

"Would you like to submit your own design to include in the test?"

"Could you suggest some other ideas for improving this banner/button/page?"
"Do you think there is a different way we could improve the page and reach/exceed our sales target?"

Other ways of getting management (and other staff, colleagues and stakeholders) to engage with the test is to ask them to guess which recipe or design will win - and put their names to it.  If you can market this well, then very quickly, people will start asking how the test is going, if their design is winning.  Better still, if their design is losing, they'll probably want to know why, and might even start (1) interrogating the data and (2) designing a follow-up test.

As we commented during our discussion, it's worth saying that you may need to distinguish between a bad recipe and a good manager.  "Yes, you are still a good analyst or manager or designer, it's just that people didn't like your design."

Lack of resource
This could be a lack of IT support, design support or JavaScript developer time.  Almost all tests are dependent on some sort of IT and design support (although I have heard of analysts and testers testing their own Photoshop creative).  It's difficult - as we'll see below - because without design support, you are restricted in what you can test.  However, there are a number of test areas that you can work on which are light on design, are light on code maintenance, and which could potentially show useful (and even positive) test results.

 - banner imagery - to include having people or no people; a picture of the product or no product
 - banner wording - buy-one-get-one-free, or two-for-one, or 50% off?  Or maybe even 'Half price'?  Wording will probably require even less design work than imagery, and you (as the tester, or analyst) may even be able to set this one up yourself.
 - calls to action - Continue?  Add to cart? Add to basket?  Select? Make payment?  This site has a huge gallery of continue shopping buttons, (when a customer has added an item to basket, and you want to persuade them to keep shopping).  There are some suggestions on which may work best - and they don't even change the wording.  There are many other things to try - colour; arrow or no arrow; CAPITALS or Initial Capitals? 

The advantage of these tests is that they can be carried out on the same area of the same page - once the test code has been inserted by the IT or JavaScript teams, you can set up a series of tests just by changing the creative that is being tried.  Many of those in the huddle said that once they had obtained a winner, they would then push that to 100% traffic through the testing software until the next test was ready - further reducing the dependency on IT support.

How to sell flat results

There is nothing worse for an analyst or tester to find out that the test results are flat (there's no significant difference in the performance of the test recipes - all the results were the same).  The test has taken months to sell, weeks to design and code, and a few weeks to run, and the results say that there's no difference between the original version (which may have had management backing) and your new analytics-backed version.  And what do you get?  "You said that online testing would improve our performance by 2%, 5%, 7.5%..."

Actually, the results only appear to say there's no difference... so it's time to do some digging!

Firstly, was the difference between the two test recipes large enough and distinct enough?  One member of the huddle quoted the Eisenberg brothers: "If you ask people if they prefer green apples or red apples, you're unlikely to get a difference.  If you ask them if they prefer apples or chocolate, you'll see a result."

This is something to consider before the test - are the recipes different enough?  It's not always easy to say in advance (!) and there is a greater risk of the test recipe losing if the design is too different, but that's the point - iterating is 'safer' than creating, but does include the possibility that it may go flat.  How much risk you're prepared to take may depend on external factors such as how much design resource you can obtain and how important it is to get a non-zero result.

Secondly:  analysing flat results will require some concerted data analysis.  Overall, the number of orders for the two recipes, and the average order value were the same...

But how many people clicked on the new banner?  Or how many people bounced or exited from the test page?

Did you get more people to click on your new call-to-action button - and then those people left at the next page? Why?

Did the banner work better for higher-value customers, who then left on the next page because the item they were actually looking for wasn't featured?  Did all visitor segments behave in the same way?

Was there a disconnect between the call to action and the next page?  Was the next page really what people would have expected?  

Did you offer a 50%-off deal but then not make it clear in the checkout process?  It's human nature to study and review a test loss, to accept a win without too much study and to completely write off a flat result, but by applying the same level of rigour to a 'flat' result as to a loss, it's still possible to learn something valuable.

How do you set up a testing program?
We discussed how managers and clients generally prefer to start a testing program in the checkout process - it's a nice, easy, linear funnel with plenty of potential for optimisation, and it's very close to the money.  If you improve a checkout page, then the financial metrics will automatically improve as a result.

But how do you test in the product description pages, where visitors browse around before selecting an item?  We talked about page purpose:  what is the idea of a page?  What's the main action that you want a user to take after they have seen this page?  Is it to complete a lead generation form?  Is it to call the sales telephone line?  Is it to 'add to cart'?  The success metric is for the page should be the key success metric for the test.  You'll need to keep an eye on the end-of-funnel metrics (conversion, order value, and so on) but providing those are flat or trending positively, then you can use the page-purpose metrics to measure the success of your test.  If you're tracking an offline conversion (call the sales line, for example) then you'll need to do some extra preparatory work, for example by setting up one telephone line per recipe and then arrange to track the volumes of telephone calls - but it'll make the test result more useful.

Tracking page-purpose success metrics will also enable to you to run tests more quickly.  If you can see a definite, confident lift in a page-purpose metrics, while the overall financial metrics are flat or positive, then you can call a winner before you reach confidence in the overall metrics.  The further you are from the checkout process (and the final order page), the longer it is likely to take for an uplift in page performance to filter through to the financial results (in terms of testing time), but you can be happy that you are improving your customers' experience.


Another valuable way of helping to build a testing program, and enabling it to develop, is to document your tests.  When a test is completed, you'll probably be presenting to the management and the stakeholders - this is also a great opportunity to present to the people who contributed to your test: the designers, the developers, IT and so on.  This applies especially if the test is a winner!

When the presentation is completed, file the results deck on a network drive, or somewhere which is widely accessible.  Start to build up a list of test recipes, results and findings.  We discussed if this is a worthwhile exercise - it's time-consuming, laborious and if there's only one analyst working on the test program, it seems unnecessary.

However, this has a number of benefits:

- you can start to iterate on previous tests (winners, losers and flat results), and this means that future tests are more likely to be successful ("We did this three weeks ago and the results were good, less try to make them even better")

- you can avoid repeating tests, which is a waste of time, resource and energy ("We did this two months ago and the results were negative")

- you can start to understand your customers' behaviour and design new tests (based on the data) which are more likely to win.  ("This test showed our visitors preferred this... therefore I suspect they will also prefer this...).

It's also useful when and if the team starts to grow (which is a positive result of a growing testing program) as you can share all the previous learnings.  

These benefits will help the testing program gain momentum, so that you can start iterating and spend less time repeating yourself.  Hopefully, you'll find that you have fewer meetings where you have to sell the idea of testing - you can point back at prior wins and say to the management, "Look, this worked and achieved 3% lift," and, if you're feeling brave, "And look, you said this recipe would win and it was 5% below the control recipe!"

The discussion ran for 90 minutes, and we discussed even more than this... I just wish I'd been able to write it all down.  I'd like to thank all the huddle participants, who made this a very interesting and enjoyable huddle!

Wednesday, 19 June 2013

Why is yesterday's test winner today's loser?

This post comes out of the xChange Berlin huddle which I led on 11 June 2013.  xChange is very different from most web analytics conferences - most conferences have speakers and presentations, but xChange is focused around web analytics professionals meeting and discussing in small workshop groups.  As the xChange website describes it:
"Expressly designed for enterprise analytics managers and digital marketing and measurement practitioners, X Change brings together top professionals in the field in a no-sales, all business, peer-to-peer environment for deep-dives into cutting edge online measurement topics."

At xChange Berlin 2013, I led two huddle groups - this was the first, entitled, "Why is yesterday's test winner today's loser?".  I haven't attributed the content here to any particular participant - this is just a summary of our discussions.  I should say now that the discussion was not even close to what I'd anticipated, but was even more interesting as a result!

The discussion kicked off with a review of a test win.  
Let's suppose that you have run your A/B test, and you have a winner.  You ran it for long enough to achieve statistical significance and even achieved consistent trend lines.  But somehow, when you implemented it, your financial metrics didn't show the same level of improvement as your test results.  And now, the boss has come to your desk to ask if your test was really valid.  "What happened?  Why is yesterday's test winner today's loser?"

There are a number of reasons for this - let's take a look.

External factors
Yes, A/B tests split your traffic evenly between the test recipes, so that most external factors are accounted for.  But what happens if your test was running while you had a large-scale TV campaign, or display or PPC campaign?  Yes, that traffic would have been split between your test recipes, so the effect is - apparently - mitigated.  But what if the advertising campaign resonated with your test recipe, which went on to win.  During the non-campaign period, the control recipe would be better, or perhaps the results would have been more similar.  Consequently, the uplift that you saw during the test would not be achieved in normal conditions.

Customer Experience Changes
When we start a test, there is quite often a dip in performance for the test recipe.  It's new.  It's unfamiliar and users have to become accustomed to it.  It often takes a week or so for visitors to get used to it, and for accurate, meaningful and useful test results to develop.  In particular, frequent repeat visitors will take some time to adjust to the changes (how often repeat visitors return will depend on your site).  The same issue applies when you implement a winner - now, the whole population is seeing a new design, and it will take some time for them to adjust.

Visitor Segments
Perhaps the test recipe worked especially well with a particular visitor segment?  Maybe new visitors, or search visitors, or visitors from social media, and that was responsible for the uplift.  You have assumed (one way or another) that your population profile is fairly constant.  But if you identify that your test recipe won because one or two segments really engaged with it, then you may not see the uplift if your population profile changes.  What should you do instead?  Set up a targeting implementation: target specific visitors, based on your test results, who engaged more (or converted better) with the test recipe.  Show everybody else the same version of your site as usual, but for visitors who fit into a specific segment - show them the test recipe.  I'll discuss targeting again at a later date, but here's a post I wrote a few months ago about online personalisation.

Time lapse between test win and implementation
This varied around the members of the group - where a company has a test plan, and there's a need to get a test up and running, it may not be possible to implement straight away.  It also depends on what's being tested - can the test recipe be implemented immediately through the site team or CMS, or will it require IT roadmap work?  Most of the group would use either the testing software (for example, Test and Target, or Visual Website Optimiser) and immediately set a winning recipe to 100% traffic (or 95%) until the change could be made permanently.  Setting a winning recipe to 95% instead of 100% in effect enables the test to run for longer - you can continue to show that the test recipe is winning.  It also means that visitors who were in the control group during the test (i.e. saw "Recipe A") will continue to see that recipe until the implementation is complete - better customer experience for that group?  Something to think about!

My next post will be about the second huddle that I led, which was based on iterating vs creating.  The title came from my recent blog post on iterative testing, but the discussion went in a very different direction, and again, was better for it!