Web Optimisation, Maths and Puzzles: iterative


Showing posts with label iterative.

Wednesday, 16 July 2014

When to Quit Iterative Testing: Snakes and Ladders



I have blogged a few times about iterative testing, the process of using one test result to design a better test and then repeating the cycle of reviewing test data and improving the next test.  But there are instances when it's time to abandon iterative testing, and play analytical snakes and ladders instead.  Surely not?  Well, there are some situations where iterative testing is not the best tool (or not a suitable tool) to use in online optimisation, and it's time to look at other options. 

Three situations where iterative testing is totally unsuitable:

1.  You have optimised an area of the page so well that you're now seeing the law of diminishing returns - your online testing is showing smaller and smaller gains with each test and you're reaching the top of the ladder.
2.  The business teams have identified another part of the page or site that is a higher priority than the area you're testing on.
3.  The design teams want to test something game-changing, which is completely new and innovative.

This is no bad thing.

After all, iterative testing is not the be-all-and-end-all of online optimisation.  There are other avenues that you need to explore, and I've mentioned previously the difference between iterative testing and creative testing.  I've also commented that fresh ideas from outside the testing program (typically from site managers who have sales targets to hit) are extremely valuable.  All you need to work out is how to integrate these new ideas into your overall testing strategy.  Perhaps your testing strategy is entirely focused on future-state (it's unlikely, but not impossible).  Sometimes, it seems, iterative testing is less about science and hypotheses, and more like a game of snakes and ladders.

Three reasons I've identified for stopping iterative testing:

1.  It's quite possible that you reach the optimal size, colour or design for a component of the page.  You've followed your analysis step by step, as you would follow a trail of clues or footsteps, and it's led you to the top of a ladder (or a dead end) and you really can't imagine any way in which the page component could be any better.  You've tested banners, and you know that a picture of a man performs better than a picture of a woman, that text should be green, that the call-to-action button should be orange and that the best wording is "Find out more."  But perhaps you've only tested having people in your banner - you've never tried showing just your product, and it's time to abandon iterative testing and leap into the unknown.  It's time to try a different ladder, even if it means sliding down a few snakes first.

2.  The business want to change focus.  They have sales performance data, or sales targets, which focus on a particular part of the catalogue:  men's running shoes; ladies' evening shoes, or high-performance digital cameras.  Business requests can change far more quickly than test strategies, and you may find yourself playing catch-up if there's a new priority for the business.  Don't forget that it's the sales team who have to maintain the site, meet the targets and maximise their performance on a daily basis, and they will be looking for you to support their team as much as plan for future state.  Where possible, transfer the lessons and general principles you've learned from previous tests to give yourself a head start in this new direction - it would be tragic if you have to slide down the snake and start right at the bottom of a new ladder.

3.  On occasion, the UX and design teams will want to try something futuristic that exploits the capabilities of new technology (such as Scene 7 integration, AJAX, a new API, XHTML... whatever).  If the executive in charge of online sales, design or marketing has identified or sponsored a brand new online technology that will probably revolutionise your site's performance, and he or she wants to test it, then it'll probably get fast-tracked through the testing process.  However, it's still essential to carry out due diligence in the testing process, to make sure you have a proper hypothesis and not a HIPPOthesis.  When you test the new functionality, you'll want to be able to demonstrate whether or not it's helped your website, and how and why.  You'll need to have a good hypothesis and the right KPIs in place.  Most importantly - if it doesn't do well, then everybody will want to know why, and they'll be looking to you for the answers.  If you're tracking the wrong metrics, you won't be able to answer the difficult questions.

As an example, Nike have an online sports shoe customisation option - you can choose the colour and design for your sports shoes, using an online palette and so on.  I'm guessing that it went through various forms of testing (possibly even A/B testing) and that it was approved before launch.  But which metrics would they have monitored?  Number of visitors who tried it?  Number of shoes configured?  Or possibly the most important one - how many shoes were purchased?  Is it reasonable to assume that, because it's worked for Nike, it will work for you when you're looking to encourage users to select car trim colours, wheel style, interior material and so on?  Or are you creating something that's adding to a user's workload and making it less likely that they will actually complete the purchase?

So, be aware:  there are times when you're climbing the ladder of iterative testing that it may be more profitable to stop climbing, and try something completely different - even if it means landing on a snake!

Wednesday, 14 May 2014

Testing - which recipe got 197% uplift in conversion?

We've all seen them.  Analytics agencies and testing software providers alike use them:  the headline that says, 'our customer achieved 197% conversion lift with our product.'  And with good reason.  After all, if your product can give a triple-digit lift in conversion, revenue or sales, then it's something to shout about and a great place to start a marketing campaign.

Here are just a few quick examples:

Hyundai achieve a 62% lift in conversions by using multi-variate testing with Visual Website Optimizer.

Maxymiser show how a client achieved a 23% increase in orders

100 case studies, all showing great performance uplift


It's great.  Yes, A/B testing can revolutionise your online performance and you can see amazing results.  There are only really two questions left to ask:  why and how?

Why did recipe B achieve a 197% lift in conversions compared to recipe A?  How much effort, thought and planning went into the test? How did you achieve the uplift?  Why did you measure that particular metric?  Why did you test on this page?  How did you choose which part of the page to test?  How many hours went into the planning for the test?

There is no denying that the final results make for great headlines, and we all like to read the case studies and play spot-the-difference between the winning recipe and the defeated control recipe, but it really isn't all about the new design.  It's about the behind-the-scenes work that went into the test.  Which page should be tested?  How was the design put together?  Why were those elements of the page selected, and why was the decision taken to run the test?  Hours of planning, data analysis and hypothesis writing sit behind the good tests.  Or perhaps the testing team just got lucky?

How much of this amazing uplift was down to the tool, and how much of it was due to the planning that went into using the tool?  If your testing program isn't doing well, and your tests aren't showing positive results, then probably the last thing you need to look at is the tool you're using.  There are a number of other things to look at first (quality of hypothesis and quality of analysis come to mind as starting points).

Let me share a story from a different situation which has some interesting parallels.  There was considerable controversy around the Team GB Olympic Cycling team's performance in 2012.  The GB cyclists achieved remarkable success, winning medals in almost all the events they entered.  This led to some questions around the equipment they were using - the British press commented that other teams thought they were using 'magic' wheels.  Dave Brailsford, the GB cycling coach during the Olympics, once joked that some of the competitors were complaining about the British team's wheels being 'more round'.

Image: BBC

However, Dave Brailsford previously mentioned (in reviewing the team's performance in the 2008 Olympics, four years earlier) that the team's successful performances there were due to the "aggregation of marginal gains" in the design of the bikes and equipment, which is perhaps the most concise description of the role of the online testing manager.  To quote again from the Team Sky website:


"The skinsuit did not win Cooke [GB cyclist] the gold medal. The tyres did not win her the gold medal. Nor did her cautious negotiation of the final corner. But taken together, alongside her training and racing programme, the support from her team-mates, and her attention to many other small details, it all added up to a significant advantage - a winning advantage."

It's not about wild new designs that are going to single-handedly produce 197% uplifts in performance; it's about the steady, methodical work of improving performance step by step by step, understanding what's working and what isn't, and then going on to build on those lessons.  As an aside, was the original design really so bad that it could be improved by 197% - and who approved it in the first place?

It's certainly not about the testing tool that you're using, whether it's Maxymiser, Adobe's Test and Target, or Visual Website Optimizer, or even your own in-house solution.  I would be very wary of changing to a new tool just because the marketing blurb says that you should start to see 197% lift in conversion just by using it.

In conclusion, I can only point to this cartoon as a summary of what I've been saying.



Thursday, 1 May 2014

Iterative Testing - Follow the Numbers

Testing, as I have said before, is great.  It can be adventurous, exciting and rewarding to try out new ideas for the site (especially if you're testing something that IT can't build out yet) with pie-in-the-sky designs that address every customer complaint that you've ever faced.  Customers and visitors want bigger pictures, more text and clearer calls to action, with product videos, 360 degree views and a new Flash or Scene 7 interface that looks like something from Minority Report or Lost in Space.
Your new user interface, Minority Report style?  Image credit
That's great - it's exciting to be involved in something futuristic and idealised, but how will it benefit the business teams who have sales targets to reach for this month, quarter or year?  They will accept that some future-state testing is necessary, but will want to optimise current state, and will probably have identified some key areas from their sales and revenue data.  They can see clearly where they need to focus the business's optimisation efforts and they will start synthesising their own ideas.

And this is all good news.  You're reviewing your web analytics tools to look at funnels, conversion, page flow and so on; you may also have session replay and voice-of-the-customer information to wade through periodically, looking for a gem of information that will form the basis of a test hypothesis.  Meanwhile, the business and sales teams have already done this (from their own angle, with their own data) and have come up with an idea.

So you run the test - you have a solid hypothesis (either from your analytics, or from the business's data) and a good idea on how to improve site performance.

But things don't go quite to plan; the results are negative, conversion is down or the average order value hasn't gone up.  You carry out a thorough post-test analysis and then get everybody together to talk it through.  Everybody gathers around a table (or dials in to a call with a screen-share ;-) ) - everybody turns up:  the design team, the business managers, the analysts... everybody with a stake in the test, and you talk it through.  Sometimes, good tests fail.  Sometimes, the test wins (this is also good, but for some reason, wins never get quite as much scrutiny as losses).

And then there's the question:  "Well, we did this in this test recipe, and things improved a bit, and we did that in the other test recipe and this number changed:  what happens if we change this and that?"  Or, "Can we run the test again, but make this change as well?"


These are great questions.  As a test designer, you'll come to love these questions, especially if the idea is supported by the data.  Sometimes, iterative testing isn't sequential testing towards an imagined optimum; sometimes it's brainstorming based on data.  To some extent, iterative testing can be planned out in advance as a long-term strategy where you analyse a page, look at the key elements in it and address them methodically.  Sometimes, iterative testing can be exciting (it's always exciting, just more so in these cases) and take you in directions you weren't expecting.  You may have thought that one part of the page (the product image, the ratings and reviews, the product blurb) was critical to the page's performance, but during the test review meeting, you find yourself asking "Can we change this and that?  Can we run the test with a smaller call to action and more peer reviews?"  And why not?  You already have the makings of a hypothesis and the data to support it - your own test data, in fact - and you can sense that your test plan is going in the right direction (or maybe totally the wrong direction, but at least you know which way you should be going!).


It reminds me of the quote (attributed to a famous scientist, though I can't recall which one), who said, "The development of scientific theory is not like the construction of fairy castles, but more like the methodical laying of one brick on another."  It's okay - in fact it's good - to have a test strategy lined up, focusing on key page elements or on page templates, but it's even more interesting when a test result throws up questions like, "Can we test X as well as Y?" or "Can we repeat the test with this additional change included?"

Follow the numbers, and see where they take you.  It's a little like a dot-to-dot picture, where you're drawing the picture and plotting the new dots as you go, which is not the same as building the plane while you're flying in it ;-).  
 
Follow the numbers.  Image credit

One thing you will have to be aware of is that you are following the numbers.  During the test review, you may find a colleague who wants to test their idea because it's their pet idea (recall the HIPPOthesis I've mentioned previously). Has the idea come from the data, or an interpretation of it, or has it just come totally out of the blue?  Make sure you employ a filter - either during the discussion phase or afterwards - to understand if a recipe suggestion is backed by data or if it's just an idea.  You'll still have to do all the prep work - and thankfully, if you're modifying and iterating, your design and development team will be grateful that they only need to make slight modifications to an existing test design.

Yes, there's scope for testing new ideas, but be aware that they're ideas, backed by intuition more than data, and are less likely (on average) to be successful; I've blogged on this before when I discussed iterating versus creating.  If your testing program has limited resource (and whose doesn't?) then you'll want to focus on the test recipes that are more likely to win - and that means following the numbers.

Wednesday, 3 July 2013

Getting an Online Testing Program Off The Ground

One of the unplanned topics from one of my xChange 2013 huddles was how to get an online testing program up and running, and how to build its momentum.  We were discussing online testing more broadly, and this subject came up.  Getting a test program up and running is not easy, but during our discussion a few useful hints and tips emerged, and I wanted to add to them here.
Sometimes, launching a test program is like defying gravity.

Selling plain web analytics isn't easy, but once you have a reporting and analytics program up and running, and you're providing recommendations which are supported by data and seeing improvements in your site's performance, then the next step will probably be to propose and develop a test.  Why test?


On the one hand, if your ideas and recommendations are being wholeheartedly accepted by the website's management team, then you may never need to resort to a test.  If you can show with data (and other sources, such as survey responses or other voice-of-customer feedback) that there's an issue on your site, if you can use your reporting tools to show what the problem probably is, get the site changed based on your recommendations, and then see an improvement, then you don't need to test.  Just implement!

However, you may find that you have a recommendation, backed by data, that doesn't quite get universal approval. How would the conversation go?

"The data shows that this page needs to be fixed - the issue is here, and the survey responses I've looked at show that the page needs a bigger/smaller product image."
"Hmm, I'm not convinced."

"Well, how about we try testing it then?  If it wins, we can implement; if not, we can switch it off."
"How does that work, then?"


The ideal 'we love testing' management meeting.  Image credit.

This is idealised, I know.  But you get the idea, and then you can go on to explain the advantages of testing compared to having to implement and then roll back (when the sales figures go south).


The discussions we had during xChange showed that most testing programs were being initiated by the web analytics team - there were very few (or no) cases where management started the discussion or wanted to run a test.  As web professionals, supporting a team with sales and performance targets, we need to be able to use all the online tools available to us - including testing - so it's important that we know how to sell testing to management, and get the resources that it needs.  From management's perspective, analytics requires very little support or maintenance (compared to testing) - you tag the site (once, with occasional maintenance) and then pay any subscriptions to the web analytics provider, and pay for the staff (whether that's one member of staff or a small team).  Then - that's it.  No additional resource needed - no design, no specific IT, no JavaScript developers (except for the occasional tag change, maybe).  And every week, the mysterious combination of analyst plus tags produces a report showing how sales and traffic figures went up, down or sideways.

By contrast, testing requires considerable resource.  The design team will need to provide imagery and graphics, guidance on page design and so on.  The JavaScript developers will need to put mboxes (or the test code) around the test area; the web content team will also need to understand the changes and make them as necessary.  And that's just for one test.  If you're planning to build up a test program (and you will be, in time) then you'll need to have the support teams available more frequently.  So - what are the benefits of testing?  And how do you sell them to management, when they're looking at the list of resources that you're asking for?

How to sell testing to management

1.  Testing provides the opportunity to do exactly that:  test something that the business is already thinking of changing.  A change of banners?  A new page layout?  As an analyst, you'll need to be ahead of the change curve to do this, and aware of changes before they happen, but if you get the opportunity then propose to test a new design before it goes live.  This has the advantage that most of the resource overhead is already accounted for (you don't need to design the new banner/page), but it has one significant disadvantage:  you're likely to find that there's a major bias towards the new design, and management may just go ahead and implement it anyway, even if the test shows negative results.

2.  A good track record of analytics wins will support your case for testing.  You don't have to go back to prior analysis or recommendations and be as direct as, "I told you so," but something like, "The changes made following my analysis and recommendations on the checkout pages have led to an improvement in sales conversion of x%." is likely to be more persuasive.  And this brings me neatly on to my next suggestion.

3.  Your main aim in selling testing is to ensure you can get the money for testing resources, and for implementation.  As I mentioned above, testing takes time, resource and expertise - or, to put it another way, money.  So you'll need to persuade the people who hold the money that testing is a worthwhile investment.  How?  By showing a potential return on that investment.

"My previous recommendation was implemented and achieved a £1k per week increase in revenue.  Additionally, if this test sees a 2% lift in conversion, that will be equal to £3k per week increase in revenue."

It's a bit of a gamble, as I've mentioned previously in discussing testing - you may not see a 2% lift in conversion, it may go flat or negative.  But the main focus for the web channel management is going to be money:  how can we use the site to make more money?  And the answer is: by improving the site.  And how do we know if we're improving the site? Because we're testing our ideas and showing that they're better than the previous version.
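To make that arithmetic concrete, here is a minimal sketch in Python.  The traffic, conversion and order-value figures are invented for illustration (they're not from this post), but with these assumptions a 2% relative lift in conversion works out at roughly £3k per week, in line with the example above.

# A minimal sketch of the revenue projection above. The visitor numbers,
# baseline conversion rate and average order value are hypothetical.
weekly_visitors = 50_000          # hypothetical weekly visitors to the funnel
baseline_conversion = 0.030       # hypothetical 3.0% conversion rate
average_order_value = 100.0       # hypothetical AOV in pounds

baseline_revenue = weekly_visitors * baseline_conversion * average_order_value
relative_lift = 0.02              # the 2% lift in conversion quoted above
projected_revenue = baseline_revenue * (1 + relative_lift)

print(f"Baseline weekly revenue:  £{baseline_revenue:,.0f}")
print(f"Projected weekly revenue: £{projected_revenue:,.0f}")
print(f"Estimated weekly uplift:  £{projected_revenue - baseline_revenue:,.0f}")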

If the test does win, you also have the follow-up argument:  "If you don't implement this winning recipe, it will cost..." - by then, you'll know exactly what the uplift is and you'll be able to present some useful financial data (assuming that yesterday's winner is not today's loser!).  Talk about £, $ or Euros... sometimes, it's the only language that management speak.


4.  Don't be afraid to carry out tests on the same part of a page.  I know I've covered this previously - but it reduces your testing overhead, and it also forces you to iterate.  It is possible to test the same part of a page without repeating yourself.  You will need a test program, because you'll be testing on the same part of a page, and you'll need to consult your previous tests (winners, losers and flat results) to make sure you don't repeat them.  And along the way, you'll have the chance to look at why a test won, or didn't, and try to improve.  That is iteration, and iteration is a key step from just testing to having a test program.

5.  Don't be afraid to start by testing small areas of a page.  Testing full-page redesigns is lengthy, laborious and risky.  You can get plenty of testing mileage out of testing completely different designs for a small part of a page - a banner, an image, wording... remember that testing is a management expense for the time being, not an investment, and you'll need to keep your overheads low and have good potential returns (either financial, or learning, but remember that management's primary language is money).


6.  Document everything!  As much as possible - especially if you're only doing one or two tests at a time.  Ask the code developers to explain what they've done, what worked, what issues they faced and how they overcame them.  It may be all code to you, but in a few months' time, when you're talking to a different developer who is not familiar with testing and test code, your documentation may be the only thing that keeps your testing program moving.

Also - and I've mentioned this before - document your test designs and your results.  Even if you're the only test analyst in your company, you'll need a reference library to work from, and one day, you might have a colleague or two and you'll need to show them what you've done before.

So, to wrap up - remember - it's not a problem if somebody decides to implement your proposed change without testing it first.  "No, we won't test that, we'll implement it straight away."  You made a compelling case for a change - and then you (representing the data) and management (representing gut feeling and intuition) agreed on a course of action.  Wins all round.

Setting up a testing program and getting management involvement requires some sales technique, not just data and analysis, so it's often outside the analyst's usual comfort zone. However, with the right approach to management (talk their language, show them the benefits) and a small but scalable approach to testing, you should - hopefully - be on the way to setting up a testing program, and then helping your testing program to gain momentum.


Similar posts I've written about online testing

How many of your tests win?
The Hierarchy of A/B Testing

Monday, 24 June 2013

Iterating, Creating, Risk and Reward - Discussion

My second huddle at XChange 2013 Berlin looked at what to test, how to set up a testing program and how to get management buy-in.  We talked about the best way to get a test program set up, how to achieve critical mass and how to build momentum for an online testing program.  

I was intending to revisit some of the topics from my earlier post on creative versus iterative testing, but the discussion (as with my first huddle on yesterday's winner, today's loser) very quickly went off on a tangent and never looked back!

There are a number of issues in either starting or building a testing program - here are a few that we discussed:

Lack of management buy-in
Selling web analytics and reporting is not always easy, especially if you're working in (or with) a company that's largely focused on its high-street bricks-and-mortar presence, or one that is historically a telephone or catalogue business.  Trying to sell the idea of online testing can be very tricky indeed.  "Why should we test - we know what's best anyway!" is a common response, but the truth is that intuition is rarely right 100% of the time; here are a few counter-arguments that you may (or may not) want to try:

"Would you like to submit your own design to include in the test?"

"Could you suggest some other ideas for improving this banner/button/page?"
"Do you think there is a different way we could improve the page and reach/exceed our sales target?"

Another way of getting management (and other staff, colleagues and stakeholders) to engage with the test is to ask them to guess which recipe or design will win - and put their names to it.  If you can market this well, then very quickly, people will start asking how the test is going and whether their design is winning.  Better still, if their design is losing, they'll probably want to know why, and might even start (1) interrogating the data and (2) designing a follow-up test.

As we commented during our discussion, it's worth saying that you may need to separate the quality of a recipe from the quality of the person who proposed it:  "Yes, you are still a good analyst or manager or designer; it's just that people didn't like your design."


Lack of resource
This could be a lack of IT support, design support or JavaScript developer time.  Almost all tests are dependent on some sort of IT and design support (although I have heard of analysts and testers testing their own Photoshop creative).  It's difficult - as we'll see below - because without design support, you are restricted in what you can test.  However, there are a number of test areas that you can work on which are light on design, are light on code maintenance, and which could potentially show useful (and even positive) test results.

 - banner imagery - to include having people or no people; a picture of the product or no product
 - banner wording - buy-one-get-one-free, or two-for-one, or 50% off?  Or maybe even 'Half price'?  Wording will probably require even less design work than imagery, and you (as the tester, or analyst) may even be able to set this one up yourself.
 - calls to action - Continue?  Add to cart?  Add to basket?  Select?  Make payment?  This site has a huge gallery of 'continue shopping' buttons (shown when a customer has added an item to basket and you want to persuade them to keep shopping).  There are some suggestions there on which may work best - and they don't even change the wording.  There are many other things to try - colour; arrow or no arrow; CAPITALS or Initial Capitals?

The advantage of these tests is that they can be carried out on the same area of the same page - once the test code has been inserted by the IT or JavaScript teams, you can set up a series of tests just by changing the creative that is being tried.  Many of those in the huddle said that once they had obtained a winner, they would then push that to 100% traffic through the testing software until the next test was ready - further reducing the dependency on IT support.
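Your testing tool handles the traffic allocation for you, but as a rough sketch of the idea - bucketing each visitor consistently into a recipe, then pushing the winner to 100% simply by changing the weights - here is a minimal Python illustration.  The recipe names and weights are made up, and this is not how any particular tool implements it.

import hashlib

def assign_recipe(visitor_id, test_name, weights):
    """Deterministically bucket a visitor into a recipe according to weights."""
    digest = hashlib.md5(f"{test_name}:{visitor_id}".encode()).hexdigest()
    point = (int(digest, 16) % 10_000) / 10_000   # stable value in [0, 1)
    cumulative = 0.0
    for recipe, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return recipe
    return list(weights)[-1]                      # guard against rounding

# 50/50 split while the test runs...
print(assign_recipe("visitor-123", "banner_test", {"control": 0.5, "new_banner": 0.5}))
# ...then push the winner to 100% of traffic until the next test is ready.
print(assign_recipe("visitor-123", "banner_test", {"new_banner": 1.0}))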

How to sell flat results

There is nothing worse for an analyst or tester than finding out that the test results are flat (there's no significant difference in the performance of the test recipes - all the results were the same).  The test has taken months to sell, weeks to design and code, and a few weeks to run, and the results say that there's no difference between the original version (which may have had management backing) and your new analytics-backed version.  And what do you get?  "You said that online testing would improve our performance by 2%, 5%, 7.5%..."

Actually, the results only appear to say there's no difference... so it's time to do some digging!

Firstly, was the difference between the two test recipes large enough and distinct enough?  One member of the huddle quoted the Eisenberg brothers: "If you ask people if they prefer green apples or red apples, you're unlikely to get a difference.  If you ask them if they prefer apples or chocolate, you'll see a result."



This is something to consider before the test - are the recipes different enough?  It's not always easy to say in advance (!) and there is a greater risk of the test recipe losing if the design is too different, but that's the point - iterating is 'safer' than creating, but does include the possibility that it may go flat.  How much risk you're prepared to take may depend on external factors such as how much design resource you can obtain and how important it is to get a non-zero result.

Secondly:  analysing flat results will require some concerted data analysis.  Overall, the number of orders for the two recipes, and the average order value were the same...

But how many people clicked on the new banner?  Or how many people bounced or exited from the test page?

Did you get more people to click on your new call-to-action button - and then those people left at the next page? Why?

Did the banner work better for higher-value customers, who then left on the next page because the item they were actually looking for wasn't featured?  Did all visitor segments behave in the same way?


Was there a disconnect between the call to action and the next page?  Was the next page really what people would have expected?  


Did you offer a 50%-off deal but then not make it clear in the checkout process?  It's human nature to study and review a test loss, to accept a win without too much study and to completely write off a flat result, but by applying the same level of rigour to a 'flat' result as to a loss, it's still possible to learn something valuable.

How do you set up a testing program?
We discussed how managers and clients generally prefer to start a testing program in the checkout process - it's a nice, easy, linear funnel with plenty of potential for optimisation, and it's very close to the money.  If you improve a checkout page, then the financial metrics will automatically improve as a result.

But how do you test on the product description pages, where visitors browse around before selecting an item?  We talked about page purpose:  what is the idea of a page?  What's the main action that you want a user to take after they have seen this page?  Is it to complete a lead generation form?  Is it to call the sales telephone line?  Is it to 'add to cart'?  The success metric for the page should be the key success metric for the test.  You'll need to keep an eye on the end-of-funnel metrics (conversion, order value, and so on), but providing those are flat or trending positively, you can use the page-purpose metrics to measure the success of your test.  If you're tracking an offline conversion (a call to the sales line, for example) then you'll need to do some extra preparatory work - for example, setting up one telephone line per recipe and arranging to track the volumes of telephone calls - but it will make the test result more useful.

Tracking page-purpose success metrics will also enable you to run tests more quickly.  If you can see a definite, confident lift in a page-purpose metric, while the overall financial metrics are flat or positive, then you can call a winner before you reach confidence in the overall metrics.  The further you are from the checkout process (and the final order page), the longer it is likely to take for an uplift in page performance to filter through to the financial results - but you can be happy that you are improving your customers' experience.
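As an illustration of 'a definite, confident lift in a page-purpose metric', here is a minimal sketch of a two-proportion z-test on a click-through metric.  The visitor and click counts are invented, and in practice your testing tool will report this confidence figure for you.

from math import sqrt, erf

def two_proportion_confidence(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-sided confidence that the two click-through rates really differ."""
    p_a = clicks_a / visitors_a
    p_b = clicks_b / visitors_b
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    return p_a, p_b, erf(abs(z) / sqrt(2))

# Hypothetical counts: 20,000 visitors per recipe, control CTR 6.0%, test CTR 6.6%.
p_a, p_b, confidence = two_proportion_confidence(1_200, 20_000, 1_320, 20_000)
print(f"Control CTR {p_a:.1%}, test CTR {p_b:.1%}, confidence {confidence:.1%}")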


Documentation

Another valuable way of helping to build a testing program, and enabling it to develop, is to document your tests.  When a test is completed, you'll probably be presenting to the management and the stakeholders - this is also a great opportunity to present to the people who contributed to your test: the designers, the developers, IT and so on.  This applies especially if the test is a winner!

When the presentation is completed, file the results deck on a network drive, or somewhere which is widely accessible.  Start to build up a list of test recipes, results and findings.  We discussed if this is a worthwhile exercise - it's time-consuming, laborious and if there's only one analyst working on the test program, it seems unnecessary.

However, this has a number of benefits:

- you can start to iterate on previous tests (winners, losers and flat results), and this means that future tests are more likely to be successful ("We did this three weeks ago and the results were good; let's try to make them even better")

- you can avoid repeating tests, which is a waste of time, resource and energy ("We did this two months ago and the results were negative")

- you can start to understand your customers' behaviour and design new tests (based on the data) which are more likely to win ("This test showed our visitors preferred this... therefore I suspect they will also prefer this.")

It's also useful when and if the team starts to grow (which is a positive result of a growing testing program) as you can share all the previous learnings.  

These benefits will help the testing program gain momentum, so that you can start iterating and spend less time repeating yourself.  Hopefully, you'll find that you have fewer meetings where you have to sell the idea of testing - you can point back at prior wins and say to the management, "Look, this worked and achieved 3% lift," and, if you're feeling brave, "And look, you said this recipe would win and it was 5% below the control recipe!"

The discussion ran for 90 minutes, and we discussed even more than this... I just wish I'd been able to write it all down.  I'd like to thank all the huddle participants, who made this a very interesting and enjoyable huddle!






Friday, 17 May 2013

A/B testing - how long to test for?


So, your test is up and running!  You've identified where to test and what to test, and you are now successfully splitting traffic between your test recipes.  How long do you keep the test running, and when do you call a winner?  You've heard about statistical significance and confidence, but what does it actually mean?

Anil Batra has recently posted on the subject of statistical significance, and I'll be coming to his article later, but for now, I'd like to begin with an analogy.

An analogy of A/B testing and Analysis

Let us suppose that two car manufacturers, Red-Top and Blue-Bottle have each been working on a new car design for the Formula 1 season, and each manufacturer believes that their car is the fastest at track racing.  The solution to this debate seems easy enough - put them against each other, side-by-side - one lap of a circuit, first one back wins.  However, neither team is particularly happy with this idea - there's discussion of optimum racing line, getting the apex of the bends right, and different acceleration profiles.  It's not going to be workable.

Some bright scientist suggests a time trial:  one lap, taken by each car (one after the other) and the quickest time wins.  This works, up to a point.  After all, the original question was, "Which car is the fastest for track racing?" and not, "Which car can go from a standing start to complete a lap quickest?" and there's a difference between the two.  Eventually, everybody comes to an agreement:  the cars will race and race until one of them has a clear overall lead - 10 seconds (for example), at the end of a lap.  For the sake of  this analogy, the cars can start at two different points on the circuit, to avoid any of the racing line issues that we mentioned before.  We're also going to ignore the need to stop for fuel or new tyres, and any difference in the drivers' ability - it's just about the cars.  The two cars will keep racing until there is a winner (a lead of 10 seconds) or until the adjudicators agree that neither car will accrue an advantage that large.


So, the two cars set off from their points on the circuit, and begin racing.  The Red-Top car accelerates very quickly from the standing start, and soon has a 1-second lead on the Blue-Bottle.  However, the Blue-Bottle has better brakes which enable it to corner better, and after 20 laps there's nothing in it.  The Blue-Bottle continues to show improved performance, and after 45 laps, the Blue-Bottle has built a lead of 6.0 seconds.  However, the weather changes from sunny to overcast and cloudy, and the Blue-Bottle is unable to extend its lead over the next 15 laps.  The adjudicators call it a day after 60 laps total.

So, who won?

There are various ways of analysing and presenting the data, but let's take a look at the data and work from there.  The raw data for this analysis is here:  Racing Car Statistical Significance Spreadsheet.


 This first graph shows the lap times for each of the 60 laps:


This first graph tells the same story as the paragraphs above:  laps 1-20 show no overall lead for either car; the blue car is faster from laps 20-45, then from laps 45-60 neither car gains a consistent advantage.  This second graph shows the cumulative difference between the performance of the two cars.  It's not one that's often shown in online testing tools, but it's a useful way of showing which car is winning.  If the red car is winning, then the time difference is negative; if the blue car is ahead, the time difference is positive, and the size of the lead is measured in seconds.
Graph 3, below, is a graph that you will often see (or produce) from online testing tools.  It's the cumulative average report - in this case, cumulative average lap time.  After each lap, the overall average lap time is calculated for all the laps that have been completed so far.  Sometimes called performance 'trend lines', these show at a glance a summary of which car has been winning, which car is winning now, and by how much.  Again, to go back to the original story, we can see how for the first 20 laps, the red car is winning; at 20 laps, the red and blue lines cross (indicating a change in the lead, from red to blue); from laps 20 to 45 we see the gap between the two lines widening, and then how they are broadly parallel from laps 45 to 60.
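If you want to reproduce graphs 2 and 3 yourself, here is a minimal Python sketch of the cumulative-difference and cumulative-average ('trend line') calculations.  The lap times below are synthetic draws around the overall averages quoted further down, not the actual figures from the spreadsheet.

import random

random.seed(1)
red_laps  = [random.gauss(102.32, 0.21) for _ in range(60)]   # synthetic red lap times
blue_laps = [random.gauss(102.22, 0.28) for _ in range(60)]   # synthetic blue lap times

cumulative_diff = []                                # graph 2: positive = blue ahead
cumulative_avg_red, cumulative_avg_blue = [], []    # graph 3: trend lines
red_total = blue_total = 0.0
for lap, (r, b) in enumerate(zip(red_laps, blue_laps), start=1):
    red_total += r
    blue_total += b
    cumulative_diff.append(red_total - blue_total)
    cumulative_avg_red.append(red_total / lap)
    cumulative_avg_blue.append(blue_total / lap)

print(f"Lead after 60 laps: {cumulative_diff[-1]:+.2f} s (positive = blue ahead)")
print(f"Cumulative average lap times: red {cumulative_avg_red[-1]:.2f} s, "
      f"blue {cumulative_avg_blue[-1]:.2f} s")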
So far, so good.  Graph 4, below, shows the distribution of lap times for the two cars.  This is rarely seen in online testing tools, and looks better suited to the maths classroom.  With this graph, it's not possible to see who was winning, or when, but it is possible to see who was faster overall.  This graph, importantly, shows the difference in performance in a way which can be analysed mathematically to show not only which car was winning, but how confident we can be that it was a genuine win, and not a fluke.  We can do this by looking at the average (mean) lap time for each car, and also at the spread of lap times.
This isn't going to become a major mathematical treatment, because I'm saving that for next time :-)  However, you can see here that on the whole, the blue car's lap times are faster (the blue peak is to the left, indicating a larger number of shorter lap times) but are slightly more spread out - the blue car has both the fastest and slowest times.

The maths results are as follows:

Overall -

Red:  average (mean) = 102.32 seconds; standard deviation (a measure of spread) = 0.21 seconds.

Blue:  average (mean) = 102.22 seconds (0.1 seconds faster per lap); standard deviation = 0.28 seconds (lap times are spread more widely).


Mathematically, if the average times for the two cars are two or more standard deviations apart, then we can say with roughly 95% confidence that the difference is significant (i.e. not due to noise, fluke or random chance).  In this case, the averages are only around half a standard deviation apart, so it's not possible to say that either car is really a winner.


But hang on, the blue car was definitely winning after 60 laps.  The reason for this is its performance between laps 20 and 45, when it was consistently building a lead over the red car (before the weather changed, in our story).  Let's take a look at the distribution of results for these 26 laps:

A very different story emerges.  The times for both cars have a much smaller spread, and the peak for the blue distribution is much sharper (in English, the blue car's performance was much more consistent from lap to lap).  Here are the stats for this section of the race:

Red:  average (mean) = 102.31 seconds; standard deviation (a measure of spread) = 0.08 seconds.

Blue:  average (mean) = 102.08 seconds (0.23 seconds faster per lap); standard deviation = 0.11 seconds (lap times have a narrower distribution).


We can now see how the Blue car won; over the space of 26 laps, it was faster, and more consistently faster too.  The difference between the two averages = 102.31 - 102.08 = 0.23 seconds, and this is over twice the standard deviation for the blue car (0.11 x 2 = 0.22).  Fortunately, most online testing tools will give you a measure of the confidence in your data, so you won't have to get your spreadsheet or calculator out and start calculating standard deviations manually.
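If you want to reproduce that check, here is a minimal sketch of the rule of thumb used here - call the gap meaningful if the two means are at least two standard deviations apart - using the laps 20-45 figures quoted above.  A more formal treatment would run a t-test on the raw lap times, but the rule of thumb is enough for this example.

def clearly_different(mean_a, sd_a, mean_b, sd_b):
    """Rule of thumb: the gap between the means is at least twice the larger spread."""
    gap = abs(mean_a - mean_b)
    return gap, gap >= 2 * max(sd_a, sd_b)

# Laps 20-45: red 102.31 s (sd 0.08), blue 102.08 s (sd 0.11)
gap, significant = clearly_different(102.31, 0.08, 102.08, 0.11)
print(f"Gap = {gap:.2f} s, clearly different: {significant}")   # 0.23 s, True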


Now, here's the question:  are you prepared to call the Blue car a clear winner, based on just part of the data?

Think about this in terms of the performance of an online test between two recipes, Blue and Red.  Would you have called the Red recipe a winner after 10-15 days/laps?  In the same way as a car and driver need time to settle down into a race (acceleration etc), your website visitors will certainly need time to adjust to a new design (especially if you have a high proportion of repeat visitors).  How long?  It depends :-)

In the story, the Red car had better acceleration from the start, but the Blue car had better brakes.  Maybe one of your test recipes is more appealing to first time visitors, but the other works better for repeat visitors, or another segment of your traffic.  Maybe you launched the test on a Monday, and one recipe works better on weekends?

So why did the results differ between laps 20-45 and laps 45-60?  Laps 20-45 are 'normal' conditions, whereas after lap 45 something changed - in the racing car story, it was the weather.  In the online environment, it could be a marketing campaign that you (or your competitors) just launched.  Maybe a new product, or the start of a national holiday, school holiday, or similar?  From that point onwards, the performance of the Blue recipe was comparable or identical to the Red.


So, who won?  The Blue car, since its performance in normal conditions was better.  It took time to settle down, but in a normal environment, it's 0.23 seconds faster per lap, with better than 95% confidence.  Would you deploy the equivalent Blue recipe in an online environment, or do you think it's cheating to deploy a winner that is better only during normal conditions, and merely comparable to the Red recipe during campaign periods?  :-)

Let's take a look at Anil Batra's post on testing and significance.  It's a much briefer article than mine (I apologise for the length, and thank you for your patience), but it does explain that you shouldn't stop a test too early.  The question that many people ask is:  how long do you let it run for?  And how do you know when you've got a winner (or when everything has gone flat)?  The short article has a very valid point:  don't stop too soon!

Next time - a closer, mathematical look at standard deviations, means and distributions, and how they can help identify a winner with confidence!  In the meantime, if you're looking for a more mathematical treatment, I recommend this one from the Online Marketing Tests blog.  I've also written a simple treatment of confidence and significance, and one which has a more mathematical approach to confidence.

Monday, 15 April 2013

A/B Testing: Where to Test?

You've bought the software, you've even read the manual and a few books or blogs about testing, and now you're ready to test.  Last time, I discussed how to design your test, and in this post, I'd like to look at where to test.  Which pages are you going to test on?  There's no denying that some tests are easier to build, develop and write the code for, and some pages will be trickier (especially if they're behind secure firewalls or if the page is largely hard-coded with little scope for inserting JavaScript), but there's definitely a group of pages that are good for testing.

Why?  Because an improvement in the financial performance of some of the key pages of your site will have a dramatic impact on the overall performance of your site.

Here are a few good examples of places where testing is likely to be financially productive:

1.  Test landing pages with a high bounce rate

Bounce rate is defined as the number of people who land on your site and then click away without visiting any other pages, divided by the total number who landed.  More technically, it's the number of single-page visits divided by the total number of entries.  Landing pages - especially your home page or a campaign landing page - are some of the most highly trafficked pages on your site.  For this reason, small improvements in bounce rate, or in click-through rates on landing page calls to action, will help to move your financials.  In particular, if your cost per acquisition is high, or the page has a high entrance rate combined with a high bounce rate, then improving page performance here will help improve your financial figures.

2.  Leaky funnels 

If you have a linear payment process (and who doesn't?) then you can monitor page-to-page conversion in a linear way.  If one page is "leaking" - i.e. people are leaving when they reach that particular page, then that's a definite area to look at.  Revisit the page yourself, and generate some ideas to help improve the page's performance.  Why are people leaving?  What's missing?  What's getting in the way?  Where are they going - are they leaving the site or going back to another page on your site?  Which page?  WHY?



3.  Test pages with high exit rates

People have to leave your site - it's a matter of fact.  The question is - are they leaving at appropriate exit points, or are they leaving too early?  Some pages on your site are destination pages, and that's not just the 'thank you for your order' page.    There are other pages where visitors are able to identify product features, find out what they want to know, or download a PDF.  These are all acceptable exit pages, and a high exit rate on these pages is probably not a bad thing.  Just to explain - the exit rate is the number of exits from a page, divided by the number of page views for the page, typically expressed as a percentage.

However, other pages are navigation pages - section pages, category pages, header pages, hub pages, whatever you choose to call them.  The page purpose here is to get people deeper into the site, and if people are leaving on these pages, then visitors are not fulfilling their visit purpose because the pages aren't working properly.   This is similar to the leaky funnel for a non-linear path, but in the same way, it indicates that something on the page isn't optimal.
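Before moving on, here is a minimal sketch of the two rates defined in points 1 and 3 above.  The page-level counts are invented for illustration.

def bounce_rate(single_page_visits, entries):
    """Single-page visits divided by total entries to the page."""
    return single_page_visits / entries

def exit_rate(exits, page_views):
    """Exits from the page divided by its total page views."""
    return exits / page_views

print(f"Bounce rate: {bounce_rate(4_200, 10_000):.1%}")   # 42.0%
print(f"Exit rate:   {exit_rate(1_500, 12_000):.1%}")     # 12.5%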

 4.  Test in response to customer comments. 

If you have a survey or feedback mechanism on your site, then take time to read the comments that your visitors have left.  Visitors won't necessarily answer your design questions, but their comments can support an existing test idea you have, shed light on an issue you've identified with your traffic analysis, or provide you with new test ideas.  And they aren't usually hesitant about telling you where the weaknesses in your site are, so be prepared to face some fierce criticism of your site.

The anonymity of a customer survey often leads some visitors to tell you exactly what they think about your site - so don't take it personally!  Comments will vary from 'Your site is great' through to 'Your site is dreadful', but may also take in 'I can't find the link to track my order' and 'I can't find spare batteries for my camera', which will help focus your testing efforts.

So, review your stats; check your campaign metrics and listen to what your customers are telling you - you're bound to find some ideas for improving your site, and for testing your own solutions to the problems you've found.  Would you agree?  Do you have other ways of generating test ideas?

In my next posts in this series, I intend to look at how long to run a test for and explain statistical significance, confidence and when to call a test winner.

Wednesday, 30 January 2013

Testing: Iterating or Creating?

"Let's run a test!" comes the instruction from senior management.  Let's improve this page's performance, let's make things better, let's try something completely new, let's make a small change...  let's do it like Amazon or eBay.  Let's run an A/B test.

In a future post, I'll cover where to test, what to test, and what to look for, but in this post, I'd like to cover how to test.  Are you going to test totally new page designs, or just minor changes to copy, text, calls-to-action and pictures?  Which is best?

It depends.  If you're under pressure to show an improvement in performance with your tests, such as fixing a broken sales funnel, then you are probably best testing small, steady changes to a page in a careful, logical and thoughtful way.  Otherwise, you risk seriously damaging your financial performance while the test is running, and not achieving a successful, positive result.  By making smaller changes in your test recipes, you are more likely to get performance that's closer to the original recipe - and if your plan and design were sound, then it should also be an improvement :-)



If you have less pressure on improving performance, and iterating seems irritating, then you have the opportunity to take a larger leap into the unknown - with the increased risk that comes with it.  Depending on your organisation, you may find that there's pressure from senior management to test a completely new design and get positive results (the situation worsens when they expect to get positive results with their own design, which gives no thought to prior learnings).  "Here, I like this, test it, it should win."  At least they're asking you to test it first, instead of just asking you to implement it.

Here, there's little thought to creating a hypothesis, or even iterating, and it's all about creating a new design - taking a large leap into the unknown, with increased risk.  Yes, you may hit a winner and see a huge uplift from changing all those page elements; painting the site green and including pictures of the products instead of lifestyle images, but you may just find that performance plummets.  It's a real leap into the unknown!


The diagram above represents the idea behind iterative and creative testing.  In iterative testing (the red line), each test builds on the ideas that have been identified and tested previously.  Results are analysed, recommendations are drawn up and then followed, and each test makes small but definite improvements on the previous.  There's slow but steady progress, and performance improves with time.

The blue line represents the climber jumping off his red line and out into the unknown.  There are a number of possible results here, but I've highlighted two.  Firstly the new test, with the completely untested design, performs very badly, and our climber almost falls off the mountain completely.  Financial performance is poor compared to the previous version, and is not suitable for implementation.  It may be possible to gain useful learnings from the results (and this may be more than, "Don't try this again!") but this will take considerable and careful analysis of the results.

Alternatively, your test result may accelerate you to improved performance and the potential for even better results - the second blue climber who has reached new heights.  It's worth pointing out at this stage that you should analyse the test results as carefully as if it had lost.  Otherwise, your win will remain an unknown and your next test may still be a disaster (even if it's similar to the new winner).  Look at where people clicked, what they saw, what they bought, and so on.  Just because your creative and innovative design won doesn't mean you're off the hook - you still need to work out why you won, just as carefully as if you'd lost.

So, are you iterating or creating?  Are you under pressure to test out a new design?  Are you able to make small improvements and show ROI?  What does your testing program look like - and have you even thought about it?

Tuesday, 31 May 2011

Web Analytics: What makes testing iterative?

What makes testing iterative?


When I was about eight or nine years old, my dad began to teach me the fundamentals of BASIC programming.  He'd been on a course, and learned the basics, and I was eager to learn - especially how to write games. One of the first programs he demonstrated was a simple game called Go-Karts.  The screen loads up:  "You are on a go-kart and the steering isn't working.  You must find the right letter to operate the brakes before you crash.  You have five goes.  Enter letter?"  You then enter a letter, and the program works out if the input is correct, or if it's before or after the letter you've entered.

"J"
"After J"
"P"
"Before P"
"L"

"L is correct - you have stopped your go-kart without crashing! Another game (Y/N)?"

I was reminded of this simple game during the Omniture Summit EMEA 2011 last week, when one of the breakout presenters started talking about testing, and in particular ITERATIVE testing.  Iterative testing should be the natural development from testing, and I've alluded to it in my previous posts about testing.  At its simplest, basic testing involves just comparing one version of a page, creative, banner, text or call-to-action (or whatever) against another and seeing which one works best.  Iterative testing works in a similar but more advanced way, much like my dad's Go-Karts game:  start with an answer which is close to the best, then build on that answer to develop something better still.  I've talked about coloured shapes as simplified versions of page elements in my previous posts on testing, so I guess it's time to develop a different example!

Suppose I've tested the five following page headlines, and achieved the following points scores (per day), running each one for a week, so that the total test lasted five weeks.

"Cheap hi-quality widgets on sale now" - 135 points
"Discounted quality widgets available now" - 180 points
"Cheap widgets reduced now" - 110 points
"Advanced widgets available now" - 210 points
 "Exclusive advanced widgets on sale now" - 200 points

What would you test next?

This question is the kind of open question which will help you to see whether you're doing iterative testing, or just basic testing.  What can we learn from the five tests that we've run so far?  Anything?  Or nothing?  Do we have to make another random guess, or can we use these results to guide us towards something that should do well?

Looking at the results from these preliminary tests, the best headline, "Advanced widgets available now" scored almost twice as many points per day as "Cheap widgets reduced now".  At the very worst, we should run with this high-performing headline, which is doing marginally better than the most recent attempt, "Exclusive advanced widgets on sale now."  This shouldn't pose a problem for a web development team - after all, the creative has already been designed and just needs to be re-published.  All that's needed is to admit that the latest version isn't as good as an earlier idea, and to go backwards in order to go forwards.

Anyway:  we can see that "Advanced..." got the best score, and is the best place to start from.  We can also see that the two lowest performing headlines include the word "Cheap" so this looks like a word to avoid.  From this, it looks like "Advanced widgets on sale now" and "Exclusive advanced widgets available now" are places to start from - we've eliminated the word 'cheap' and now we can look at how 'available now' compares to 'on sale now'.  This is the time for developing test variations on these ideas - following the general principles that have been established by the first round of testing. This is not the time for trying a whole set of new ideas; this would mean ignoring all the potential learning and starting to make sub-optimal decisions (as they're sometimes known).
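To make that reasoning concrete, here is a minimal sketch of mining the five results for word-level signals:  the average score of every headline that contains each word.  The scores are the points-per-day figures quoted above; the ranking is a guide for the next iteration, not a substitute for actually testing it.

from collections import defaultdict

results = {
    "Cheap hi-quality widgets on sale now": 135,
    "Discounted quality widgets available now": 180,
    "Cheap widgets reduced now": 110,
    "Advanced widgets available now": 210,
    "Exclusive advanced widgets on sale now": 200,
}

word_scores = defaultdict(list)
for headline, score in results.items():
    for word in headline.lower().split():
        word_scores[word].append(score)

# Average score of the headlines containing each word, best first:
# 'advanced' averages 205, 'cheap' averages 122.5 - so drop 'cheap'.
for word, scores in sorted(word_scores.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{word:12s} {sum(scores) / len(scores):6.1f}")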

Referring back to my earlier post, this is the time in the process for making hypotheses on your data.  I have to disagree with the speaker at the Omniture EMEA Summit, when she gave an example hypothesis as, "We believe that changing the page headline will drive more engagement with the site, and therefore better conversion."  This is just a theory.  A hypothesis says all that, and then adds, "because visitors read the page headline first when they see a page, and use that as a primary influencer to decide if the page meets their needs."

So, here's a hypothesis on the data:  "Including the word 'cheap' in our headline puts visitors off because they're after premium products, not inexpensive ones.  We need to focus on premium-type words because these are more attractive to our visitors."  In fact - as you can see, I've even added a recommendation after my hypothesis (I couldn't resist).

And that's the foundation of iterative testing - using what's gone before and refining and improving it.  Yes, it's possible that a later iteration might throw up results that are completely unexpected - and worse than before - but then that's the time to improve and refine your hypothesis.  Interestingly, the shallower hypothesis - "We believe that changing the page headline will drive more engagement with the site, and therefore better conversion." - will still hold true, precisely because it isn't specific enough.

Anyway, that's enough on iterative testing for now; I'm off to go and play my dad's second iteration of the Go-Karts game, which went something like, "You are floating down a river, and the crocodiles are gathering.  You must guess how many crocodiles there are in order to escape.  How many (1-50)?"