Web Optimisation, Maths and Puzzles: xchange


Showing posts with label xchange.

Friday, 11 July 2014

Is Multi-Variate Testing Really That Good?

The second discussion that I led at the Digital Analytics Hub in Berlin in June was entitled, "Is Multi-Variate Testing Really That Good?"  Although only a few delegates attended, it drew good participation from a range of analytical and digital professionals, and in this post I'll cover some of the key points.

- The number of companies using MVT is starting to increase, although the increase is very slow and adoption rates remain low. It's not as widespread as perhaps the tool vendors would suggest.

- The main barriers (real or perceived) to MVT are complexity (in design and analysis) and traffic volumes (multiple recipes require large volumes of traffic in order to get meaningful results in a useful timeframe).
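To put a rough number on the traffic barrier, here's a back-of-the-envelope sketch of my own (invented figures, using the common 16 * p * (1 - p) / delta^2 rule of thumb per recipe) showing how test duration grows with the number of recipes:

```python
# Rough sketch: how test duration grows with the number of recipes.
# Rule of thumb: n ~ 16 * p * (1 - p) / delta^2 visitors per recipe
# (roughly 80% power at 5% two-sided significance). This treats each recipe
# as if it were analysed on its own, so treat it as an upper bound rather
# than a proper power calculation.

def days_to_run(baseline_cr, relative_lift, daily_visitors, recipes):
    delta = baseline_cr * relative_lift                            # absolute difference to detect
    n_per_recipe = 16 * baseline_cr * (1 - baseline_cr) / delta ** 2
    return n_per_recipe * recipes / daily_visitors

# Invented figures: 3% conversion, detect a 10% relative lift, 20,000 visitors/day in the test.
for recipes in (2, 4, 8):
    print(f"{recipes} recipes: ~{days_to_run(0.03, 0.10, 20_000, recipes):.0f} days")
```

The scaling is linear, so halving the number of recipes roughly halves the duration - which is one reason a simple pilot design is so much easier to get off the ground.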

There is an inherent level of complexity in MVT, as I've mentioned before (and one day soon I will explain how to analyse the results), and the tool vendors seem to imply that the test design must also be complicated.  It doesn't have to be.  As I've mentioned in a previous post on MVT, the visual design of a multi-variate test does not have to be complicated; it can just involve a large number of small changes run simultaneously.

The general view during the discussion was that MVT would have to involve a complicated design with a large number of variations per element (e.g. testing a call-to-action button in red, green, yellow, orange and blue, with five different wordings).  In my opinion, this would be complicated even as an A/B/n test, so as an MVT it would be extremely complex and, to be honest, totally unsuitable for an entry-level test.

We spent a lot of our discussion time on pages and scenarios where MVT is totally unsuitable, such as site navigation.  A number of online sites have issues with large catalogues and navigation hierarchies, and it's difficult to decide how best to display the whole range of products - MVT isn't the tool to use here, and we discussed card-sorting, brainstorming and visualisations instead of A/B testing.  This was one of the key lessons for me - MVT is a powerful tool, but sometimes you don't need a powerful tool, you just need the basic one.  A power drill is never going to be good at cutting wood - a basic handsaw is the way to go.  It's all about selecting the right tool for the job.

Looking at MVT, as with all online optimisation programs, the best plan is to build up to a full MVT in stages, with initial MVT trials run as pilot experiments.  Start with something where the basic concept for testing is easy to grasp, even if the hypothesis isn't great.  The problem statement or hypothesis could be, "We believe MVT is a valuable tool and in order to use it, we're going to start with a simple pilot as a proof of concept."  And why not? :-)

Banners are a great place to start - after all, the marketing team spend a lot of money on them, and there's nothing quite as eye-catching as a screenshot of a banner in your test report documents and presentations.  They're also very easy to explain... let's try an example.  Three variables that can be tested are the gender of the model (man or woman), the wording of the banner text ("Buy now" vs "On Sale") and the colour of the text (black or red).

There are eight possible combinations in total; here are a few potential recipes:


[Banner images: Recipe A, Recipe B, Recipe C and Recipe D]

Note that I've tried to keep the pictures similar - the model is facing the camera, smiling, with a blurred background.  This may be a multi-variate test, but I'm not planning to change everything, and I'm keeping track of what I'm changing and what's staying the same!
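To make the eight combinations concrete, here's a small sketch (my own illustration, not part of the original test design) that enumerates the full factorial set of recipes from the three two-level variables:

```python
from itertools import product

# Three variables with two levels each: 2 x 2 x 2 = 8 possible recipes.
models   = ["man", "woman"]
wordings = ["Buy now", "On Sale"]
colours  = ["black", "red"]

for i, (model, wording, colour) in enumerate(product(models, wordings, colours)):
    print(f"Recipe {chr(ord('A') + i)}: model={model}, text='{wording}', colour={colour}")
```

Listing the recipes this way also makes it obvious which of them share each image and each piece of copy, which is what makes the re-use benefits below possible.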

Designing a test like this has considerable benefits: 
- it's easy to see what's being tested (no need to play 'spot the difference')
- you can re-use the same images for different recipes
- copywriters and merchandisers only need to come up with two lots of copy (which will be less than in an A/B/C/D test with multiple recipes).
- it's not going to take large numbers of recipes, and therefore is NOT going to require a large volume of traffic.

Some time soon, I'll explain how to analyse and understand the results from a multi-variate test, hopefully debunking the myths around how complicated it is.

Here's my series on Multi Variate Testing

Preview of Multi Variate testing
Web Analytics: Multi Variate testing 
Explaining complex interactions between variables in multi-variate testing
Is Multi Variate Testing an Online Panacea - or is it just very good?
Is Multi Variate Testing Really That Good (that's this article)
Hands on:  How to set up a multi-variate test
And then: Three Factor Multi Variate Testing - three areas of content, three options for each!

Image credits: 
man  - http://www.findresumetemplates.com/job-interview
woman - http://www.sheknows.com/living 


Tuesday, 24 June 2014

Why Does Average Order Value Change in Checkout Tests?

The first discussion huddle I led at the Digital Analytics Hub in 2014 looked at why average order value changes in checkout tests.  With such a specific title, it was not surprising that we wandered around the wider topics of checkout testing and online optimisation, and we covered a range of issues, tips, troubles and pitfalls of online testing.

But first:  the original question - why does average order value (AOV) change during a checkout test?  After all, users have completed their purchase selection, they've added all their desired items to the cart, and they're now going through the process of paying for their order.  Assuming we aren't offering upsells at this late stage, and we aren't encouraging users to continue shopping, or offering discounts, then we are only looking at whether users complete their purchase or not.  Surely any effect on order value should be just noise?

For example, if we change the wording for a call to action from 'Continue' to 'Proceed' or 'Go to payment details', then would we really expect average order value to go up or down?  Perhaps not.  But, in the light of checkout test results that show AOV differences, we need to revisit our assumptions.

After all, it's an oversimplification to say that all users are affected equally, irrespective of how much they're intending to spend.  We need to analyse conversion by basket value (cart value) to see how our test recipe has affected users in different price bands.  If conversion is affected equally across all price bands, then we won't see a change in AOV.  However, how likely is that?

There are other alternatives: perhaps there's no real pattern in the conversion changes - low-price-band, mid-price-band, high-price-band and ultra-high-price-band users show a mix of increases and decreases.  In that case, any overall AOV change is just noise, and the statistical significance of the change is low.

But let's suppose that the higher price-band users don't like the test recipe, and for whatever reason, they decide to abandon.  The AOV for the test recipe will go down - the spread of orders for the test recipe is skewed to the lower price bands.  Why could this be?  We discussed various test scenarios:

- maybe the test recipe missed a security logo?  Maybe the security logo was moved to make way for a new design addition - a call to action, or a CTA for online chat - a small change but one that has had significant consequences.

- maybe the test recipe was too pushy, and users with high-ticket items felt unnecessarily pressured or rushed?  Maybe we made the checkout process feel like an express checkout, and we inadvertently moved users to the final page too quickly.  For low-ticket items this isn't a problem - users want to move through with minimum fuss and feel as if they're making rapid progress.  Conversely, users who are spending a larger amount want to feel reassured by a steady checkout process which allows them to take time on each page without feeling rushed.

- sometimes we deliberately look to influence average order value - to get users to spend more and add another item to their order (perhaps it's batteries, or a bag, or the matching ear-rings, or a warranty).  No surprise, then, that average order value is influenced; sometimes it may even go down, because users felt we were being too pushy.

Here's how those changes might look as conversion rates per price band, with four different scenarios:

Scenario 1:  Conversion is improved uniformly across all price bands (low to very high), so we see a conversion lift and average order value is unchanged.

Scenario 2:  Conversion is decreased uniformly across all price bands; we see a conversion drop with no change in order value.

Scenario 3:  Conversion is decreased for low and medium price bands, but improved for high and very-high price bands.  Assuming equal order volumes in the baseline, this means that conversion is flat (the average is unchanged) but average order value goes up.

Scenario 4:  Conversion is improved selectively for the lowest price band, but decreases for the higher price bands.  Again, assuming there are similar order volumes (in the baseline) for each price band, this means that conversion is flat, but that average order value goes down.

There are various combinations that show conversion up/down with AOV up/down, but this is the mathematical and logical reason for the change.
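To illustrate that mathematical point with invented numbers (equal baseline order volumes per band, made-up typical order values), here's a short sketch showing how Scenario 4 pulls AOV down while overall conversion stays flat:

```python
# Illustration of how per-price-band conversion changes move average order value.
# All figures are invented: equal baseline traffic per band, typical order value per band.

bands = {            # band: (visitors seeing the recipe, typical order value)
    "low":       (10_000,  20.0),
    "medium":    (10_000,  60.0),
    "high":      (10_000, 150.0),
    "very high": (10_000, 400.0),
}

def summarise(conversion_by_band):
    orders = {b: visitors * conversion_by_band[b] for b, (visitors, _) in bands.items()}
    total_orders  = sum(orders.values())
    total_revenue = sum(orders[b] * value for b, (_, value) in bands.items())
    overall_cr = total_orders / sum(v for v, _ in bands.values())
    return overall_cr, total_revenue / total_orders        # conversion rate, AOV

baseline  = {"low": 0.040, "medium": 0.040, "high": 0.040, "very high": 0.040}
# Scenario 4: lowest band improves, higher bands decline.
scenario4 = {"low": 0.052, "medium": 0.040, "high": 0.036, "very high": 0.032}

for name, rates in (("baseline", baseline), ("scenario 4", scenario4)):
    cr, aov = summarise(rates)
    print(f"{name}: conversion {cr:.1%}, AOV £{aov:.2f}")
```

Running it shows conversion staying at 4.0% in both cases while AOV drops from around £157 to around £135 - nobody spent less per order, the mix of orders simply shifted towards the lower price bands.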

Explaining why this has happened, on the other hand, is a whole different story! :-)

Wednesday, 19 June 2013

Why is yesterday's test winner today's loser?

This post comes out of the xChange Berlin huddle which I led on 11 June 2013.  xChange is very different from most web analytics conferences - most conferences have speakers and presentations, but xChange is focused around web analytics professionals meeting and discussing in small workshop groups.  As the xChange website describes it:
"Expressly designed for enterprise analytics managers and digital marketing and measurement practitioners, X Change brings together top professionals in the field in a no-sales, all business, peer-to-peer environment for deep-dives into cutting edge online measurement topics."

At xChange Berlin 2013, I led two huddle groups - this was the first, entitled, "Why is yesterday's test winner today's loser?".  I haven't attributed the content here to any particular participant - this is just a summary of our discussions.  I should say now that the discussion was not even close to what I'd anticipated, but was even more interesting as a result!


The discussion kicked off with a review of a test win.  
Let's suppose that you have run your A/B test, and you have a winner.  You ran it for long enough to achieve statistical significance and even achieved consistent trend lines.  But somehow, when you implemented it, your financial metrics didn't show the same level of improvement as your test results.  And now, the boss has come to your desk to ask if your test was really valid.  "What happened?  Why is yesterday's test winner today's loser?"
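As an aside, if you want to sanity-check that original "win" yourself rather than rely on the tool's report, a minimal two-proportion z-test sketch looks something like this (a standard textbook approach, not necessarily what your testing software does internally):

```python
from math import sqrt, erf

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for a difference in conversion rate between two recipes."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided p-value
    return z, p_value

# Invented example: control 2,000 orders from 50,000 visitors vs test 2,150 from 50,000.
z, p = two_proportion_z_test(2_000, 50_000, 2_150, 50_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With the invented figures above the lift is significant at the 5% level - and yet, as the discussion showed, that on its own doesn't guarantee the uplift survives implementation.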

There are a number of reasons for this - let's take a look.

External factors
Yes, A/B tests split your traffic evenly between the test recipes, so that most external factors are accounted for.  But what happens if your test was running while you had a large-scale TV campaign, or a display or PPC campaign?  Yes, that traffic would have been split between your test recipes, so the effect is - apparently - mitigated.  But what if the advertising campaign resonated with your test recipe, which went on to win?  During the non-campaign period, the control recipe might have done better, or perhaps the results would have been more similar.  Consequently, the uplift that you saw during the test would not be achieved in normal conditions.

Customer Experience Changes
When we start a test, there is quite often a dip in performance for the test recipe.  It's new.  It's unfamiliar and users have to become accustomed to it.  It often takes a week or so for visitors to get used to it, and for accurate, meaningful and useful test results to develop.  In particular, frequent repeat visitors will take some time to adjust to the changes (how often repeat visitors return will depend on your site).  The same issue applies when you implement a winner - now, the whole population is seeing a new design, and it will take some time for them to adjust.

Visitor Segments
Perhaps the test recipe worked especially well with a particular visitor segment - maybe new visitors, or search visitors, or visitors from social media - and that was responsible for the uplift.  You have assumed (one way or another) that your population profile is fairly constant.  But if your test recipe won because one or two segments really engaged with it, then you may not see the uplift if your population profile changes.  What should you do instead?  Set up a targeted implementation: based on your test results, show the test recipe only to visitors who fit the segments that engaged more (or converted better) with it, and show everybody else the usual version of your site.  I'll discuss targeting again at a later date, but here's a post I wrote a few months ago about online personalisation.
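A simple way to spot this situation is to break the test results down by segment before declaring an overall winner; here's a rough sketch with invented segment names and figures:

```python
# Sketch: compare lift per visitor segment to see whether one segment drove the win.
# All segment names and numbers are invented for illustration.

segments = {
    # segment: (control visitors, control orders, test visitors, test orders)
    "new visitors":    (20_000,   600, 20_000,   760),
    "repeat visitors": (25_000, 1_000, 25_000, 1_010),
    "social traffic":  ( 5_000,   100,  5_000,   104),
}

for name, (cv, co, tv, to) in segments.items():
    control_cr = co / cv
    test_cr = to / tv
    lift = (test_cr - control_cr) / control_cr
    print(f"{name:16s} control {control_cr:.2%}  test {test_cr:.2%}  lift {lift:+.1%}")
```

If one segment shows a large lift while the others are flat, that's the signal to consider a targeted implementation rather than a site-wide rollout.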

Time lapse between test win and implementation
This varied across the members of the group - where a company has a test plan and there's pressure to get the next test up and running, it may not be possible to implement a winner straight away.  It also depends on what's being tested - can the test recipe be implemented immediately through the site team or CMS, or will it require IT roadmap work?  Most of the group would use the testing software (for example, Test and Target, or Visual Website Optimiser) to immediately set a winning recipe to 100% of traffic (or 95%) until the change could be made permanent.  Setting a winning recipe to 95% instead of 100% in effect enables the test to keep running - you can continue to show that the test recipe is winning.  It also means that visitors who were in the control group during the test (i.e. saw "Recipe A") will continue to see that recipe until the implementation is complete - a better customer experience for that group?  Something to think about!
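For the 95%/100% approach, the underlying idea is just consistent bucketing with a small hold-back group; a minimal sketch (illustrative only - not how Test and Target or Visual Website Optimiser actually implement it) might look like this:

```python
import hashlib

def assign_recipe(visitor_id: str, winner_share: float = 0.95) -> str:
    """Deterministically bucket a visitor so the same ID always sees the same recipe."""
    digest = hashlib.md5(visitor_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF      # pseudo-uniform value in [0, 1]
    # Visitors hashing above the winner share form the 5% hold-back, who keep
    # seeing the old control (Recipe A) until the permanent implementation lands.
    return "winning recipe" if bucket < winner_share else "control (Recipe A)"

print(assign_recipe("visitor-12345"))
```

The hold-back group doubles as a continuing measurement of the winner's uplift while the permanent change works its way through the IT roadmap.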

My next post will be about the second huddle that I led, which was based on iterating vs creating.  The title came from my recent blog post on iterative testing, but the discussion went in a very different direction, and again, was better for it!