
Thursday, 12 March 2015

Pitfalls of Online Optimisation and Testing 3: Discontinuous Testing

Some forms of online testing are easy to set up, easy to measure and easy to interpret.  The results from one test point clearly to the next iteration, and you know exactly what's next.  For example, if you're testing the height of a banner on a page, or the size of the text that you use on your page headlines, there's a clear continuous scale from 'small' to 'medium' to 'large' to 'very large'.  You can even quantify it, in terms of pixel dimensions.  With careful testing, you can identify the optimum size for a banner, or text, or whatever it may be.  I would describe this as continuous testing, and it lends itself perfectly to iterative testing.

Some testing - in fact, most testing - is not continuous.  You could call it discrete testing, or digital testing, but I think I would call it discontinuous testing.

For example:
- colours (red vs green vs black vs blue vs orange...)
- title wording ("Product information" vs "Product details" vs "Details" vs "Product specification")
- imagery (man vs woman vs family vs product vs product-with-family vs product alone)

Both forms of testing are, of course, perfectly valid.  The pitfall comes when trying to iterate on discontinuous tests, or trying to present results, analysis and recommendations to management.  The two forms can become confused, and unless you have a strong clear understanding of what you were testing in the first place - and WHY you tested it - you can get sidetracked into a testing dead-end. 


For example, let's say that you're testing how to show product images on your site.  There are countless ways of doing this, but let's take televisions as an example.  Top right is an image borrowed from the Argos website; below right is one from Currys/PC World.  The televisions are different, but that's not relevant here; I'm just borrowing the screenfills and highlighting them as the main variable.  In 'real life' we'd test the screenfills on the same product.

Here's the basis of a straightforward A/B test - on the left, "City at Night" and on the right, "Winter Scene".  Which wins?  Let's suppose, for the sake of argument, that the success metric is click-through rate, and "City at Night" wins.  How would you iterate on that result, and go for an even better winner?  It's not obvious, is it?  There are too many jumps between the two recipes - it's discontinuous, with no gradual change from city to forest.

The important thing here (I would suggest) is to think beforehand about why one image is likely to do better than the other, so that when you come to analyse the results, you can go back to your original ideas and determine why one image won and the other lost.  In plain English:  if you're testing "City at Night" vs "Winter Scene", then you may propose that "Winter Scene" will win because it's a natural landscape vs an urban one.  Or perhaps "City at Night" is going to win because it showcases a wider range of colours.  Setting out an idea beforehand will at least give you some guidance on how to continue.

However, this kind of testing is inherently complex - there are a number of reasons why "City at Night" might win:
- more colours shown on screen
- showing a city skyline is more high-tech than a nature scene
- stronger feeling of warmth compared to the frozen (or should that be Frozen?) scene

In fact, it's starting to feel like a two-recipe multi-variate test; our training in scientific testing says, "Change one thing at a time!" and yet in two images we're changing a large number of variables.  How can we unpick this mess?

I would recommend testing at least two or three test recipes against control, to help you triangulate and narrow down the possible reasons why one recipe wins and another loses. 

Shown on the right are two possible examples for a third and fourth recipe which might start to narrow down the reasons, and increase the strength of your hypothesis.
  
 
If the hypothesis is that "City at Night" did better because it's an urban scene instead of a natural one, then "City in Daylight" (top right) may do even better.  This has to be discontinuous testing - we can't test a continuous range of urbanisation levels, so we have to test discrete steps along the way in isolation.

Alternatively, if "City at Night" did better because it showcased more colours, then perhaps "Mountain View" would do better.  And if "Mountain View" outperforms "Winter Scene", where the main difference is the apparent temperature of the scene (warm vs cold), then warmer scenes do better, and a natural follow-up would be a view of a Caribbean holiday resort.  And there you have it - almost without realising it, the test results are now pointing towards an iteration with further potential winners.

By selecting the test recipes carefully and thoughtfully and deliberately aiming for specific changes between them, it's possible to start to quantify areas which were previously qualitative.  Here, for example, we've decided to focus (or at least try to focus) on the type of scene (natural vs urban) and on the 'warmth' of the picture, and set out a scale from frozen to warm, and from very natural to very urban.  Here's how a sketch diagram might look:
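Something like this minimal Python (matplotlib) sketch, for instance - the warmth and urbanisation scores here are entirely hypothetical judgement calls of mine, not measured values:

import matplotlib.pyplot as plt

# Hypothetical positions for each recipe on the two qualitative axes,
# scored by eye on a -5 (frozen / natural) to +5 (warm / urban) scale.
recipes = {
    "Winter Scene":     (-4, -4),
    "Mountain View":    ( 2, -3),
    "City at Night":    ( 1,  4),
    "City in Daylight": ( 3,  4),
}

fig, ax = plt.subplots()
for name, (warmth, urban) in recipes.items():
    ax.scatter(warmth, urban)
    ax.annotate(name, (warmth, urban), textcoords="offset points", xytext=(5, 5))

ax.axhline(0, color="grey", linewidth=0.5)   # draw the two axes through the origin
ax.axvline(0, color="grey", linewidth=0.5)
ax.set_xlabel("Frozen  <--  warmth of scene  -->  Warm")
ax.set_ylabel("Natural  <--  urbanisation  -->  Urban")
ax.set_title("Test recipes plotted on two qualitative axes")
plt.show()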



Selecting the images and plotting them in this way gives us a sense of direction for future testing.  If the city scenes both outperform the natural views, then try another urban scene which - for example - has people walking on a busy city street.  Try another recipe set in a park area - medium population density - just to check the original theory.  Alternatively, if the city scenes both perform similarly, but the mountain view is better than the winter scene (as I mentioned earlier), then try an even warmer scene - palm trees and a tropical view.

If they all perform much the same, then it's time to try a different set of axes (temperature and population density don't seem to be important here, so it's time to start brainstorming... perhaps pictures of people, or racing cars, are worth testing?).

Let's take another example:  on-page text.  How much text is too much text, and what should you say? How should you greet users, what headlines should you use?  Should you have lengthy paragraphs discussing your product's features, or should you keep it short and concise - bullet points with the product's main specifications?

Which is better, A or B?  And (most importantly) - why?  (Blurb borrowed and adapted from Jewson Tools)


A: 

Cordless drills give you complete flexibility without compromising on power or performance.  We have a fantastic range, from leading brands such as AEG, DeWalt, Bosch, Milwaukee and Makita.  This extensive selection includes tools with various features including adjustable torque, variable speeds and impact and hammer settings. We also sell high quality cordless sets that include a variety of tools such as drills, circular saws, jigsaws and impact drivers. Our trained staff in our branches nationwide can offer expert technical advice on choosing the right cordless drill or cordless set for you.

B:
* Cordless drills give you complete flexibility without compromising on power or performance.
* We stock AEG, DeWalt, Bosch, Milwaukee and Makita
* Selection includes drills with adjustable torque, variable speeds, impact and hammer settings
* We also stock drills, circular saws, jigsaws and impact drivers
* Trained staff in all our stores, nationwide


If A were to win, would it be because of its readability?  Is B too short and abrupt?  Let's add a recipe C and triangulate again:

C:
* Cordless drills - complete flexibility

* Uncompromised performance with power
* We stock AEG, DeWalt, Bosch, Milwaukee and Makita
* Features include adjustable torque, variable speed, impact and hammer settings
* We stock a full range of power tools
* Nationwide branches with trained staff

 C is now extremely short - reduced to sub-sentence bullet points.  By isolating one variable (the length of the text) we can hope to identify which is best - and why.  If C wins, then it's time to start reducing the length of your copy.  Alternatively, if A, B and C perform equally well, then it's time to take a different direction.  Each recipe here has the same content and the same tone-of-voice (it just says less in B and C); so perhaps it's time to add content and start to test less versus more.


D:
* Cordless drills - complete flexibility with great value

* Uncompromised performance with power
* We stock AEG, DeWalt, Bosch, Milwaukee and Makita
* Features include adjustable torque, variable speed, impact and hammer settings
* We stock a full range of power tools to suit every budget
* Nationwide branches with trained and qualified staff to help you choose the best product
* Full 30-day warranty
* Free in-store training workshop  

E: 
* Cordless drills provide complete flexibility

* Uncompromised performance
* We stock all makes

* Wide range of features

* Nationwide branches with trained staff

In recipe D, the copy has been extended to include 'great value', 'suit every budget', training and warranty information - the hypothesis would be that more is more, and that customers want this kind of after-sales support.  Maybe they don't - maybe your customers are complete experts in power tools, in which case you'll see flat or negative performance.  In recipe E, the copy has been cut to the minimum - are readers engaging with your text at all, or is it just there to provide context to the product imagery?  Do they already know what cordless drills are and what they do, and are they just buying another one for their team?

So, to sum up:  it's possible to apply scientific and logical thinking to discontinuous testing - the grey areas of optimisation.  I'll go for a Recipe C/E approach to my suggestions:

*  Plan ahead - identify variables (list them all)

*  Isolate variables as much as possible and test one or two
*  Identify the differences between recipes 
*  Draw up a continuum on one or two axes, and plot your recipes on it
*  Think about why a recipe might win, and add another recipe to test this theory (look at the continuum)

The articles in the Pitfalls of Online Optimisation and Testing series

Article 1:  Are your results really flat?
Article 2: So your results really are flat - why?  
Article 3: Discontinuous Testing


Wednesday, 11 February 2015

Pitfalls of Online Optimisation and Testing 2: Spot the Difference

The second pitfall in online optimisation that I would like to look at is why we obtain flat results - totally, completely flat results at all levels of the funnel.  All metrics show the same results - bounce rate, exit rate, cart additions, average order value, order conversion. There is nothing to choose between the two recipes, despite a solid hypothesis and analytics which support your idea.

The most likely cause is that the changes you made in your test recipe were just not dramatic enough.  There are different types of change you could test:
 
*  Visual change (the most obvious) 
*  Path change (where do you take users who click on a "Learn more" link?)
*  Interaction change (do you have a hover state? Is clicking different from hovering? How do you close a pop-up?)


Sometimes, the change could be dramatic but the problem is that it was made on an insignificant part of the site or page.  If you carried out an end-to-end customer journey through the control experience and then through the test experience, could you spot the difference?  Worse still, did you test on a page which has traffic but doesn't actively contribute to your overall sales (is its order participation virtually zero)?
Is your hypothesis wrong? Did you think the strap line was important? Have you in fact discovered that something you thought was important is being overlooked by visitors?
Are you being too cautious - is there too much at stake and you didn't risk enough? 

Is that part of the site getting traffic?  And does that traffic convert?  Or is it just a traffic backwater or a pathing dead end?  It could be that you have unintentionally uncovered an area of your site which is not contributing to site performance.

Do your success metrics match your hypothesis? Are you optimising for orders on your Customer Support pages? Are you trying to drive down telephone sales?
Some areas of the site are critical, and small changes make big differences.  On the other hand, some parts of the site are like background noise that users filter out (which is a shame when we spend so much time and effort selecting a typeface, colour scheme and imagery which supports our brand!).  We agonise over the photos we use on our sites, we select the best images and icons... and they're just hygiene factors that users barely glance at.  Then there are the parts that are critical - persuasive copy, clear calls to action, product information and specifications.  What we need to know, and can find out through our testing, is what matters and what doesn't.

Another possibility is that you made two counter-acting changes - one improved conversion, and the other worsened it, so that the net change is close to zero.  For example, did you make it easier for users to compare products by making the comparison link larger, but put it higher on the page, which pushed other important information to a lower position where it wasn't seen?  I've mentioned this before in the context of landing page bounce rate - it's possible to improve the click-through rate on an email or advert by promising huge discounts and low prices... but if the landing page doesn't reflect those offers, then people will bounce off it alarmingly quickly.  This should show up in funnel metrics, so make sure you're analysing each step in the funnel, not just cart conversion (user added an item to cart) and order conversion (user completed a purchase).
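To make the 'analyse each step' point concrete, here's a minimal Python sketch - the funnel steps and visitor counts are purely illustrative, not real data:

# Step-by-step funnel conversion for each recipe (counts are made up).
funnel = {
    "control": {"landing": 10000, "product page": 6200, "cart": 930, "order": 310},
    "test":    {"landing": 10000, "product page": 7100, "cart": 880, "order": 305},
}

for recipe, steps in funnel.items():
    print(recipe)
    names, counts = list(steps), list(steps.values())
    for prev, curr, prev_n, curr_n in zip(names, names[1:], counts, counts[1:]):
        print(f"  {prev} -> {curr}: {curr_n / prev_n:.1%}")
    print(f"  end-to-end: {counts[-1] / counts[0]:.2%}")

In this made-up example the end-to-end conversion is almost identical for both recipes, but the test recipe gains clicks into the product page and then loses them again at the cart - exactly the kind of counter-acting pattern described above.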


Alternatively:  did you help some users, but deter others?  Segment your data - new vs returning, traffic source, order value...  did everybody from all segments perform exactly as they did previously, or did the new visitors benefit from the test recipe, while returning visitors found the change unhelpful?
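If you have visit-level data to hand, a quick segmentation check might look something like this pandas sketch (the column names and values are hypothetical, just to show the shape of the analysis):

import pandas as pd

# Hypothetical visit-level data: one row per visit, converted is 0 or 1.
visits = pd.DataFrame({
    "recipe":       ["control", "test", "control", "test", "control", "test"],
    "visitor_type": ["new", "new", "new", "returning", "returning", "returning"],
    "converted":    [0, 1, 0, 0, 1, 0],
})

segment_view = (
    visits.groupby(["visitor_type", "recipe"])["converted"]
          .agg(conversion_rate="mean", visits="size")
          .reset_index()
)
print(segment_view)

A lift for new visitors that is cancelled out by a drop for returning visitors will show up in a view like this, even when the blended result looks completely flat.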

In conclusion, if your results are showing you that your performance is flat, that's not necessarily the same as 'nothing happened'.  If it's true that nothing happened, then you've proved something different - that your visitors are more resilient (or perhaps resistant) to the type of change you're making.  You've shown that the area you've tested, and the way you've tested it, don't matter to your visitors.  Drill down as far as possible to understand if you've genuinely got flat results, and if you have, you can either test much bigger changes on this part of the site, or stop testing here completely, and move on.

The articles in the Pitfalls of Online Optimisation and Testing series

Article 1:  Are your results really flat?
Article 2: So your results really are flat - why?  
Article 3: Discontinuous Testing

Monday, 9 February 2015

Reviewing Manchester United Performance - Real Life KPIs Part 2

A few weeks have passed since my last review of Manchester United's performance in this year's Premier League, so it's time for an update.  An overview of the season so far reveals some interesting facts:

Southampton went to third position in mid-January, following their win at Old Trafford.  Southampton finished eighth last season, and 14th in the season before that.  This is their first season under new manager Ronald Koeman.  Perhaps some analysis of his performance is needed - another time, maybe. :-)

Southampton enjoyed their first win in 27 years in the league at Old Trafford on 11 January.  Their fifteen previous visits were two draws (1999, 2013) and thirteen wins for Manchester United. Conversely, United had won their last five at home and missed out on the chance for a ninth win in the league – which was their total for home wins in the whole of last season.

So let's take a look at Louis Van Gaal's performance, as at 9 February 2015, and compare it, as usual, with David Moyes (the 'chosen one'), Alex Ferguson (2012-13) and Alex Ferguson (1986-87, his first season).


Horizontal axis - games played
Vertical axis - cumulative points
Red - AF 2012-13
Pink - AF 1986-87
Blue - DM 2013-14
Green - LVG 2014-15 (ongoing)

The first thing to note is that LVG has improved his performance recently, and is now back above the blue danger line (David Moyes' performance in 2013-14, which is the benchmark for 'this will get you fired').

However, LVG's performance is still a long way below the red line left by Alex Ferguson in his final season, so let's briefly investigate why.


Under LVG, Manchester United have drawn 33% of their league games this season, compared to just 13% for Alex Ferguson's 2012-13 season.  This doesn't include the goal-less draw against Cambridge United in the FA Cup, which is a great example of Man Utd not pressing home their apparent advantage (Man Utd won the rematch 3-0 at Old Trafford). Yesterday (as I write), Manchester United scraped a draw against West Ham by playing the 'long-ball game', criticised after the match by West Ham's manager, Sam Allardyce.  West Ham are currently eighth in the table, four places behind Man Utd.

Interestingly, Moyes and Van Gaal have an identical win rate of 50%.  It might be suggested that Van Gaal's issue is not converting enough draws into wins; this is a slightly better problem to have compared to Moyes' problem, which was not holding on to enough draws and subsequently losing.  In football terms, Van Gaal needs to teach his team to more effectively 'park the bus'.

Is Louis Van Gaal safe?  According to the statistics alone, yes, he is, for now.  He's securing enough draws to keep him above the David Moyes danger line, and he's achieving more wins than Alex Ferguson did in his first season.  However, his primary focus must be to start converting draws into wins.  I haven't done the full match analysis to determine if that means scoring more or holding on to the lead once he has it - perhaps that will come later.

Is Louis Van Gaal totally safe?  That depends on whether the staff at Man United think that a marginal improvement on last season's performance is worth the £59.7m spent on Angel Di Maria, £29m on Ander Herrera, and £27m on Luke Shaw (plus others).  £120m for a few more draws in the season is probably not seen as good value for money.

Monday, 2 February 2015

Sum to infinity: refuelling aircraft

I recently purchased a copy of, "The Rainbow Book of BASIC Programs", a hardback book from 1984 featuring the BASIC text for a number of programs for readers to type into their home computers.  I'll forego the trip down memory lane to the time when I owned an Acorn Electron, and move directly to one of the interesting maths problems in the book.

Quoting from the book: "You are an air force general called upon to plan how to ferry emergency supplies to teams of men in trouble at various distances from your home base.  However, one of the conditions is that your planes do not have the capability to reach the destination directly, which is always outside their maximum range.  Nor are they able to land and refuel en route.  Ace pilot Rickenbacker suggests that mid-air refuelling might provide the solution.

"'Just give me a squadron of identical planes,' he tells you. 'During the flight the point will come when the entire fuel supply in one plane will be just enough to fully refill all the others.  The empty plane then drops away and the rest continue.  At the next refuelling point another plane tops up all the others leaving the full planes to continue.  The squadron keeps going in this way until only one plane remains. It uses its last drop of fuel to get to the destination with the emergency supplies.'"

So, having provided his ingenious solution, you are left with your home computer to solve the problem:  how many aircraft will it take to double or triple the maximum range of one aircraft?  And how many aircraft will it take to extend the maximum range of an airbase by six times?

To tackle this problem, I'll start with a few simple examples and look for any patterns.

Let's assume that the maximum range of the aircraft is r, and let's say that it's 100 miles.

With two aircraft: when both aircraft reach half empty (50 miles) the second aircraft refuels the prime aircraft, which then travels a further 100 miles, so the total is 150 miles (r + r/2).
Second aircraft refuels prime aircraft when it has used up 1/2 fuel

With three aircraft: the third aircraft will have enough fuel to fully refuel the others when each aircraft has used up a third of its fuel.  It will transfer a third of its capacity to the second aircraft, and a third to the prime aircraft.  We then follow on with the two-aircraft case shown above.  Total distance covered is 33⅓ miles (to the first refuelling point), then 50 miles (with two aircraft), and then the prime aircraft travels the last 100 miles alone.  The total is 183⅓ miles (r + r/2 + r/3).
Third aircraft refuels second and prime aircraft when all have used up 1/3 fuel
And the final example: four aircraft.  In this case, the fourth aircraft will transfer its fuel to the other three after they've all used up a quarter of their fuel.  This fourth plane will add 25 miles (r/4) to the overall total.

Fourth aircraft refuels three aircraft when all aircraft have used up 1/4 fuel

So we can see that the nth plane adds r/n to the total distance: the first plane adds r/1, the second adds r/2, then r/3, r/4 ... r/n.

Total range with n aircraft = r x (1 + 1/2 + 1/3 + ... + 1/n)

This series is known as the harmonic series - a well-studied mathematical series whose properties are well known.

The most surprising property (to me, and apparently many other people) of the sum of the harmonic series is that it doesn't converge: it doesn't get closer and closer to a fixed total. Instead, it keeps growing and growing, just more and more slowly.  If the squadron had enough aircraft, it could reach any distance necessary. Each additional aircraft adds less and less to the overall total, but the total continues to increase.

It may be counter-intuitive to find that the total doesn't reach a limit, but there are several proofs that it keeps growing, the earliest attributed to Nicole d'Oresme (circa 1323-1382).  His argument groups the terms - 1/3 + 1/4 is at least 1/2, 1/5 + 1/6 + 1/7 + 1/8 is at least 1/2, and so on - so the sum keeps gaining at least another half, forever.

So, to answer the original question: how many aircraft will it take to double the range of one aircraft? 
1 + 1/2 + 1/3 + 1/4 = 2.083

Which means it will take four aircraft (including the prime aircraft) to double the range of the prime aircraft.

And to triple the range of one aircraft?  


1+ 1/2 +1/3 + 1/4 + 1/5 + 1/6 + 1/7 + 1/8 + 1/9 + 1/10 + 1/11 = 3.0199
Eleven aircraft!

And to extend the range to six times the initial range?
1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7 ... ... + 1/226 + 1/227 = 6.0044

Scarily, to increase the range of a 100 mile aircraft to 600 miles would take 227 aircraft (including the prime aircraft).  It also gets ridiculous in practice: with that many planes, the early refuelling points are absurdly close together - the first comes after only r/227, less than half a mile into a 100-mile flight.
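For anyone who wants to check the sums without typing the original BASIC listing back in, here's a short Python sketch that finds how many aircraft are needed for any range multiple:

# How many aircraft (including the prime aircraft) are needed to reach
# a given multiple of a single aircraft's maximum range?
def aircraft_needed(range_multiple):
    total, n = 0.0, 0
    while total < range_multiple:
        n += 1
        total += 1.0 / n        # the nth aircraft adds r/n to the total range
    return n, total

for multiple in (2, 3, 6):
    n, total = aircraft_needed(multiple)
    print(f"{multiple}x range: {n} aircraft (harmonic sum = {total:.4f})")

# Prints 4 aircraft (2.0833), 11 aircraft (3.0199) and 227 aircraft (6.0044),
# matching the figures above.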

So this is clearly a theoretical exercise:  the instantaneous refuelling is tricky enough to believe, but the rapid usage of aircraft (and the 'falling away' to the ground) is just wasteful!

Other similar posts that I've written

My Life in 10 Computer Games (including a few that I played on the Acorn Electron and BBC Micro)
Sums to Infinity and Refuelling Aircraft (this one)
The BBC Micro and Sums of 2^n
Walking around the edges of a cube (based on a BBC Micro puzzle game called "L")

Tuesday, 27 January 2015

Pitfalls of Online Optimisation and Testing 1: Are your results really flat?

I've previously covered the trials of starting and maintaining an online optimisation program, and once you've reached a critical mass it seems as if the difficulties are over and it's plain sailing. Each test provides valuable business insights, yields a conversion lift (or points to a future opportunity) and you've reached a virtuous cycle of testing and learning. Except when it doesn't. There are some key pitfalls to avoid, or, having hit them, to conquer.

1. Obtaining flat results (a draw)
2. Too little meaningful data
3. Misunderstanding discrete versus continuous testing

The highest-scoring draw in Premier League history was a 5-5 draw between West Bromwich Albion and Manchester United in May 2013.  Just last weekend, the same mighty Manchester United were held to a goalless draw by Cambridge United, a team three divisions below them in the English league, in an FA Cup match.  Are these games the same?  Are the two sides really equal?  In both games, both teams performed equally, so on face value you would think so (and perhaps they are; Manchester United are really not having a great season).  It's time to consider the underlying data to really extract an unbiased and fuller story of what happened (the Cambridge press recorded it as a great draw; one Manchester-based website saw it slightly differently).

Let's look at the recent match between Cambridge and Manchester United, borrowing a diagram from Cambridge United's official website.

One thing is immediately clear: Cambridge didn't score any goals because they didn't get a single shot on target.  Manchester United, on the other hand, had five shots on target but a further ten that missed - only 33% of their shots were heading for the goal.  Analysis of the game would probably indicate that these were long-range shots, as Cambridge kept Manchester United at a 'safe' distance from their goal.  Although this game was a goalless draw, it's clear that the two sides have different issues to address if they are to score in the replay next week.

Now let's look at the high-scoring draw between West Brom and Man Utd.  Which team was better and which was lucky to get a single point from the game? In each case, it would also be beneficial to analyse how each of the ten goals was scored - that's ten goals (one every nine minutes on average) which is invaluable data compared to the goalless draw.

The image on the right is borrowed from the Guardian's website, and shows the key metrics for the game (I've discussed key metrics in football matches before).  What can we conclude?

- it was a close match, with both teams seeing similar levels of ball possession.

- West Brom achieved 15 shots in total, compared to just 12 for Man Utd

- If West Brom had been able to improve the quality and accuracy of their goal attempts, they may have won the game. 

- For Man Utd, the problem was not the quality of their goal attempts (they had 66% accuracy, compared to just over 50% for West Bromwich) but the quantity of them.  Their focus should be creating more shooting opportunities.


- As a secondary metric, West Brom should probably look at the causes for all those fouls.  I didn't see the game directly, but further analysis and study would indicate what happened there, and how the situation could be improved.


There is a tendency to analyse our losing tests to find out why they lost (if only so we can explain it to our managers), and with thorough planning and a solid hypothesis we should be able to identify why a test did not do well.  It's also human nature to briefly review our winners so that we can see if we can do even better in future.  But draws? They get ignored and forgotten - the test recipe had no impact and is not worth pursuing. Additionally, it didn't lose, so we don't apply the same level of scrutiny that we would if it had suffered a disastrous defeat. If wins are green and losers are red, then somehow the draws just fade to grey.  However, it shouldn't be the case.

So what should we look for in our test data?  Firstly - revisit the hypothesis.  You expected to see an overall improvement in a particular metric, but that didn't happen: was this because something happened in the pages between the test page and the success page?  For example, did you reduce a page's exit rate by apparently improving the relevance of the page's banners, only to lose all the clickers on the very next page instead - the net result is that order conversion is flat, but the story needs to be told more thoroughly. Think about how Manchester United and Cambridge United need different strategies to improve their performance in the next match.

But what if absolutely all the metrics are flat?  There's no real change in exit rate, bounce rate, click-through rate, time on page... or any other page metric or sales figure you care to mention?  It is quite likely that the change you tested was not significant enough.  The change in wording, colour, design or banner that you made just wasn't dramatic enough to affect your visitors' perceptions and intentions.  There may still be something useful to learn from this: your visitors aren't bothered whether your banners feature a picture of your product or a family photo, a single person or a group of people... or whichever it may be.
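One way to check whether a 'flat' result is genuinely flat, rather than simply under-powered, is a quick two-proportion z-test on the conversion counts.  Here's a sketch using only Python's standard library - the visit and order counts are placeholders, not real data:

import math

def two_proportion_z(conversions_a, visits_a, conversions_b, visits_b):
    # z-test for the difference between two conversion rates
    rate_a, rate_b = conversions_a / visits_a, conversions_b / visits_b
    pooled = (conversions_a + conversions_b) / (visits_a + visits_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
    return z, p_value

z, p = two_proportion_z(310, 10000, 305, 10000)   # placeholder counts
print(f"z = {z:.2f}, p = {p:.2f}")

A large p-value here only tells you that no difference is detectable at this sample size - which is not quite the same as saying the change made no difference at all.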

FA Cup matches have the advantage of a replay before there's extra time and penalties (the first may be an option for a flat test, the second sounds interesting!), so we're guaranteed a re-test, more data and a definite result in the end - something we can all look for in our tests.

The articles in the Pitfalls of Online Optimisation and Testing series

Article 1:  Are your results really flat?
Article 2: So your results really are flat - why?  
Article 3: Discontinuous Testing

Thursday, 18 December 2014

Buy a Lego Sports Car set with Shell Petrol

Shell Petrol have a promotion on for the rest of this month, and it got my attention.  It's special promotional Lego - and Lego is one of my favourite pastimes.  The offer is this: if you spend £30 on their high-performance petrol, you can purchase one of the special promotional sets for £1.99.  I saw this last week, and it's been percolating in my brain since then: based on the price difference between the 'normal' and 'high performance' petrol, how much would you actually have to pay for the Lego?  Lego isn't cheap, and sets of this size and complexity are typically in the £4 - £5 price range, so £1.99 is a considerable saving - in theory.



Now, in my calculations, I will assume that the difference in mileage performance between the two petrol grades is negligible (despite any marketing messages about how good the premium petrol is).  That's a whole separate question, and one that I'd like to be able to address with an A/B test.

So:  petrol in the UK is priced per litre (the prices per gallon would be too scary to display).  Working from memory, Shell's standard unleaded petrol is approximately 119 pence per litre, while the expensive petrol is around 125 pence per litre.  Based on these assumptions, I'll complete a worked example, then dive into the algebra. 

Now, my plan here is to identify how much standard petrol I could buy with £30, and then work out how much more it would cost me to buy that same volume of the premium petrol (as I will be doing) - that difference is the real extra cost of the offer.


If I spend £30 = 3000 pence on the standard petrol, how much petrol will I purchase?
3000 pence / 119 pence per litre = 25.21 litres of petrol

How much will it cost me to buy 25.21 litres of premium petrol?
25.21 litres x 125 pence per litre = 3151 pence

So the difference in cost would be 151 pence (£1.51).  Added to the stated cost of the Lego set (£1.99) this means that the actual total cost of the Lego set would be £1.51 + £1.99 = £3.50.  


Another view

Now, the truth is that I won't be spending any extra money - I will be buying £30 of premium petrol and simply getting less petrol for my money.  But how much less - and what's the hidden cost of buying the premium petrol instead of the standard?

3000 pence of premium petrol at 125 pence per litre will buy me 24 litres exactly.

24 litres of standard grade petrol (at 119 pence per litre) would cost me 2856 pence, so the additional cost I'm paying is £1.44, close to the £1.51 I calculated through the other method.

Actual figures

With actual figures of 118.9 pence per litre for the standard and 126.9 for the premium, the petrol cost difference by this method is £1.90, and the total cost of the Lego set comes to around £3.90 - close to the £4.00 figure given by the first method with these prices.


Algebra 
Looking at this in terms of algebra:

Let E be the price per litre of the Expensive petrol, and C be the price per litre of the Cheap petrol.


3000 / C = volume (in litres) of cheap petrol I would buy with 3000 pence

(3000 / C) x E = cost of buying that same volume of expensive petrol

(3000 / C) x E - 3000 = 3000 x (E - C) / C = difference in cost between cheap and expensive petrol

Application
 Now, this is all very academic, but it can be put to use with one key question:  if I think the Lego set is worth £4 (or 400 pence) then what's the maximum differential between the cheap and expensive petrol that I can accept?

If I am prepared to spend a total of 400 pence on the Lego set, then (deducting the 199p offer price) this means the maximum price difference for the petrol = 400p - 199p = 201p. 

So, if C = 119 then E can be at most about 127.

When I re-visited the petrol station, I discovered that C = 118.9 and E = 126.9.  It's almost as if they worked it out that way: with E = 126.9 and C = 118.9, the total cost of the Lego comes to almost exactly 400p.
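As a quick sanity check on all of the above, here's a small Python sketch of the calculation (prices in pence per litre; the function and its defaults are just my own illustration):

# Effective cost of the promotional Lego set, given the cheap (C) and
# expensive (E) petrol prices in pence per litre and a 3000p spend.
def effective_lego_cost(cheap, expensive, spend=3000, offer_price=199):
    litres = spend / cheap                 # volume of cheap petrol the spend would buy
    premium_cost = litres * expensive      # cost of that same volume of premium petrol
    petrol_penalty = premium_cost - spend  # = spend x (E - C) / C
    return offer_price + petrol_penalty

print(effective_lego_cost(119, 125))       # assumed prices: about 350p
print(effective_lego_cost(118.9, 126.9))   # actual prices: about 401p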
Did I buy the petrol?  And the Lego?



Well, yes.  But I knew I was paying more than the stated £1.99 for it :-)

Monday, 1 December 2014

Why do you read A/B testing case studies?

Case studies.  Every testing tool provider has them - in fact, most sales people have them - let's not limit this to just online optimisation.  Any good sales team will harness the power of persuasion of a good case study:  "Look at what our customers achieved by using our product."  Whether it's skin care products, shampoo, new computer hardware, or whatever it may be.  But for some reason, the online testing community really, really seems to enthuse about case studies in a way I've not seen anywhere else.

 

Salesmen will show you the amazing 197% uplift that their customers achieved through their products (and don't get me started on that one again).  But what do we do with them when we've read them?  Browsing through my Twitter feed earlier today, I noticed that Qualaroo have shared a link from a group who have decided that they will stop following A/B testing case studies:


And here's the link they refer to.


Quoting the headlines from that site, there are five problems with A/B testing case studies:

  1. What may work for one brand may not work for another.
  2. The quality of the tests varies.
  3. The impact is not necessarily sustainable over time.
  4. False assumptions and misinterpretation of result.
  5. Success bias: The experiments that do not work well usually do not get published.

I've read the article, and it leaves me with one question: so, why do you read A/B testing case studies?  The article points out many of the issues (some of them methodological, some statistical) with A/B testing, leading with the well-known 'what may work for one brand may not work for another' (or "your mileage may vary").  I've covered this, and some of the other issues listed here, before, when discussing why I'm an A/B power-tool skeptic.

I have the worrying suspicion that people (and maybe Qualaroo) read A/B testing case studies and then implement the featured test win on their own site with no further thought - no thought about whether the test win applies to their customers and their website, or even whether the test was valid.  Maybe it's just me (and it really could be just me), but when I read A/B testing case studies, I don't immediately think, 'Let's implement that on our site'.  My first thought is, 'Shall we test that on our site too?'.


And yes, there is success bias.  That's the whole point of case studies, isn't it?  "Look at the potential you could achieve with our testing tool," is significantly more compelling than, "Use our testing tool and see if you can get flat results after eight weeks of development and testing".  I expect to see eye-grabbing headlines, and I anticipate having to trawl through the blurb and the sales copy to get to the test design, the screenshots and possibly some mention of actual results.

So let's stick with A/B tests.  Let's not be blind to the possibility that our competitors' sites run differently from ours, attract different customers and have different opportunities to improve.  Read the case studies, be skeptical, or discerning, and if the test design seems interesting, construct your own test on your own site that will satisfy your own criteria for calling a win - and keep on optimising.