Wednesday, 11 February 2015

Pitfalls of online optimisation 2: Spot the Difference

The second pitfall in online optimisation that I would like to look at is why we obtain flat results - totally, completely flat results at all levels of the funnel.  All metrics show the same results - bounce rate, exit rate, cart additions, average order value, order conversion. There is nothing to choose between the two recipes, despite a solid hypothesis and analytics which support your idea.

The most likely cause is that the changes you made in your test recipe were just not dramatic enough.  There are different types of change you could test:
*  Visual change (the most obvious) 
*  Path change (where do you take users who click on a "Learn more" link?)
*  Interaction change (do you have a hover state? Is clicking different from hovering? How do you close a pop-up?)

Sometimes, the change could be dramatic but the problem is that it was made on an insignificant part of the site or page.  If you carried out an end-to-end customer journey through the control experience and then through the test experience, could you spot the difference?  Worse still, did you test on a page which has traffic but doesn't actively contribute to your overall sales (is its order participation virtually zero?)?
Is your hypothesis wrong? Did you think the strap line was important? Have you in fact discovered that something you thought was important is being overlooked by visitors?
Are you being too cautious - is there too much at stake and you didn't risk enough? 

Is the part of the site getting traffic? And does that traffic convert? Or is it just a traffic backwater or a pathing dead end?  It could be that you have unintentionally uncovered an area of your site which is not contributing to site performance.

Do your success metrics match your hypothesis? Are you optimising for orders on your Customer Support pages? Are you trying to drive down telephone sales?
Some areas of the site are critical, and small changes have big differences. On the other hand, some parts of the site are like background noise that users filter out (which is a shame when we spend so much time and effort selecting a typeface, colour scheme and imagery which supports our brand!). We agonise over the photos we use on our sites, we select the best images and icons... And they're just hygiene factors that users barely glance at.  On the other hand, there are some parts that are critical - persuasive copy, clear calls to action, product information and specifications.  What we need to know, and can find out through our testing, is what matters and what doesn't.

Another possibility is that you made two counter-acting changes - one improved conversion, and the other worsened it, so that the net change is close to zero. For example, did you make it easier for users to compare products by making the comparison link larger, but put it higher on the page which pushed other important information on the page to a lower position, where it wasn't seen?  I've mentioned this before in the context of landing page bounce rate - it's possible to improve the click through rate on an email or advert by promising huge discounts and low prices... but if the landing page doesn't reflect those offers, then people will bounce off it alarmingly quickly.  This should show up in funnel metrics, so make sure you're analysing each step in the funnel, not just cart conversion (user added an item to cart) and order conversion (user completed a purchase).
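One way to analyse each step is to compute the step-to-step progression rate for both recipes.  Here is a minimal Python sketch; the step names and visitor counts are invented for illustration and are not real test data.

```python
# Hypothetical visitor counts at each funnel step (illustrative only).
FUNNEL_STEPS = ["landing", "product", "cart", "checkout", "order"]
control = [10000, 4000, 1200, 600, 300]
test = [10000, 5000, 1250, 610, 300]


def step_rates(counts):
    """Return the share of visitors who progress from each step to the next."""
    return [after / before for before, after in zip(counts, counts[1:])]


def compare_funnels(steps, control_counts, test_counts):
    """Pair each step transition with its control and test progression rates."""
    transitions = [f"{a} -> {b}" for a, b in zip(steps, steps[1:])]
    return list(zip(transitions, step_rates(control_counts), step_rates(test_counts)))


for name, c_rate, t_rate in compare_funnels(FUNNEL_STEPS, control, test):
    print(f"{name:22s} control {c_rate:6.1%}   test {t_rate:6.1%}")
```

In this made-up example both recipes end with the same number of orders from the same number of visitors, yet the test recipe moves more people past the landing page and then loses them at the next step - two counter-acting changes hiding behind a flat headline figure.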

Alternatively:  did you help some users, but deter others?  Segment your data - new vs returning, traffic source, order value...  did everybody from all segments perform exactly as they did previously, or did the new visitors benefit from the test recipe, while returning visitors found the change unhelpful?
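As a rough sketch of that segmentation, the Python below compares conversion overall and by segment.  The segment names and the visitor and order counts are assumptions chosen to illustrate the idea, not figures from a real test.

```python
# Hypothetical (visitors, orders) per segment and recipe; illustrative only.
results = {
    "new": {"control": (5000, 100), "test": (5000, 125)},
    "returning": {"control": (5000, 200), "test": (5000, 175)},
}


def conversion(visitors, orders):
    """Order conversion as a fraction of visitors."""
    return orders / visitors


def overall(data, recipe):
    """Conversion for one recipe across all segments combined."""
    visitors = sum(segment[recipe][0] for segment in data.values())
    orders = sum(segment[recipe][1] for segment in data.values())
    return conversion(visitors, orders)


def segment_lift(data, segment):
    """Relative change in conversion (test vs control) for one segment."""
    c = conversion(*data[segment]["control"])
    t = conversion(*data[segment]["test"])
    return (t - c) / c


print(f"overall: control {overall(results, 'control'):.2%}, test {overall(results, 'test'):.2%}")
for name in results:
    print(f"{name}: lift {segment_lift(results, name):+.1%}")
```

With these invented numbers the overall conversion is identical for both recipes, but new visitors improve while returning visitors get worse, which is exactly the pattern a flat top-line result can hide.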

In conclusion, if your results are showing you that your performance is flat, that's not necessarily the same as 'nothing happened'.  If it's true that nothing happened, then you've proved something different - that your visitors are more resilient (or perhaps resistant) to the type of change you're making.  You've shown that the area you've tested, and the way you've tested it, don't matter to your visitors.  Drill down as far as possible to understand if you've genuinely got flat results, and if you have, you can either test much bigger changes on this part of the site, or stop testing here completely, and move on.

Monday, 9 February 2015

Reviewing Manchester United Performance - Real Life KPIs Part 2

A few weeks have passed since my last review of Manchester United's performance in this year's Premier League.  An overview of the season so far reveals some interesting facts:

Southampton went to third position in mid-January, following their win at Old Trafford.  Southampton finished eighth last season, and 14th in the season before that.  This is their first season with new manager Ronald Koeman.  Perhaps some analysis on his performance is needed, another time perhaps. :-)

Southampton enjoyed their first win in 27 years in the league at Old Trafford on 11 January.  Their fifteen previous visits were two draws (1999, 2013) and thirteen wins for Manchester United. Conversely, United had won their last five at home and missed out on the chance for a ninth win in the league – which was their total for home wins in the whole of last season.

So let's take a look at Louis Van Gaal's performance, as at 9 February 2015, and compare it, as usual, with David Moyes (the 'chosen one'), Alex Ferguson (2012-13) and Alex Ferguson (1986-87, his first season).

Horizontal axis - games played
Vertical axis - cumulative points
Red - AF 2012-13
Pink - AF 1986-87
Blue - DM 2013-2014
Green - LVG 2014-15 (ongoing)

The first thing to note is that LVG has improved his performance recently, and is now back above the blue danger line (David Moyes' performance in 2013-14, which is the benchmark for 'this will get you fired').

However, LVG's performance is still a long way below the red line left by Alex Ferguson in his final season, so let's briefly investigate why.

Under LVG, Manchester United have drawn 33% of their league games this season, compared to just 13% for Alex Ferguson's 2012-13 season.  This doesn't include the goal-less draw against Cambridge United in the FA Cup, which is a great example of Man Utd not pressing home their apparent advantage (Man Utd won the rematch 3-0 at Old Trafford). Yesterday (as I write), Manchester United scraped a draw against West Ham by playing the 'long-ball game', criticised after the match by West Ham's manager, Sam Allardyce.  West Ham are currently eighth in the table, four places behind Man Utd.

Interestingly, Moyes and Van Gaal have an identical win rate of 50%.  It might be suggested that Van Gaal's issue is not converting enough draws into wins; this is a slightly better problem to have compared to Moyes' problem, which was not holding on to enough draws and subsequently losing.  In football terms, Van Gaal needs to teach his team to more effectively 'park the bus'.

Is Louis Van Gaal safe?  According to the statistics alone, yes, he is, for now.  He's securing enough draws to keep him above the David Moyes danger line, and he's achieving more wins than Alex Ferguson did in his first season.  However, his primary focus must be to start converting draws into wins.  I haven't done the full match analysis to determine if that means scoring more or holding on to the lead once he has it - perhaps that will come later.

Is Louis Van Gaal totally safe?  That depends on if the staff at Man United think that a marginal improvement on last season's performance is worth the £59.7m spent on Angel Di Maria, £29m on Ander Herrera, and £27m on Luke Shaw (plus others).  £120m for a few more draws in the season is probably not seen as good value for money.

Monday, 2 February 2015

Sum to infinity: refuelling aircraft

I recently purchased a copy of "The Rainbow Book of BASIC Programs", a hardback book from 1984 featuring the BASIC text for a number of programs for readers to type into their home computers.  I'll forgo the trip down memory lane to the time when I owned an Acorn Electron, and move directly to one of the interesting maths problems in the book.

Quoting from the book: "You are an air force general called upon to plan how to ferry emergency supplies to teams of men in trouble at various distances from your home base.  However, one of the conditions is that your planes do not have the capability to reach the destination directly, which is always outside their maximum range.  Nor are they able to land and refuel en route.  Ace pilot Rickenbacker suggests that mid-air refuelling might provide the solution.

"'Just give me a squadron of identical planes,' he tells you. 'During the flight the point will come when the entire fuel supply in one plane will be just enough to fully refill all the others.  The empty plane then drops away and the rest continue.  At the next refuelling point another plane tops up all the others leaving the full planes to continue.  The squadron keeps going in this way until only one plane remains. It uses its last drop of fuel to get to the destination with the emergency supplies.'"

So, having provided his ingenious solution, you are left with your home computer to solve the problem:  how many aircraft will it take to double or triple the maximum range of one aircraft?  And how many aircraft will it take to extend the maximum range of an airbase by six times?

To tackle this problem, I'll start with a few simple examples and look for any patterns.

Let's assume that the maximum range of the aircraft is r, and let's say that it's 100 miles.

With two aircraft: when both aircraft reach half empty (50 miles) the second aircraft refuels the prime aircraft, which then travels a further 100 miles, so the total is 150 miles (r + r/2).
Second aircraft refuels prime aircraft when it has used up 1/2 fuel

With three aircraft: the third aircraft will have enough fuel to fully refuel the others when each aircraft has used up a third of its fuel. It will transfer a third of its capacity to the second aircraft, and a third to the prime aircraft.  We will then follow on with the two-aircraft case shown above. Total distance covered is 33 miles (to the first refuelling point), then 50 miles (with two aircraft) and then the prime aircraft travels the last 100 miles alone.  Total is 183 miles, (r + r/2 + r/3).
Third aircraft refuels second and prime aircraft when all have used up 1/3 fuel
And the final example, four aircraft.  In this case, the fourth aircraft will transfer its fuel to the other three after they've all used up a quarter of their fuel.  This fourth plane will add 25 miles to the overall total, (r/4).

Fourth aircraft refuels three aircraft when all aircraft have used up 1/4 fuel

So we can see that each nth plane adds on r/n to the total distance. The first plane adds r/1, the second added on r/2, then r/3, r/4 ... r/n.
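The pattern is easy to check with a few lines of Python.  This is only a sketch of the sum described above: the function name is my own, and the 100-mile range is the figure assumed earlier in this post.

```python
def squadron_range(aircraft, single_range=100.0):
    """Total range of a squadron when the nth plane contributes single_range / n."""
    return single_range * sum(1 / n for n in range(1, aircraft + 1))


for planes in (1, 2, 3, 4):
    print(planes, "aircraft:", round(squadron_range(planes), 1), "miles")
```

For two, three and four aircraft this reproduces the 150, 183 and 208 mile totals worked out by hand above.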

Apart from the factor of r, this is the harmonic series (1 + 1/2 + 1/3 + 1/4 + ...), which is a well-studied mathematical series with well-known properties.

The most surprising property (to me, and apparently many other people) of the sum of the harmonic series is that it doesn't converge: it doesn't get closer and closer to a fixed total. Instead, it keeps growing and growing, just more and more slowly.  If the squadron had enough aircraft, it could reach any distance necessary. Each additional aircraft adds less and less to the overall total, but the total continues to increase.

It may be counter-intuitive to find that the total doesn't reach a limit, but there are a few proofs that show this is true, the first of them discovered by Nicole d'Oresme (circa 1323-1382).

So, to answer the original question: how many aircraft will it take to double the range of one aircraft? 
1 + 1/2 + 1/3 + 1/4 = 2.083

Which means it will take four aircraft (including the prime aircraft) to double the range of the prime aircraft.

And to triple the range of one aircraft?  

1+ 1/2 +1/3 + 1/4 + 1/5 + 1/6 + 1/7 + 1/8 + 1/9 + 1/10 + 1/11 = 3.0199
Eleven aircraft!

And to extend the range to six times the initial range?
1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7 ... ... + 1/226 + 1/227 = 6.0044

Scarily, to increase the range of a 100 mile aircraft to 600 miles would take 227 aircraft (including the prime aircraft).  This also gets ridiculous, as the distance between refuels becomes tiny for a large squadron: about 0.0044 x r (r/227, less than half a mile here) in the first few instances.
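In the spirit of the original BASIC listing, here is a short Python sketch that finds the smallest squadron for a given multiple of the single-aircraft range.  The function name is my own, and it simply keeps adding terms of the harmonic series until the target is reached.

```python
def aircraft_needed(multiple):
    """Smallest number of aircraft whose harmonic sum reaches `multiple`."""
    total = 0.0
    planes = 0
    while total < multiple:
        planes += 1
        total += 1 / planes  # the nth plane adds 1/n of a single plane's range
    return planes


for target in (2, 3, 6):
    print(f"{target} x range: {aircraft_needed(target)} aircraft")
```

This agrees with the sums above: 4, 11 and 227 aircraft respectively.  Because the series grows so slowly, each further multiple of the range needs a much larger squadron than the last.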

So this is clearly a theoretical exercise:  the instantaneous refuelling is tricky enough to believe, but the rapid usage of aircraft (and the 'falling away' to the ground) is just wasteful!

Tuesday, 27 January 2015

Pitfalls of Online Optimisation

I've previously covered the trials of starting and maintaining an online optimisation program, and once you've reached a critical mass it seems as if the difficulties are over and it's plain sailing. Each test provides valuable business insights, yields a conversion lift (or points to a future opportunity) and you've reached a virtuous cycle of testing and learning. Except when it doesn't. There are some key pitfalls to avoid, or, having hit them, to conquer.

1. Obtaining flat results (a draw)
2. Too little meaningful data
3. Misunderstanding discrete versus continuous testing

The highest-scoring draw in Premier League history was the 5-5 draw between West Bromwich Albion and Manchester United in May 2013.  Just last weekend, the same mighty Manchester United were held to a goalless draw by Cambridge United, a team which is two divisions below them in the English league, in an FA Cup match.  Are these games the same? Are the two sides really equal? In both games, both teams performed equally, so on face value you would think they are (and perhaps they are; Manchester United are really not having a great season).  It's time to consider the underlying data to really extract an unbiased and fuller story of what happened (the Cambridge press recorded it as a great draw, one Manchester-based website saw it slightly differently).

Let's look at the recent match between Cambridge and Manchester United, borrowing a diagram from Cambridge United's official website.

One thing is immediately clear:  Cambridge didn't score any goals because they didn't get a single shot on target.  Manchester United, on the other hand, had five shots on target but a further ten that missed - only 33% of shots were heading for the goal.  Analysis of the game would probably indicate that these were long-range shots as Cambridge kept Manchester at a 'safe' distance from their goal.  Although this game was a goalless draw, it's clear that the two sides have different issues that they need to address if they are to score in the replay next week.

Now let's look at the high-scoring draw between West Brom and Man Utd.  Which team was better and which was lucky to get a single point from the game? In each case, it would also be beneficial to analyse how each of the ten goals was scored - that's ten goals (one every nine minutes on average) which is invaluable data compared to the goalless draw.

The image on the right is borrowed from the Guardian's website, and shows the key metrics for the game (I've discussed key metrics in football matches before).  What can we conclude?

- It was a close match, with both teams seeing similar levels of ball possession.

- West Brom achieved 15 shots in total, compared to just 12 for Man Utd

- If West Brom had been able to improve the quality and accuracy of their goal attempts, they may have won the game. 

- For Man Utd, the problem was not the quality of their goal attempts (they had 66% accuracy, compared to just over 50% for West Bromwich) but the quantity of them.  Their focus should be creating more shooting opportunities.

- As a secondary metric, West Brom should probably look at the causes for all those fouls.  I didn't see the game directly, but further analysis and study would indicate what happened there, and how the situation could be improved.

There is a tendency to analyse our losing tests to find out why they lost (if only so we can explain it to our managers), and with thorough planning and a solid hypothesis we should be able to identify why a test did not do well.  It's also human nature to briefly review our winners so that we can see if we can do even better in future.  But draws? They get ignored and forgotten - the test recipe had no impact and is not worth pursuing. Additionally, it didn't lose, so we don't apply the same level of scrutiny that we would if it had suffered a disastrous defeat. If wins are green and losers are red, then somehow the draws just fade to grey.  However, it shouldn't be the case.

So what should we look for in our test data?  Firstly - revisit the hypothesis.  You expected to see an overall improvement in a particular metric, but that didn't happen: was this because something happened in the pages between the test page and the success page?  For example, did you reduce a page's exit rate by apparently improving the relevance of the page's banners, only to lose all the clickers on the very next page instead - the net result is that order conversion is flat, but the story needs to be told more thoroughly. Think about how Manchester United and Cambridge United need different strategies to improve their performance in the next match.

But what if absolutely all the metrics are flat?  There's no real change in exit rate, bounce rate, click through rate, time on page... any other page metric or sales figure you care to mention?  It is quite likely that the test you've run was not significant enough. The change in wording, colour, design or banner that you made just wasn't dramatic enough to affect your visitors' perceptions and intentions. There may still be something useful to learn from this: your visitors aren't bothered if your banners feature pictures of your product or a family photo; or a picture of a single person or a group of people... or whichever it may be.

FA Cup matches have the advantage of a replay before there's extra time and penalties (the first may be an option for a flat test, the second sounds interesting!), so we're guaranteed a re-test, more data and a definite result in the end - something we can all look for in our tests.

Thursday, 18 December 2014

Buy a Lego Sports Car set with Shell Petrol

Shell Petrol have a promotion on for the rest of this month, and it got my attention.  It's special promotional Lego - and Lego is one of my favourite pastimes.  The offer is this:  if you spend £30 on their special high-performance petrol, you can purchase one of the special promotional sets for £1.99.  I saw this last week, and it's been percolating in my  brain since then:  based on the price difference between the 'normal' and 'high performance' petrol, how much would you actually have to pay for the Lego?  Lego isn't cheap, and sets of this size and complexity are typically in the £4 - £5 price range, so £1.99 is a considerable saving - in theory. 

Now, in my calculations, I will assume that the difference in mileage performance between the two petrol grades is negligible (despite any marketing messages about how good the premium petrol is).  That's a whole separate question, and one that I'd like to be able to address with an A/B test.

So:  petrol in the UK is priced per litre (the prices per gallon would be too scary to display).  Working from memory, Shell's standard unleaded petrol is approximately 119 pence per litre, while the expensive petrol is around 125 pence per litre.  Based on these assumptions, I'll complete a worked example, then dive into the algebra. 

Now, my plan here is to identify how much standard petrol I could buy with £30, to understand how much more that's going to cost me if I buy premium (as I will be doing) and what the extra cost would be if I bought the same amount of standard petrol.

If I spend £30 = 3000 pence on the standard petrol, how much petrol will I purchase?
3000 pence / 119 pence per litre = 25.21 litres of petrol

How much will it cost me to buy 25.21 litres of premium petrol?
25.21 litres x 125 pence per litre = 3151 pence

So the difference in cost would be 151 pence (£1.51).  Added to the stated cost of the Lego set (£1.99) this means that the actual total cost of the Lego set would be £1.51 + £1.99 = £3.50.  

Another view

Now, the truth is that I won't be spending the extra money on premium petrol - I will be buying £30 of premium petrol and buying less petrol.  But how much less - and what's the hidden cost of buying the premium petrol instead of the standard?

3000 pence of premium petrol at 125 pence per litre will buy me 24 litres exactly.

24 litres of standard grade petrol (at 119 pence per litre) would cost me 2856 pence, so the additional cost I'm paying is £1.44, close to the £1.51 I calculated through the other method.

Actual figures

With actual figures of 118.9 pence per litre for the standard, and 126.9 for the premium, the petrol cost difference by this second method is about £1.90, so the total cost is roughly £3.90, close to the £4.00 figure that the first method gives.
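Both ways of looking at the hidden cost can be written as a couple of small Python functions.  This is a sketch of the arithmetic above; the function names are my own labels for the two methods.

```python
BUDGET_PENCE = 3000  # the £30 spend required by the promotion


def extra_cost_same_volume(cheap, expensive, budget=BUDGET_PENCE):
    """Method 1: take the litres `budget` buys at the cheap price, priced as premium."""
    litres = budget / cheap
    return litres * expensive - budget


def extra_cost_same_spend(cheap, expensive, budget=BUDGET_PENCE):
    """Method 2: spend `budget` on premium, then price those litres as cheap petrol."""
    litres = budget / expensive
    return budget - litres * cheap


# Prices in pence per litre: the from-memory estimates, then the actual figures.
for cheap, expensive in ((119, 125), (118.9, 126.9)):
    m1 = extra_cost_same_volume(cheap, expensive)
    m2 = extra_cost_same_spend(cheap, expensive)
    print(f"C={cheap}, E={expensive}: method 1 = {m1:.0f}p, method 2 = {m2:.0f}p")
```

The two methods always differ slightly, because the first prices the gap on the larger (cheap-petrol) volume and the second on the smaller (premium) volume.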

Looking at this in terms of algebra:

Let E be the price per litre of the Expensive petrol, and C be the price per litre of the Cheap petrol.

3000 / C = volume of cheap petrol I would buy with 3000 pence

(3000 / C) x E - 3000 = 3000 x (E - C) / C = difference in cost between cheap and expensive petrol.

 Now, this is all very academic, but it can be put to use with one key question:  if I think the Lego set is worth £4 (or 400 pence) then what's the maximum differential between the cheap and expensive petrol that I can accept?

If I am prepared to spend a total of 400 pence on the Lego set, then (deducting the 199p offer price) this means the maximum price difference for the petrol = 400p - 199p = 201p. 

Setting 3000 x (E - C) / C = 201 with C = 119 gives E = 119 x (1 + 201/3000) = 127.0 (to one decimal place).

When I re-visited the petrol station, I discovered that C = 118.9 and E = 126.9.    It's like they almost worked it out that way:  if E = 126.9 and C = 118.9 then the total cost of the Lego would be almost exactly 400p.
Did I buy the petrol?  And the Lego?

Well, yes.  But I knew I was paying more than the stated £1.99 for it :-)

Monday, 1 December 2014

Why do you read A/B testing case studies?

Case studies.  Every testing tool provider has them - in fact, most sales people have them - let's not limit this to just online optimisation.  Any good sales team will harness the power of persuasion of a good case study:  "Look at what our customers achieved by using our product."  Whether it's skin care products, shampoo, new computer hardware, or whatever it may be.  But for some reason, the online testing community really, really seems to enthuse about case studies in a way I've not seen anywhere else.


Salesmen will show you the amazing 197% uplift that their customers achieved through their products (and don't get me started on that one again).  But what do we do with them when we've read them?  Browsing through my Twitter feed earlier today, I noticed that Qualaroo have shared a link from a group who have decided that they will stop following A/B testing case studies:

And here's the link they refer to.

Quoting the headlines from that site, there are five problems with A/B testing case studies:

  1. What may work for one brand may not work for another.
  2. The quality of the tests varies.
  3. The impact is not necessarily sustainable over time.
  4. False assumptions and misinterpretation of result.
  5. Success bias: The experiments that do not work well usually do not get published.
I've read the article, and it leaves me with one question:  So, why do you read A/B testing case studies?  The article points out many of the issues (some of them methodological, some statistical) with A/B testing, leading with the well-known 'what may work for one brand may not work for another' (or "your mileage may vary").  I've covered this, and some of the other issues listed here before, discussing why I'm an A/B power-tool skeptic.

I came to the worrying suspicion that people (and maybe Qualaroo) read A/B testing case studies, and then implement the featured test win on their own site with no further thought.  No thought about if the test win applies to their customers and their website, or even if the test was valid.  Maybe it's just me (and it really could be just me), but when I read A/B testing case studies, I don't immediately think, 'Let's implement that on our site'.  My first thought is, 'Shall we test that on our site too?'.

And yes, there is success bias.  That's the whole point of case studies, isn't it?  "Look at the potential you could achieve with our testing tool," is significantly more compelling than, "Use our testing tool and see if you can get flat results after eight weeks of development and testing".  I expect to see eye-grabbing headlines, and I anticipate having to trawl through the blurb and the sales copy to get to the test design, the screenshots and possibly some mention of actual results.

So let's stick with A/B tests.  Let's not be blind to the possibility that our competitors' sites run differently from ours, attract different customers and have different opportunities to improve.  Read the case studies, be skeptical, or discerning, and if the test design seems interesting, construct your own test on your own site that will satisfy your own criteria for calling a win - and keep on optimising.

Monday, 24 November 2014

Real-Life Testing and Measuring KPIs - Manchester United

I enjoy analytics and testing, and applying them to online customer experience - using data to inform ways of improving a website.  Occasionally, it occurs to me that life would be great if we could do 'real life' testing - which is the quickest way home; which is the best meal to order; which coat should I wear today (is it going to rain)?  Instead, we have to be content with before/after analysis - make a decision, make a change, and see the difference.

One area which I also like to look at periodically is sport - in particular, football (soccer).  I've used football as an example in the past, to show the importance of picking the right KPIs.  In football, there's no A/B testing - which players should a manager select, which formation should they play in - it's all about making a decision and seeing what happens.

One of my least favourite football teams is Manchester United.  As a child, my friends all supported Liverpool, and so I did too, having no strong feeling on the subject at the time.  I soon learned, however, that as Liverpool fans, it was traditional to dislike Manchester United, due to their long-standing (and ongoing) rivalry.  So I have to confess to a slight feeling of superiority whenever Manchester United perform badly.  Since the departure of their long-serving manager, Alex Ferguson, they've seen a considerable drop in performance, and much criticism has been made of his two successors, first David Moyes, and now Louis van Gaal.  David Moyes had a poor season (by Man Utd's standards) and was fired before the end of the season.  His replacement, Louis van Gaal, has not had a much better season thus far.  Here's a comparison of their performance, measured in cumulative points won after each game [3 points for a win, 1 for a draw, 0 for a loss].
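For anyone who wants to build the same kind of chart, the cumulative points KPI is simple to calculate.  Here is a minimal Python sketch; the sequence of results is made up for illustration and is not any manager's actual record.

```python
from itertools import accumulate

# Points awarded per result: win, draw, loss.
POINTS = {"W": 3, "D": 1, "L": 0}


def cumulative_points(results):
    """Running points total after each game, for a string such as 'WDLW'."""
    return list(accumulate(POINTS[result] for result in results))


# Made-up run of results, purely to show the calculation.
print(cumulative_points("WDLLWWD"))
```

Plotting one such list per manager, with games played on the horizontal axis, gives the kind of comparison described here.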
So, how bad is it?

Well, we can see that performance in the current season (thick green line) is lower than last season (the blue line).  Indeed, after game 10 in early November 2014, the UK media identified that this was the worst start to the season since 1986.  But since then, there's been an upturn in performance and at the time of writing, Manchester United have won their last two matches.  So perhaps things aren't too bad for Louis van Gaal.  However, the situation looks slightly different if we overlay the line for the previous season, 2012-2013, which was Sir Alex Ferguson's final season in charge.

You can see the red line indicating the stronger performance that Manchester United achieved with Sir Alex Ferguson, and how the comparison between the two newer managers pales into insignificance when you look at how they've performed against him.  There's a message here about comparing two test recipes when they've both performed badly against the control recipe, but we'll move on.

There have been some interesting results for Manchester United already this season, in particular, a defeat by Leicester City (a much smaller team who had just been promoted into the Premier League, and were generally regarded as underdogs in this match).  The 5-3 defeat by Leicester rewrote the history books.  Among other things...

- It was the first time Leicester had scored five or more goals in 14 years
- It was the first time Man Utd have ever conceded four or more goals in a Premier League game against a newly-promoted team
- It was the first time Leicester City have scored four or more goals against Manchester Utd in the league since April 1963

But apart from the anecdotal evidence, what statistical evidence is there that we could point to that would highlight the reason for the recent decline in performance?  Where should the new manager focus his efforts for improvement, based on the data?  (I haven't watched any of the matches in question.)

Let's compare three useful metrics that show Manchester United's performance over the first 10 games of the season:  goals scored, goals conceded and clean sheets (i.e. matches where they conceded no goals).  Same colour-scheme as before:

This graph highlights (in a way I was not expecting) the clear way that Sir Alex Ferguson's successors need to improve:  their teams need to score more goals.  I know that seems obvious, but we've identified that the team's defence is adequate, conceding fewer or the same number as in Alex Ferguson's season.  However, this data is a little oversimplified, since it also hides the 5-3 defeat I gave as an example above, where the press analysis after the match showed 'defensive frailties' in the Manchester United team.  Clearly more digging would be required to identify the true root cause - but I'd still start with 'How can we score more goals?'.

- The first ten games for each season are not against the same teams, so the 2012-13 season may have been 'easier' than the subsequent seasons (in fact, David Moyes made this complaint before the 2013-14 season had even started).
- Ten games is not a representative sample of a 38-game season, but we're not looking at the season, we're just comparing how they start.  We aren't giving ourselves the benefit of hindsight.
- I am a Liverpool fan, and at the time of writing, the Liverpool manager has had a run of four straight defeats.  Perhaps I should have analysed his performance instead.  No football manager is perfect (and I hear that Arsenal are also having a bad season).

So:  should Manchester United sack Louis van Gaal?  Well, they didn't sack David Moyes until there were only about six matches left until the end of the season; it seems harsh to fire Louis van Gaal just yet (it seems that the main reason for sacking David Moyes was actually the Manchester United share price, which also recovered after he'd been fired).

I whole-heartedly endorse making data-supported decisions, but only if you have the full context.  Here, it's hard to call (I haven't got enough data), especially since you're only looking at a before/after analysis rather than an A/B test (which would be a luxury here, and probably involve time travel).  And that, I guess, is the fun (?) of sport.