Tuesday, 27 January 2015

Pitfalls of Online Optimisation

I've previously covered the trials of starting and maintaining an online optimisation program, and once you've reached a critical mass it seems as if the difficulties are over and it's plain sailing. Each test provides valuable business insights, yields a conversion lift (or points to a future opportunity) and you've reached a virtuous cycle of testing and learning. Except when it doesn't. There are some key pitfalls to avoid, or, having hit them, to conquer.

1. Obtaining flat results (a draw)
2. Too little meaningful data
3. Misunderstanding discrete versus continuous testing

The largest ever score draw in English football was a 5-5 draw between West Bromwich Albion and Manchester United in May 2013.  Just last weekend, the same mighty Manchester United were held to a goalless draw by Cambridge United, a team which is two divisions below them in the English league, in an FA Cup match.  Are these games the same? Are the two sides really equal? In both games, both teams performed equally, so on face value you would think they are (and perhaps they are; Manchester United are really not having a great season).  It's time to consider the underlying data to really extract an unbiased and fuller story of what happened (the Cambridge press recorded it as a great draw, one Manchester-based website saw it slightly differently).

Let's look at the recent match between Cambridge and Manchester United, borrowing a diagram from Cambridge United's official website.

One thing is immediately clear:  Cambridge didn't score any goals because they didn't get a single shot on target.  Manchester United, on the other hand, had five shots on target but a further ten that missed - only 33% of shots were heading for the goal.  Analysis of the game would probably indicate that these were long-range shots as Cambridge kept Manchester at a 'safe' distance from their goal.  Although this game was a goalless draw, it's clear that the two sides have different issues that they need to address if they are to score in the replay next week.

Now let's look at the high-scoring draw between West Brom and Man Utd.  Which team was better and which was lucky to get a single point from the game? In each case, it would also be beneficial to analyse how each of the ten goals was scored - that's ten goals (one every nine minutes on average) which is invaluable data compared to the goalless draw.

The image on the right is borrowed from the Guardian's website, and shows the key metrics for the game (I've discussed key metrics in football matches before).  What can we conclude?

- It was a close match, with both teams seeing similar levels of ball possession.

- West Brom achieved 15 shots in total, compared to just 12 for Man Utd.

- If West Brom had been able to improve the quality and accuracy of their goal attempts, they may have won the game. 

- For Man Utd, the problem was not the quality of their goal attempts (they had 66% accuracy, compared to just over 50% for West Bromwich) but the quantity of them.  Their focus should be creating more shooting opportunities.

- As a secondary metric, West Brom should probably look at the causes for all those fouls.  I didn't see the game directly, but further analysis and study would indicate what happened there, and how the situation could be improved.

There is a tendency to analyse our losing tests to find out why they lost (if only so we can explain it to our managers), and with thorough planning and a solid hypothesis we should be able to identify why a test did not do well.  It's also human nature to briefly review our winners so that we can see if we can do even better in future.  But draws? They get ignored and forgotten - the test recipe had no impact and is not worth pursuing. Additionally, it didn't lose, so we don't apply the same level of scrutiny that we would if it had suffered a disastrous defeat. If wins are green and losers are red, then somehow the draws just fade to grey.  However, it shouldn't be the case.

So what should we look for in our test data?  Firstly - revisit the hypothesis.  You expected to see an overall improvement in a particular metric, but that didn't happen: was this because something happened in the pages between the test page and the success page?  For example, did you reduce a page's exit rate by apparently improving the relevance of the page's banners, only to lose all the clickers on the very next page instead - the net result is that order conversion is flat, but the story needs to be told more thoroughly. Think about how Manchester United and Cambridge United need different strategies to improve their performance in the next match.
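To make the banner example concrete, here is a minimal Python sketch of a step-by-step funnel comparison. The visitor and order counts are invented purely for illustration, and treating everyone who does not reach the next page as an exit is a simplifying assumption.

```python
# Hypothetical funnel counts for illustration only; not real test data.
funnel = {
    "control": {"test_page_visits": 10_000, "next_page_visits": 4_000, "orders": 400},
    "test": {"test_page_visits": 10_000, "next_page_visits": 5_000, "orders": 400},
}


def step_rates(counts):
    """Return exit rate, next-page-to-order rate and overall conversion for one recipe."""
    visits = counts["test_page_visits"]
    onward = counts["next_page_visits"]
    orders = counts["orders"]
    return {
        # Simplification: anyone who does not reach the next page counts as an exit.
        "exit_rate": 1 - onward / visits,
        "next_page_to_order": orders / onward,
        "order_conversion": orders / visits,
    }


rates = {recipe: step_rates(counts) for recipe, counts in funnel.items()}
for recipe, r in rates.items():
    print(
        f"{recipe:8s} exit {r['exit_rate']:.0%}  "
        f"next page to order {r['next_page_to_order']:.0%}  "
        f"overall {r['order_conversion']:.1%}"
    )
```

With these made-up numbers the overall order conversion is identical for both recipes, yet the test recipe has a lower exit rate on the test page and a lower conversion from the next page: a draw on the headline metric with two different stories underneath.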

But what if absolutely all the metrics are flat?  There's no real change in exit rate, bounce rate, click through rate, time on page... any other page metric or sales figure you care to mention?  It is quite likely that the change you tested was simply not big enough.  The change in wording, colour, design or banner that you made just wasn't dramatic enough to affect your visitors' perceptions and intentions.  There may still be something useful to learn from this: your visitors aren't bothered if your banners feature pictures of your product or a family photo; or a picture of a single person or a group of people... or whichever it may be.

FA Cup matches have the advantage of a replay before there's extra time and penalties (the first may be an option for a flat test, the second sounds interesting!), so we're guaranteed a re-test, more data and a definite result in the end - something we can all look for in our tests.

Thursday, 18 December 2014

Buy a Lego Sports Car set with Shell Petrol

Shell Petrol have a promotion on for the rest of this month, and it got my attention.  It's special promotional Lego - and Lego is one of my favourite pastimes.  The offer is this:  if you spend £30 on their special high-performance petrol, you can purchase one of the special promotional sets for £1.99.  I saw this last week, and it's been percolating in my  brain since then:  based on the price difference between the 'normal' and 'high performance' petrol, how much would you actually have to pay for the Lego?  Lego isn't cheap, and sets of this size and complexity are typically in the £4 - £5 price range, so £1.99 is a considerable saving - in theory. 

Now, in my calculations, I will assume that the difference in mileage performance between the two petrol grades is negligible (despite any marketing messages about how good the premium petrol is).  That's a whole separate question, and one that I'd like to be able to address with an A/B test.

So:  petrol in the UK is priced per litre (the prices per gallon would be too scary to display).  Working from memory, Shell's standard unleaded petrol is approximately 119 pence per litre, while the expensive petrol is around 125 pence per litre.  Based on these assumptions, I'll complete a worked example, then dive into the algebra. 

Now, my plan here is to identify how much standard petrol I could buy with £30, to understand how much more that's going to cost me if I buy premium (as I will be doing) and what the extra cost would be if I bought the same amount of standard petrol.

If I spend £30 = 3000 pence on the standard petrol, how much petrol will I purchase?
3000 pence / 119 pence per litre = 25.21 litres of petrol

How much will it cost me to buy 25.21 litres of premium petrol?
25.21 litres x 125 pence per litre = 3151 pence

So the difference in cost would be 151 pence (£1.51).  Added to the stated cost of the Lego set (£1.99) this means that the actual total cost of the Lego set would be £1.51 + £1.99 = £3.50.  

Another view

Now, the truth is that I won't be spending the extra money on premium petrol - I will be buying £30 of premium petrol and buying less petrol.  But how much less - and what's the hidden cost of buying the premium petrol instead of the standard?

3000 pence of premium petrol at 125 pence per litre will buy me 24 litres exactly.

24 litres of standard grade petrol (at 119 pence per litre) would cost me 2856 pence, so the additional cost I'm paying is £1.44, close to the £1.51 I calculated through the other method.

Actual figures

With actual figures of 118.9 pence per litre for the standard, and 126.9 for the premium, the petrol cost difference is about £1.90 by this second method and about £2.02 by the first, so the total cost of the Lego set is close to £4.00 either way.
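The two methods can be checked with a short Python sketch. The function names are my own, chosen for illustration; the prices are the ones quoted in this post.

```python
SPEND_PENCE = 3000  # the £30 qualifying spend
OFFER_PENCE = 199   # the stated price of the Lego set


def extra_cost_same_volume(cheap, expensive, spend=SPEND_PENCE):
    """First method: take the volume that `spend` buys at the cheap price, pay the expensive price."""
    litres = spend / cheap
    return litres * expensive - spend


def extra_cost_same_spend(cheap, expensive, spend=SPEND_PENCE):
    """Second method: spend `spend` on expensive petrol, then price that volume at the cheap rate."""
    litres = spend / expensive
    return spend - litres * cheap


for label, cheap, expensive in [("from memory", 119.0, 125.0), ("actual", 118.9, 126.9)]:
    a = extra_cost_same_volume(cheap, expensive)
    b = extra_cost_same_spend(cheap, expensive)
    print(
        f"{label}: same volume {a:.0f}p (set costs {a + OFFER_PENCE:.0f}p), "
        f"same spend {b:.0f}p (set costs {b + OFFER_PENCE:.0f}p)"
    )
```

For the remembered prices this gives roughly 151p and 144p, matching the worked examples; for the actual prices it gives roughly 202p and 189p.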

Looking at this in terms of algebra:

Let E be the price per litre of the Expensive petrol, and C be the price per litre of the Cheap petrol.

3000 / C = volume (in litres) of cheap petrol I would buy with 3000 pence

(3000 / C) x E - 3000 = 3000 x (E - C) / C = difference in cost between cheap and expensive petrol.

 Now, this is all very academic, but it can be put to use with one key question:  if I think the Lego set is worth £4 (or 400 pence) then what's the maximum differential between the cheap and expensive petrol that I can accept?

If I am prepared to spend a total of 400 pence on the Lego set, then (deducting the 199p offer price) this means the maximum price difference for the petrol = 400p - 199p = 201p. 

So, if C = 119 then, setting 3000 x (E - C) / C = 201, E = 119 x (1 + 201 / 3000), which is approximately 127.0

When I re-visited the petrol station, I discovered that C = 118.9 and E = 126.9.    It's like they almost worked it out that way:  if E = 126.9 and C = 118.9 then the total cost of the Lego would be almost exactly 400p.
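
As a rough check on that observation, here is a small Python sketch that turns the question around: given the cheap price and what I think the set is worth, what is the highest premium price I could accept? It assumes the 'same volume' method from the first worked example, and the function name is hypothetical.

```python
def max_expensive_price(cheap, set_value=400, offer=199, spend=3000):
    """Highest premium price (pence per litre) that keeps the true cost of the set within set_value.

    Assumes the extra petrol cost is spend * (E - C) / C, i.e. the same volume of
    petrol priced at the premium rate instead of the standard rate.
    """
    acceptable_extra = set_value - offer  # 201p with the default figures
    return cheap * (1 + acceptable_extra / spend)


print(f"{max_expensive_price(118.9):.2f}")
```

With C = 118.9 this comes out just under the 126.9 pence per litre actually charged.
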
Did I buy the petrol?  And the Lego?

Well, yes.  But I knew I was paying more than the stated £1.99 for it :-)

Monday, 1 December 2014

Why do you read A/B testing case studies?

Case studies.  Every testing tool provider has them - in fact, most sales people have them - let's not limit this to just online optimisation.  Any good sales team will harness the power of persuasion of a good case study:  "Look at what our customers achieved by using our product."  Whether it's skin care products, shampoo, new computer hardware, or whatever it may be.  But for some reason, the online testing community really, really seems to enthuse about case studies in a way I've not seen anywhere else.


Salesmen will show you the amazing 197% uplift that their customers achieved through their products (and don't get me started on that one again).  But what do we do with them when we've read them?  Browsing through my Twitter feed earlier today, I noticed that Qualaroo have shared a link from a group who have decided that they will stop following A/B testing case studies:

And here's the link they refer to.

Quoting the headlines from that site, there are five problems with A/B testing case studies:

  1. What may work for one brand may not work for another.
  2. The quality of the tests varies.
  3. The impact is not necessarily sustainable over time.
  4. False assumptions and misinterpretation of result.
  5. Success bias: The experiments that do not work well usually do not get published.

I've read the article, and it leaves me with one question:  So, why do you read A/B testing case studies?  The article points out many of the issues (some of them methodological, some statistical) with A/B testing, leading with the well-known 'what may work for one brand may not work for another' (or "your mileage may vary").  I've covered this, and some of the other issues listed here before, discussing why I'm an A/B power-tool skeptic.

I came to the worrying suspicion that people (and maybe Qualaroo) read A/B testing case studies, and then implement the featured test win on their own site with no further thought.  No thought about whether the test win applies to their customers and their website, or even whether the test was valid.  Maybe it's just me (and it really could be just me), but when I read A/B testing case studies, I don't immediately think, 'Let's implement that on our site'.  My first thought is, 'Shall we test that on our site too?'.

And yes, there is success bias.  That's the whole point of case studies, isn't it?  "Look at the potential you could achieve with our testing tool," is significantly more compelling than, "Use our testing tool and see if you can get flat results after eight weeks of development and testing".  I expect to see eye-grabbing headlines, and I anticipate having to trawl through the blurb and the sales copy to get to the test design, the screenshots and possibly some mention of actual results.

So let's stick with A/B tests.  Let's not be blind to the possibility that our competitors' sites run differently from ours, attract different customers and have different opportunities to improve.  Read the case studies, be skeptical, or discerning, and if the test design seems interesting, construct your own test on your own site that will satisfy your own criteria for calling a win - and keep on optimising.

Monday, 24 November 2014

Real-Life Testing and Measuring KPIs - Manchester United

I enjoy analytics and testing, and applying them to online customer experience - using data to inform ways of improving a website.  Occasionally, it occurs to me that life would be great if we could do 'real life' testing - which is the quickest way home; which is the best meal to order; which coat should I wear today (is it going to rain)?  Instead, we have to be content with before/after analysis - make a decision, make a change, and see the difference.

One area which I also like to look at periodically is sport - in particular, football (soccer).  I've used football as an example in the past, to show the importance of picking the right KPIs.  In football, there's no A/B testing - which players should a manager select, which formation should they play in - it's all about making a decision and seeing what happens.

One of my least favourite football teams is Manchester United.  As a child, my friends all supported Liverpool, and so I did too, having no strong feeling on the subject at the time.  I soon learned, however, that as Liverpool fans, it was traditional to dislike Manchester United, due to their long-standing (and ongoing) rivalry.  So I have to confess to a slight feeling of superiority whenever Manchester United perform badly.  Since the departure of their long-serving manager, Alex Ferguson, they've seen a considerable drop in performance, and much criticism has been made of his two successors, first David Moyes, and now Louis van Gaal.  David Moyes had a poor season (by Man Utd's standards) and was fired before the end of the season.  His replacement, Louis van Gaal, has not had a much better season thus far.  Here's a comparison of their performance, measured in cumulative points won after each game [3 points for a win, 1 for a draw, 0 for a loss].
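
For anyone who wants to build the same kind of chart, the cumulative points measure can be sketched in a few lines of Python. The result strings below are placeholders for illustration only; they are not the real results of any season.

```python
from itertools import accumulate

POINTS = {"W": 3, "D": 1, "L": 0}


def cumulative_points(results):
    """Running league points total after each game, given a string of W/D/L results."""
    return list(accumulate(POINTS[r] for r in results))


# Placeholder sequences for illustration only; these are not real match results.
example_seasons = {
    "season A": "WDLWW",
    "season B": "LDWDL",
}
for name, results in example_seasons.items():
    print(name, cumulative_points(results))
```

Plotting one such list per season on the same axes gives the comparison lines discussed here.
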
So, how bad is it?

Well, we can see that performance in the current season (thick green line) is lower than last season (the blue line).  Indeed, after game 10 in early November 2014, the UK media identified that this was the worst start to the season since 1986.  But since then, there's been an upturn in performance and at the time of writing, Manchester United have won their last two matches.  So perhaps things aren't too bad for Louis van Gaal.  However, the situation looks slightly different if we overlay the line for the previous season, 2012-2013, which was Sir Alex Ferguson's final season in charge.

You can see the red line indicating the stronger performance that Manchester United achieved with Sir Alex Ferguson, and how the comparison between the two newer managers pales into insignificance when you look at how they've performed against him.  There's a message here about comparing two test recipes when they've both performed badly against the control recipe, but we'll move on.

There have been some interesting results for Manchester United already this season, in particular, a defeat by Leicester City (a much smaller team who had just been promoted into the Premier League, and were generally regarded as underdogs in this match).  The 5-3 defeat by Leicester rewrote the history books.  Among other things...

- It was the first time Leicester had scored five or more goals in 14 years
- It was the first time Man Utd have ever conceded four or more goals in a Premier League game against a newly-promoted team
- It was the first time Leicester City have scored four or more goals against Manchester Utd in the league since April 1963

But apart from the anecdotal evidence, what statistical evidence is there that we could point to that would highlight the reason for the recent decline in performance?  Where should the new manager focus his efforts for improvement, based on the data?  (I haven't watched any of the matches in question.)

Let's compare three useful metrics that show Manchester United's performance over the first 10 games of the season:  goals scored, goals conceded and clean sheets (i.e. matches where they conceded no goals).  Same colour-scheme as before:

This graph highlights (in a way I was not expecting) the clear way that Sir Alex Ferguson's successors need to improve:  their teams need to score more goals.  I know that seems obvious, but we've identified that the team's defence is adequate, conceding fewer or the same number as in Alex Ferguson's season.  However, this data is a little oversimplified, since it also hides the 5-3 defeat I gave as an example above, where the press analysis after the match showed 'defensive frailties' in the Manchester United team.  Clearly more digging would be required to identify the true root cause - but I'd still start with 'How can we score more goals'.
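The three metrics, and the way a total can hide a single heavy defeat, can be illustrated with a short Python sketch. Apart from the (3, 5) entry, which mirrors the 5-3 defeat mentioned above, the scorelines are placeholders rather than real results, and the extra 'most conceded in one match' figure is my own addition for illustration.

```python
def summarise(matches):
    """Summarise a list of (goals_for, goals_against) tuples for one team."""
    return {
        "goals_scored": sum(gf for gf, _ in matches),
        "goals_conceded": sum(ga for _, ga in matches),
        "clean_sheets": sum(1 for _, ga in matches if ga == 0),
        # Extra figure, not in the original graph: exposes a one-off heavy defeat.
        "most_conceded_in_one_match": max(ga for _, ga in matches),
    }


# Placeholder scorelines; only (3, 5) corresponds to a match mentioned in this post.
example_matches = [(2, 0), (1, 1), (0, 0), (3, 5), (1, 0)]
print(summarise(example_matches))
```

In this made-up sample the defence looks respectable on totals and clean sheets, even though five of the six goals conceded came in a single game.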

A few caveats apply to this comparison:

- The first ten games for each season are not against the same teams, so the 2012-13 season may have been 'easier' than the subsequent seasons (in fact, David Moyes made this complaint before the 2013-14 season had even started).
- Ten games is not a representative sample of a 38-game season, but we're not looking at the season, we're just comparing how they start.  We aren't giving ourselves the benefit of hindsight.
- I am a Liverpool fan, and at the time of writing, the Liverpool manager has had a run of four straight defeats.  Perhaps I should have analysed his performance instead.  No football manager is perfect (and I hear that Arsenal are also having a bad season).

So:  should Manchester United sack Louis van Gaal?  Well, they didn't sack David Moyes until there were only about six matches left until the end of the season; it seems harsh to fire Louis van Gaal just yet (it seems that the main reason for sacking David Moyes was actually the Manchester United share price, which also recovered after he'd been fired).

I whole-heartedly endorse making data-supported decisions, but only if you have the full context.  Here, it's hard to call (I haven't got enough data), especially since you're only looking at a before/after analysis rather than an A/B test (which would be a luxury here, and probably involve time travel).  And that, I guess, is the fun (?) of sport.

Thursday, 6 November 2014

Building Momentum in Online Testing - Key Takeaways

As I mentioned in my previous post, I was recently invited to speak at the eMetrics Summit in London, and based on discussions afterwards, the content was really useful to the attendees.  I'm glad that people were able to find it useful, and here, I'd like to share some of the key points that I raised (and some that I forgot to mention).

Image Credit: eMetrics Summit official photography

There are a large number of obstacles to building momentum with an optimisation program, but most of them can be grouped into one of these categories:

A.  Lack of development resource (HTML and JavaScript developers)
B.  Lack of management buy-in and access to resource
C.  Tests take too long to develop, run, or call a winner
D.  Tests keep losing (or, perversely, tests keep winning and the view is that "testing is completed")
E.  Lack of design resource (UXers or designers)

These issues can be addressed in a number of ways, and the general ideas I outlined were:

1.  If you need to improve your win rate, or if you don't have much development resource, re-use your existing mboxes and iterate.  You won't need to wait for IT deployments or for a developer to code new 'mboxes', you can use them again, test and learn and test again.

2.  If you need to improve the impact of your tests (i.e. your tests are producing flat results, or the wins are very small) then make more dramatic changes to your test recipes, and be creative.  I commented that, generally speaking, the more differences there are between control and the test recipe, the greater the difference in performance (which may be positive or negative).  If you keep iterating and making small changes, you'll probably see smaller lifts or falls; if you take a leap into the unknown, you'll either fly or crash.

Remember not to throw out your analytics just because you're being creative - you'll need to look at the analytics carefully, as always, and any and all VOC data you have.  The key difference is that you're testing bigger changes, more changes, or both - you shouldn't be trying new ideas just because they seem good (you'll still need some reason for the recipe).

3.  If you need to get tests moving more quickly, then reduce the number of recipes per test.  More recipes means more time to develop; more time to run (less traffic per recipe per day) and more time to analyse the results afterwards.  Be selective - each recipe should address the original test hypothesis in a different way; you shouldn't need to add on recipe after recipe just because it looks like a good idea.  Also, only test on high-traffic or critical pages, where there's plenty of volume of traffic, or where it's mission-critical (for example, cart pages, or key landing pages).  As a bonus, if you work on optimising conversion or bounce rate for your PPC or display marketing traffic, you'll have an automatic champion in your online marketing department.

Extra:  If you do decide to run with a large number of recipes, then monitor the recipes' performance more frequently.  As soon as you can identify a recipe which is significantly and definitely underperforming vs control, switch it off.  This has two benefits:  a) you drive a larger share of traffic through the remaining recipes, and b) you're saving the business money because you've stopped traffic going through a low-converting (or low-performing) recipe - which was costing money.

4.  Getting management buy-in and support on an ongoing basis:  this is not easy, especially when analysts are, stereotypically, numbers-people rather than people-people. We find it easier to work with numbers than to work with people, since numbers are clear-cut and well-defined, and people can be... well... messy and unpredictable.  Brooks Bell have recently released a blog post about five ways to manage up, which I recommend.  The main recommendation is to get out there and share.  Share your winners (pleasant) and your losers (unpleasant), but also explain why you think a test is winning or losing.  This kind of discussion will lead naturally on to, "Well, it lost because this component was too big/too small/in the wrong place." and starts to inform your next test.
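Point 3 and its 'Extra' note can be sketched in Python. Everything here is an illustrative assumption: the traffic and order numbers are invented, the even traffic split is a simplification, and the pooled two-proportion z-test with a strict 1% threshold is just one common way to decide that a recipe is 'significantly and definitely' behind control. Checking results repeatedly raises the chance of a false alarm, which is one reason to keep the threshold strict.

```python
import math


def days_to_run(daily_visitors, recipes_including_control, visitors_needed_per_recipe):
    """Days needed for every recipe to reach the target sample, with traffic split evenly."""
    return math.ceil(visitors_needed_per_recipe * recipes_including_control / daily_visitors)


def p_value_worse_than_control(control_orders, control_visitors, recipe_orders, recipe_visitors):
    """One-sided p-value for 'recipe converts worse than control' (pooled two-proportion z-test)."""
    p_control = control_orders / control_visitors
    p_recipe = recipe_orders / recipe_visitors
    pooled = (control_orders + recipe_orders) / (control_visitors + recipe_visitors)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_visitors + 1 / recipe_visitors))
    z = (p_recipe - p_control) / std_err
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z


# Hypothetical traffic: more recipes means a longer test.
for recipes in (2, 3, 5):
    print(recipes, "recipes:", days_to_run(4_000, recipes, 20_000), "days")

# Hypothetical results: (orders, visitors) for control and two test recipes.
ALPHA = 0.01  # assumed strict threshold
control = (500, 10_000)
recipes_so_far = {"B": (400, 10_000), "C": (490, 10_000)}
to_switch_off = [
    name
    for name, (orders, visitors) in recipes_so_far.items()
    if p_value_worse_than_control(*control, orders, visitors) < ALPHA
]
print("switch off:", to_switch_off)
```

In this invented example only recipe B is far enough behind control to be switched off; recipe C is slightly down, but well within what chance alone could produce.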

I also talked through my ideas on what makes a good test idea, and what makes for a bad test idea; here's the diagram I shared on 'good test ideas'.

In this diagram, the top circle defines what your customers want, based on your analysis; the lower left circle defines your coding capabilities and the lower right defines ideas that are aligned with your company brand and which are supported by your management team.

So where are the good test ideas?  You might think that they are in segment D.  In fact, these are recommendations for immediate action.  The best test ideas are close to segment D, but not actually in it; the areas around segment D are the best places - where two of the three circles intersect, but where the third is nearly aligned too.  For example, in segment F, we have ideas that the developers can produce, and which management are aligned with, but where there is doubt about whether they will help customer experience.  Here, the idea may be a new way of customising or personalising your product in your order process - upgrading the warranty or guarantee; adding a larger battery or a special waterproof coating (whatever your product may be).  This may work well on your site, but it may also be too complex.  Your customer experience data may show that users want more options for customising and configuring their purchase - but is this the best way to do it?  Let's test!

I also briefly covered bad test ideas - things that should not be tested.  There's a short list:

Don't test improvements such as fixes for bugs, broken links, broken image links, and spelling and grammar mistakes.  There's no point - it's a clear winner.

Don't test fixes for historic bugs in your page templates - for example where you're integrating newer designs or elements (product videos, for example) that weren't catered for when the layout was originally built.  The alignment of the elements on the page is a little off, things don't fit or line up vertically or horizontally - these can be improved with a test, but really, this isn't fixing the main issue, which is that the page needs fixing.  The test will show the financial upside of making the fix (and this would be the only valid case for running the test) but the bottom line is that a test will only prove what you already know.

I wrapped up my keynote by mentioning the need to select your KPIs for the test, and for that, I have to confess that I borrowed from a blog post I wrote earlier this year, which was a sporting example of metrics.

Presenting the "metrics in sport" slide, Image Credit: Aurelie Pols

I'm already looking forward to the next conference, which will probably be in 2015!

Tuesday, 4 November 2014

Building momentum in your online optimisation program (eMetrics UK)

At the end of October, I spoke at eMetrics London.  I was invited by Peter O'Neill to present at the conference, and I anticipated that I would be speaking as part of a track on optimisation or testing.  However, Peter put me on the agenda with the keynote at the start of the second day, a slot I feel very honoured to have been given.

Jim Sterne, my Web Analytics hero, presenting
Selfie: a quick last-minute practice
Peter O'Neill, eMetrics UK organiser

I thoroughly enjoyed presenting - and I'm still learning how to make formal web analytics presentations (and probably always will be) - but for me the highlight of the Summit was meeting and talking with Jim Sterne, the Founding President and current Chairman of the Digital Analytics Association, and the Founder of the eMetrics Marketing Optimization Summit.  I've been following him since before Twitter and Facebook, through his email newsletter "Sterne Measures" - and, as he kindly pointed out to me when I mentioned this, "Oh, you're old!"  Jim gave a great keynote presentation on going from "Bits and Bytes to Insights" which has to be one of the clearest and most comprehensive presentations on the history and future of web analytics that I've ever heard.

My topic for the keynote was "Building momentum in your online optimisation program."  From my discussions at various other conferences, I've noted that people are no longer so concerned with getting an online testing program started and overcoming the initial obstacles; instead, many analysts are now struggling to keep it running.  I've previously blogged on getting a testing program off the ground, and this topic is more about keeping it up in the air.  While putting the final parts of the presentation together I determined not to re-use the material from my blog - as much as possible.  The emphasis in my presentation was on how to take the first few tests and move towards a critical mass as quickly as possible - where test ideas and test velocity will increase sufficiently that there will be continuous ongoing interest in your tests - winners and losers, so that you'll be able to make a significant, consistent improvement to your company's website.

I'm just getting resettled back into the routine of normal work, but I'll share the key points (including some parts I missed) from my presentation in a future blog post as soon as I can.

Monday, 13 October 2014

Queen's Gambit Declined

I recently won my first face-to-face Chess game for months.  I think it's only my third or fourth win this year, so I'm really pleased.  It wasn't a perfectly-played game, but it went well enough.

David Leese (rated 95) Kidsgrove vs Ben Lack (rated 64) Newcastle
30 September 2014, South Cheshire Shield, Board 4

As you can see, I played the Queen's Gambit, as I often do, and have discovered the move 4. Bf4 from some of the reading I've been doing.  I was very surprised by my opponent's move 4. ... g6 which seemed a little early, and took my chance to play something a little unusual myself; 5. Nb5, threatening 6. Nxc7+

Black's move 4. ... g6 significantly weakened the darker squares on his kingside, and although I couldn't exploit them immediately, I knew this was something I could work on in the future.  I envisaged that Black's response to 5. Nb5 would be ... Bd6, where I would capture, and after his recapture with his c-pawn, he would have doubled pawns.  I didn't expect his reply which was 5. ... Bb4+.

I had to retreat my knight, but I wasn't expecting him to return his bishop to f8, as he did on move 7.  He's clearly desperate to fianchetto his bishop on g7, and is going to take as many moves as necessary to make it happen.  In theory, anyway.

Another disadvantage of having all those pawns on light squares is that it severely impeded his kingside knight, a situation I was able to exploit with my kingside pawns in the start of the middlegame.  Black's pawns are unable to support his knight on f6, and by pinning it with my bishop on g5, I was able to obtain a pawn advantage.  I was then able to construct various further threats against the knight, and drive it back to g8, cramping my opponent and preventing him from castling.

I made a blunder on move 14.  In order to develop my queen, possibly castle queenside and watch black's queen, I had just played 13. Qb3, which attacks the pawn on b7.  Black pushed this pawn, and I missed an opportunity to threaten his queen, trap it and gain material.  I should have played 14. Bd2, which would be followed with the threat of Nb5 and Nc7+ winning a rook or the queen (which would have previously been forced to a6).  As it happened, I played 14. Bd3 which looked like a natural developing move, and my opponent was able to complicate the game and escape from this trap - but not without losing material.  I was able to win his bishop (following his interesting but inaccurate sacrifice) and then start simplifying - a process my opponent seemed happy to help me with. 

I was surprised by 17. ... Na6, as I was predicting Nc6, which would have kept the knight away from the bishop and made it harder for me to capture the b-pawn, while attacking my pawn on e5.

Speaking of my pawn on e5, I made an unfortunate blunder on move 29, when I failed to protect it!  I was so busy wondering about how to push my a-pawn and close out my opponent's passed pawn on d5 that I missed the obvious and helpful f4 pawn push.

All in all, I am pleased with this game.  I'm pleased with the result, but also pleased that I spotted some key tactics (although I didn't fully appreciate them at the time) and that I also noticed some of my errors or misses while I was at the board and was able to address them and still develop a win.  So far this season, I have played three, won one, drawn one and lost one, so it's a better start than last season, and I've ended my losing streak of seven or eight games!