
Monday, 24 November 2014

Real-Life Testing and Measuring KPIs - Manchester United

I enjoy analytics and testing, and applying them to online customer experience - using data to inform ways of improving a website.  Occasionally, it occurs to me that life would be great if we could do 'real life' testing - which is the quickest way home; which is the best meal to order; which coat should I wear today (is it going to rain)?  Instead, we have to be content with before/after analysis - make a decision, make a change, and see the difference.

One area which I also like to look at periodically is sport - in particular, football (soccer).  I've used football as an example in the past, to show the importance of picking the right KPIs.  In football, there's no A/B testing - which players should a manager select, which formation should they play in - it's all about making a decision and seeing what happens.

One of my least favourite football teams is Manchester United.  As a child, my friends all supported Liverpool, and so I did too, having no strong feeling on the subject at the time.  I soon learned, however, that as Liverpool fans, it was traditional to dislike Manchester United, due to the long-standing (and ongoing) rivalry between the clubs.  So I have to confess to a slight feeling of superiority whenever Manchester United perform badly.  Since the departure of their long-serving manager, Alex Ferguson, they've seen a considerable drop in performance, and much criticism has been levelled at his two successors, first David Moyes, and now Louis van Gaal.  David Moyes had a poor season (by Man Utd's standards) and was fired before it ended.  His replacement, Louis van Gaal, has not fared much better thus far.  Here's a comparison of their performance, measured in cumulative points won after each game [3 points for a win, 1 for a draw, 0 for a loss].
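
For anyone who wants to reproduce this kind of chart themselves, here's a minimal Python sketch of the cumulative-points calculation; the result sequences below are invented for illustration, not the actual fixtures.

# Minimal sketch: cumulative points from a sequence of results (W/D/L).
# The example sequences are hypothetical, not the real 2013-14 / 2014-15 fixtures.
POINTS = {"W": 3, "D": 1, "L": 0}

def cumulative_points(results):
    """Running points total after each game."""
    totals, running = [], 0
    for result in results:
        running += POINTS[result]
        totals.append(running)
    return totals

print(cumulative_points(["W", "L", "D", "W", "W"]))  # [3, 3, 4, 7, 10]
print(cumulative_points(["L", "D", "W", "L", "D"]))  # [0, 1, 4, 4, 5]
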
So, how bad is it?

Well, we can see that performance in the current season (thick green line) is lower than last season (the blue line).  Indeed, after game 10 in early November 2014, the UK media identified that this was the worst start to the season since 1986.  But since then, there's been an upturn in performance and at the time of writing, Manchester United have won their last two matches.  So perhaps things aren't too bad for Louis van Gaal.  However, the situation looks slightly different if we overlay the line for the previous season, 2012-2013, which was Sir Alex Ferguson's final season in charge.

You can see the red line indicating the stronger performance that Manchester United achieved with Sir Alex Ferguson, and how the comparison between the two newer managers pales into insignificance when you look at how they've performed against him.  There's a message here about comparing two test recipes when they've both performed badly against the control recipe, but we'll move on.

There have already been some interesting results for Manchester United this season - in particular, a 5-3 defeat by Leicester City (a much smaller team who had just been promoted to the Premier League, and were generally regarded as underdogs in this match).  The defeat rewrote the history books.  Among other things...

- It was the first time Leicester had scored five or more goals in 14 years
- It was the first time Man Utd have ever conceded four or more goals in a Premier League game against a newly-promoted team
- It was the first time Leicester City have scored four or more goals against Manchester Utd in the league since April 1963

But apart from the anecdotal evidence, what statistical evidence could we point to that would highlight the reason for the recent decline in performance?  Where should the new manager focus his efforts for improvement, based on the data?  (I haven't watched any of the matches in question.)

Let's compare three useful metrics that show Manchester United's performance over the first 10 games of the season:  goals scored, goals conceded and clean sheets (i.e. matches where they conceded no goals).  Same colour-scheme as before:


This graph highlights (in a way I was not expecting) the clear way in which Sir Alex Ferguson's successors need to improve:  their teams need to score more goals.  I know that seems obvious, but we've identified that the team's defence is adequate, conceding fewer goals than, or the same number as, in Alex Ferguson's final season.  However, this data is a little oversimplified, since it also hides the 5-3 defeat I gave as an example above, where the press analysis after the match pointed to 'defensive frailties' in the Manchester United team.  Clearly more digging would be required to identify the true root cause - but I'd still start with 'How can we score more goals?'


Disclaimers:
- The first ten games for each season are not against the same teams, so the 2012-13 season may have been 'easier' than the subsequent seasons (in fact, David Moyes made this complaint before the 2013-14 season had even started).
- Ten games is not a representative sample of a 38-game season, but we're not looking at the season, we're just comparing how they start.  We aren't giving ourselves the benefit of hindsight.
- I am a Liverpool fan, and at the time of writing, the Liverpool manager has had a run of four straight defeats.  Perhaps I should have analysed his performance instead.  No football manager is perfect (and I hear that Arsenal are also having a bad season).

So:  should Manchester United sack Louis van Gaal?  Well, they didn't sack David Moyes until there were only about six matches left in the season, so it seems harsh to fire Louis van Gaal just yet (the main reason for sacking David Moyes appears to have been the Manchester United share price, which recovered after he'd been fired).  I shall keep reviewing Manchester United's results to see how the team performs, and how the share price tracks it.

I whole-heartedly endorse making data-supported decisions, but only if you have the full context.  Here, it's hard to call (I haven't got enough data), especially since you're only looking at a before/after analysis compared to an A/B test (which would be a luxury here, and probably involve time travel).  And that, I guess, is the fun (?) of sport.

More articles on data analysis in football:

Should Chelsea Sack Jose Mourinho? (it was relevant at the time I wrote it)
How exciting is the English Premier League?  (quantifying a qualitative metric)
The Rollarama World Football Dice Game (a study in probability)


Thursday, 6 November 2014

Building Momentum in Online Testing - Key Takeaways

As I mentioned in my previous post, I was recently invited to speak at the eMetrics Summit in London, and, based on discussions afterwards, the content was genuinely useful to the attendees.  I'm glad people found it so, and here I'd like to share some of the key points that I raised (and some that I forgot to mention).

Image Credit: eMetrics Summit official photography

There are a large number of obstacles to building momentum with an optimisation program, but most of them can be grouped into one of these categories:


A.  Lack of development resource (HTML and JavaScript developers)
B.  Lack of management buy-in and access to resource
C.  Tests take too long to develop, run, or call a winner
D.  Tests keep losing (or, perversely, tests keep winning and the view is that "testing is completed")
E.  Lack of design resource (UXers or designers)

These issues can be addressed in a number of ways, and the general ideas I outlined were:

1.  If you need to improve your win rate, or if you don't have much development resource, re-use your existing mboxes and iterate.  You won't need to wait for IT deployments or for a developer to code new 'mboxes'; you can use them again - test, learn, and test again.

2.  If you need to improve the impact of your tests (i.e. your tests are producing flat results, or the wins are very small), then make more dramatic changes to your test recipes.  I commented that, generally speaking, the more differences there are between control and the test recipe, the greater the difference in performance (which may be positive or negative).  If you keep iterating and making small changes, you'll probably see smaller lifts or falls; if you take a leap into the unknown, you'll either fly or crash.

Remember not to throw out your analytics just because you're being creative - you'll need to look at the analytics carefully, as always, and any and all VOC data you have.  The key difference is that you're testing bigger changes, more changes, or both - you shouldn't be trying new ideas just because they seem good (you'll still need some reason for the recipe).

3.  If you need to get tests moving more quickly, then reduce the number of recipes per test.  More recipes means more time to develop, more time to run (less traffic per recipe per day), and more time to analyse the results afterwards.  Be selective - each recipe should address the original test hypothesis in a different way; you shouldn't need to add recipe after recipe just because it looks like a good idea.  Also, only test on high-traffic or mission-critical pages (for example, cart pages, or key landing pages).  As a bonus, if you work on optimising conversion or bounce rate for your PPC or display marketing traffic, you'll have an automatic champion in your online marketing department.

Extra:  If you do decide to run with a large number of recipes, then monitor the recipes' performance more frequently.  As soon as you can identify a recipe which is clearly and significantly underperforming vs control, switch it off.  This has two benefits:  a) you drive a larger share of traffic through the remaining recipes, and b) you save the business money, because you've stopped sending traffic through a low-converting (or low-performing) recipe - which was costing money.
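
As a rough illustration of what "significantly underperforming" might look like, here's a Python sketch of a simple two-proportion z-test; the conversion numbers and the 1.96 threshold are illustrative assumptions, and your testing tool's own statistics should take precedence.

from math import sqrt

def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-score: recipe A vs recipe B (e.g. a recipe vs control)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Illustrative numbers only: the recipe converts 430/10000, control converts 520/10000.
z = z_score(430, 10000, 520, 10000)
if z < -1.96:  # roughly 95% confidence that the recipe really is worse
    print(f"z = {z:.2f}: recipe is underperforming control - consider switching it off")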

4.  Getting management buy-in and support on an ongoing basis:  this is not easy, especially when analysts are, stereotypically, numbers-people rather than people-people.  We find it easier to work with numbers than with people, since numbers are clear-cut and well-defined, and people can be... well... messy and unpredictable.  Brooks Bell have recently released a blog post about five ways to manage up, which I recommend.  The main recommendation is to get out there and share.  Share your winners (pleasant) and your losers (unpleasant), but also explain why you think a test is winning or losing.  This kind of discussion leads naturally on to, "Well, it lost because this component was too big/too small/in the wrong place," and starts to inform your next test.

I also talked through my ideas on what makes a good test idea, and what makes for a bad test idea; here's the diagram I shared on 'good test ideas'.

In this diagram, the top circle defines what your customers want, based on your analysis; the lower left circle defines your coding capabilities and the lower right defines ideas that are aligned with your company brand and which are supported by your management team.

So where are the good test ideas?  You might think that they are in segment D.  In fact, those are recommendations for immediate action.  The best test ideas are close to segment D, but not actually in it; the areas around segment D - where two of the three circles intersect, and the third is nearly aligned too - are the best places.  For example, in segment F, we have ideas that the developers can produce, and which management are aligned with, but where there is doubt about whether they will help the customer experience.  Here, the idea may be a new way of customising or personalising your product in your order process - upgrading the warranty or guarantee; adding a larger battery or a special waterproof coating (whatever your product may be).  This may work well on your site, but it may also be too complex.  Your customer experience data may show that users want more options for customising and configuring their purchase - but is this the best way to do it?  Let's test!


I also briefly covered bad test ideas - things that should not be tested.  There's a short list:

Don't test making improvements such as bug fixes, broken links, broken image links, spelling and grammar mistakes.  There's no point - it's a clear winner.  

Don't test fixes for historic bugs in your page templates - for example, where you're integrating newer designs or elements (product videos, say) that weren't catered for when the layout was originally built.  The alignment of the elements on the page is a little off; things don't fit or line up vertically or horizontally.  These could be improved with a test, but really, this isn't fixing the main issue, which is that the page needs fixing.  The test will show the financial upside of making the fix (and this would be the only valid case for running the test), but the bottom line is that a test will only prove what you already know.

I wrapped up my keynote by mentioning the need to select your KPIs for the test, and for that, I have to confess that I borrowed from a blog post I wrote earlier this year, which was a sporting example of metrics.
Presenting the "metrics in sport" slide, Image Credit: Aurelie Pols
I'm already looking forward to the next conference, which will probably be in 2015!

Tuesday, 4 November 2014

Building momentum in your online optimisation program (eMetrics UK)

At the end of October, I spoke at eMetrics London.  I was invited by Peter O'Neill to present at the conference, and I anticipated that I would be speaking as part of a track on optimisation or testing.  However, Peter put me on the agenda with the keynote at the start of the second day, a slot I feel very honoured to have been given.


Jim Sterne, my Web Analytics hero, presenting
Selfie: a quick last-minute practice
Peter O'Neill, eMetrics UK organiser
I thoroughly enjoyed presenting - I'm still learning about making formal web analytics presentations (and probably always will be) - but for me the highlight of the Summit was meeting and talking with Jim Sterne, the Founding President and current Chairman of the Digital Analytics Association, and the Founder of the eMetrics Marketing Optimization Summit.  I've been following him since before Twitter and Facebook, through his email newsletter "Sterne Measures" - and, as he kindly pointed out to me when I mentioned this, "Oh, you're old!"  Jim gave a great keynote presentation on going from "Bits and Bytes to Insights", which has to be one of the clearest and most comprehensive presentations on the history and future of web analytics that I've ever heard.

My topic for the keynote was "Building momentum in your online optimisation program."  From my discussions at various other conferences, I've noted that people are less concerned with getting an online testing program started and overcoming the initial obstacles; many analysts are now struggling to keep one running.  I've previously blogged on getting a testing program off the ground, and this topic is more about keeping it up in the air.

While putting the final parts of the presentation together, I was determined not to re-use the material from my blog, as far as possible.  The emphasis in my presentation was on how to take the first few tests and move towards critical mass as quickly as possible - the point where test ideas and test velocity increase sufficiently that there is continuous, ongoing interest in your tests (winners and losers), so that you can make a significant, consistent improvement to your company's website.


I'm just getting resettled back into the routine of normal work, but I'll share the key points (including some parts I missed) from my presentation in a future blog post as soon as I can.

Monday, 13 October 2014

Queen's Gambit Declined

I recently won my first face-to-face Chess game for months.  I think it's only my third or fourth win this year, so I'm really pleased.  It wasn't a perfectly-played game, but it went well enough.

David Leese (rated 95) Kidsgrove vs Ben Lack (rated 64) Newcastle
30 September 2014, South Cheshire Shield, Board 4


As you can see, I played the Queen's Gambit, as I often do, and had discovered the move 4. Bf4 from some of the reading I've been doing.  I was very surprised by my opponent's move 4. ... g6, which seemed a little early, and I took my chance to play something a little unusual myself; 5. Nb5, threatening 6. Nxc7+.

Black's move 4. ... g6 significantly weakened the dark squares on his kingside, and although I couldn't exploit them immediately, I knew this was something I could work on later.  I envisaged that Black's response to 5. Nb5 would be ... Bd6, when I would capture, and after his recapture with the c-pawn he would have doubled pawns.  I didn't expect his actual reply, 5. ... Bb4+.

I had to retreat my knight, but I wasn't expecting him to return his bishop to f8, as he did on move 7.  He's clearly desperate to fianchetto his bishop on g7, and is going to take as many moves as necessary to make it happen.  In theory, anyway.

Another disadvantage of having all those pawns on light squares is that it severely impeded his kingside knight, a situation I was able to exploit with my kingside pawns at the start of the middlegame.  Black's pawns were unable to support his knight on f6, and by pinning it with my bishop on g5, I was able to obtain a pawn advantage.  I was then able to construct various further threats against the knight, and drive it back to g8, cramping my opponent and preventing him from castling.

I made a blunder on move 14.  In order to develop my queen, possibly castle queenside and keep watch on Black's queen, I had just played 13. Qb3, which attacks the pawn on b7.  Black pushed this pawn, and I missed an opportunity to threaten his queen, trap it and gain material.  I should have played 14. Bd2, followed by the threat of Nb5 and Nc7+, winning a rook or the queen (which would previously have been forced to a6).  As it happened, I played 14. Bd3, which looked like a natural developing move, and my opponent was able to complicate the game and escape from this trap - but not without losing material.  I was able to win his bishop (following his interesting but inaccurate sacrifice) and then start simplifying - a process my opponent seemed happy to help me with.

I was surprised by 17. ... Na6, as I was expecting Nc6, which would have kept the knight away from the bishop and made it harder for me to capture the b-pawn, while attacking my pawn on e5.

Speaking of my pawn on e5, I made an unfortunate blunder on move 29, when I failed to protect it!  I was so busy wondering how to push my a-pawn and shut down my opponent's passed pawn on d5 that I missed the obvious and helpful f4 pawn push.

All in all, I am pleased with this game.  I'm pleased with the result, but also pleased that I spotted some key tactics (although I didn't fully appreciate them at the time) and that I also noticed some of my errors or misses while I was at the board and was able to address them and still develop a win.  So far this season, I have played three, won one, drawn one and lost one, so it's a better start than last season, and I've ended my losing streak of seven or eight games!

Wednesday, 10 September 2014

How to set up and analyse a multi-variate test

I've written at length about multi-variate tests.  I've discussed barriers, complexity and design, and each time I've concluded by saying that I would write an article about how to analyse the results from a multi-variate test.  This is that article.

I'm going to use the example I set up last time:  testing the components of a banner to optimise its effectiveness.  The success metric has been decided and it's click-through rate (for the sake of argument).

There are three components that are going to be tested:
- should the picture in the banner be a man or a woman?
- should the text in the banner say "On Sale!" or "Buy now!"
- should the text be black or red?

Here are a few example recipes from my previous post on MVT.


Recipe 1
Recipe 2
Recipe 3
Recipe 4

Recipe selection and test plan

When there are three components with two options for each, the total number of possible recipes is 2^3 = 8 recipes.  However, by using MVT, we can run just four recipes and through analysis identify which of the combinations is the best (whether it was one of the original four we tested, or one that we didn't test), and we do this by looking at the effect each component has.  The effect of each component is often called the element contribution.


In order to run the multi-variate test with four recipes (instead of an A/B/n test with all eight recipes) we need to carefully select the recipes we run - we can't just pick four at random.  We need to make sure that the four recipes cover each variation of each element.  For example, the set of four shown above (Recipes 1-4) does not have a version with a red 'On Sale!' element, so we can't compare red against black.  It is possible to run a multi-variate test covering the 2^3 combinations with just four recipes, but we'll need to be slightly more selective.  Using mathematical language, the set of recipes we use has to be orthogonal (i.e. they "point" in different directions - in geometry, 90 degrees apart - so have almost nothing in common).  In IT circles, it would be called orthogonal array testing (warning: the Wikipedia entry is full of technical vocabulary).
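
To make this concrete, here's a small Python sketch that builds one such orthogonal half-fraction for the banner example; the selection rule (keep the recipes whose level indices sum to an even number) is one standard way of constructing it, and it happens to produce the A-D set used later in this post.

from itertools import product

factors = {
    "Gender":  ["Man", "Woman"],
    "Colour":  ["Red", "Black"],
    "Wording": ["Sale", "Buy"],
}

# All 2^3 = 8 possible recipes.
all_recipes = list(product(*factors.values()))

# Keep the half-fraction whose level indices sum to an even number -
# one standard way to choose four mutually orthogonal recipes.
levels = list(factors.values())
half_fraction = [
    recipe for recipe in all_recipes
    if sum(levels[i].index(level) for i, level in enumerate(recipe)) % 2 == 0
]

for recipe in half_fraction:
    print(recipe)  # ('Man', 'Red', 'Sale'), ('Man', 'Black', 'Buy'), ...

# Sanity check: each level of each factor appears exactly twice in the subset.
for i, options in enumerate(levels):
    for option in options:
        assert sum(r[i] == option for r in half_fraction) == 2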

Many tools will identify the set of recipes to test - Adobe's Test and Target does this, for example; alternatively, I'm sure that your account manager with your tool provider will be able to work with you to identify the set you need.


Here, then, is the full set of eight recipes that I could have for my MVT, and the four recipes that I would need to run on my site:

The full set of eight recipes
Recipe Gender Colour Wording
S Man Red Sale
T Man Red Buy
U Man Black Sale
V Man Black Buy
W Woman Red Sale
X Woman Red Buy
Y Woman Black Sale
Z Woman Black Buy

Recipes S, V, X and Y represent one possible set of four recipes that would form a successful MVT set.  There are others (for example, the remaining four - T, U, W and Z - are a complete set too).

An example set of four recipes that could be tested
successfully

Recipe Gender Colour Wording
A Man Red Sale
B Man Black Buy
C Woman Red Buy
D Woman Black Sale

Notice that in the full set of eight recipes, each variation (man or woman, red or black, sale or buy) appears four times each.  In the subset of four recipes to be tested, each variation appears twice, and this confirms that the subset is suitable for testing.

The visuals for the four approved test recipes are:


Recipe A
Recipe B
Recipe C
Recipe D

And we can see by inspection that the four recipes do indeed have two with the man, two with the woman; two with red text and two with black; two with "Buy Now!" and two with "On Sale!"


The next step is to run the test as if it were an A/B/C/D test - with one caveat:  it's quite possible that one or more of the four test recipes will do very badly (or very well) compared to the others.  Even so, it's highly recommended (though not essential) that you run all four recipes for the same length of time, and allow them to receive equal volumes of traffic.  In an MVT run, it's important to have a large enough population of visitors for each recipe - it's not just about running until one of the four is significantly better (or worse) than the others and calling a winner.

Analysis

Let's assume that we've run the test, and obtained the following data:

Recipe A B C D
Gender Man Woman Woman Man
Wording Buy Now Buy Now On Sale On Sale
Colour Black Red Black Red
Impressions 1010 1014 1072 1051
Clicks 341 380 421 291
Click-through rate 34% 37% 39% 28%

It looks from these results as if the winner is Recipe C; the picture of the woman, with black text saying, "On Sale!".  However, there are four other recipes that we didn't test, but we can infer their relative performance by doing some judicious arithmetic with the data we have.

To begin with, we can identify which colour is better, black or red, by comparing the two recipes which have black text against the two recipes which have red text.


This might seem dangerous or confusing, but let's think about it.  The two recipes which have black text are A and C.  For recipe A, we have a man with "Buy Now!", and for recipe C, we have a woman with "On Sale!".  The net result of combining recipes A and C is to isolate everybody who saw black text, with the other elements reduced to noise (no net contribution from either element).  This works logically when we compare A and C with the combination of B and D.  B and D both have red text, but half have a man and half have a woman; half have "On Sale!" and half have "Buy Now!".  The consequence is that we can isolate the effect of black text against red text - the other factors are reduced to noise.


We could think of this mathematically, using simple expressions:

A+C = (Man + Buy Now + Black) + (Woman + On Sale + Black)
A+C = Man + Woman + Buy Now + On Sale + 2xBlack


B+D = (Woman + Buy Now + Red) + (Man + On Sale + Red)
B+D = Man + Woman + Buy Now + On Sale + 2xRed

Subtracting one from the other, and cancelling like terms...
(A+C) - (B+D) = 2xBlack - 2xRed

When we compare A+C and B+D, we get this:

Recipe A+C (black) B+D (red)
Total impressions 2082 2065
Total clicks 762 671
CTR 36.6% 32.5%

So we can see that A+C (black) is better than B+D (red) - and we can attribute an element contribution of +12.63% to the colour black.

We can also do the maths to obtain the best gender and wording:

Gender:  A+D = man, B+C = woman
Recipe A+D B+C
Total impressions 2061 2086
Total clicks 632 801
CTR 30.7% 38.4%
Result:  woman is 25.2% better than man (on CTR in this test ;-) )

Wording: A+B = Buy Now, C+D = On Sale
Recipe A+B C+D
Total impressions 2024 2123
Total clicks 721 712
CTR 35.6% 33.5%
Result:  Buy Now is 6.22% better than On Sale
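
If you'd like to check the arithmetic, here's a short Python sketch that reproduces all three element contributions directly from the impression and click counts in the results table above.

# Impressions and clicks per recipe, taken from the results table above.
results = {"A": (1010, 341), "B": (1014, 380), "C": (1072, 421), "D": (1051, 291)}

def pooled_ctr(recipes):
    """CTR of two recipes combined, e.g. pooled_ctr("AC")."""
    impressions = sum(results[r][0] for r in recipes)
    clicks = sum(results[r][1] for r in recipes)
    return clicks / impressions

# Each element is isolated by pooling the pair of recipes that share it.
comparisons = [
    ("Colour: black vs red",        "AC", "BD"),
    ("Gender: woman vs man",        "BC", "AD"),
    ("Wording: Buy Now vs On Sale", "AB", "CD"),
]

for label, pair_1, pair_2 in comparisons:
    lift = pooled_ctr(pair_1) / pooled_ctr(pair_2) - 1
    print(f"{label}: {lift:+.2%}")
# Prints roughly +12.6%, +25.2% and +6.2% - matching the results above.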


Summarising our results:

Result:  black is 12.63% better than red
Result:  woman is 25.2% better than man

Result:  Buy Now is 6.22% better than On Sale

The winner!
The winning combination is the woman with black "Buy Now!" text - a recipe that we didn't actually include in our test.  The recommended follow-up is to test the winning recipe from the four that we did run against the proposed winner from the analysis we've just done.  Where that isn't possible, for whatever reason, you could test your existing control design against the proposed winner.  Alternatively, you could simply implement the theoretical winner without testing - it's up to you.

A brief note on the analysis:  this shows the importance of keeping all test recipes running for an equal length of time, so that they receive approximately equal volumes of traffic.  Here, recipes A, B, C and D all received around 1,000 impressions, but if one of them had received significantly fewer (because it was switched off early for "not performing well"), then that recipe would not have an equal weighting in the calculations where we compared the pairs of recipes, and its perceived performance would be higher than its actual performance.



I hope I've been able to show in this article (and the previous one) how it's possible to set up and analyse a multi-variate test, starting with the principles of identifying the variables you want to test, then establishing which recipes are required, and then showing how to analyse the results you obtain.

Here's my series on Multi Variate Testing

Preview of Multi Variate testing
Web Analytics: Multi Variate testing 
Explaining complex interactions between variables in multi-variate testing
Is Multi Variate Testing an Online Panacea - or is it just very good?
Is Multi Variate Testing Really That Good 
Hands on:  How to set up a multi-variate test
And then: Three Factor Multi Variate Testing - three areas of content, three options for each!
---

Image credits: 
man  - http://www.findresumetemplates.com/job-interview
woman - http://www.sheknows.com/living 

Thursday, 28 August 2014

Telling a Story with Web Analytics Data

Management demands actionable insights - not just numbers, but KPIs, words, sentences and recommendations.  It's therefore essential that we, as web analysts and optimisers, are able to transform data into words - and better still, stories.  Consider a report with too much data and too little information - it reads like a science report, not a business readout:

The report concerns four main characters:
Character A: female, aged 7 years old.  Approximately 1.3 metres tall.
Character B:  male, aged 5 years old.
Character C: female, aged 4 years old.
Character D:  male, aged 1 year old.

The main items in the report are a small cottage, a 1.2 kw electric cooker, 4 pints of water, 200 grams of dried cereal and a number of assorted iron and copper vessels, weighing 50-60 grams each.

After combining most of the water and the dried cereal in the largest of the copper vessels, Character B prepared a mixture which reached around 70 degrees Celsius.  He dispensed this unevenly into three of the smaller vessels, in order to allow the mixture to reach thermal equilibrium with its surroundings.  Characters B, C and D then walked 1.25 miles in 30 minutes, averaging just over 4 km/h.  In the interim, Character A took some empirical measurements of the mixture, finding Vessel 1 to still be at a temperature close to 60 degrees Celsius, Vessel 2 to be at 70 degrees Fahrenheit, and Vessel 3 to be at 315 Kelvin, which she declared to be optimal.

The report continues with Character A consuming all of the mixture in Vessel 3, then single-handedly testing (in some case destruction testing) much of the furniture in the small cottage.

The problem is:  there's too much data and not enough information. 

The information is presented in various formats - lists, sentences and narrative.


- Some of the data is completely irrelevant (the height of Character A, for example).
- Some of it is misleading (the ages of the other characters lack context).
- Some of it is presented in a mish-mash of units (temperatures are stated four times, with three different units).
- The calculation of the walking characters' speed is unclear - the distance is given in miles, the time in minutes, and the speed in kilometres per hour (if you are familiar with the abbreviation km/h).

Of course, this is an exaggeration, and as web analytics professionals, we wouldn't do this kind of thing in our reporting. 

- Visitors are called visitors, and we consistently refer to them as visitors (and we ensure that this definition is understood among our readers).
- Conversion rates are based on visitors, even though this may require extra calculation, since our tools provide figures based on visits (or sessions).
- The percentage of traffic coming from search is quoted as visitors (not called users), and not visits (whether you use visitors or visits is up to you, but be consistent).
- Would you include the number of users who use search?  And the conversion rate for users of search?
- And when you say 'conversion', are you consistently talking about 'user added an item to cart', or 'user completed a purchase and saw the thank-you page'?
- Are you talking about the most important metrics?
 
So - make sure, for starters, that your units, data and KPIs are consistent, contextual, or at least make sense.  And then:  add the words to the numbers.  It's only the start to say, "We attracted 500 visitors with paid search, at a total cost of £1,200."  Go on to talk about the cost per visitor (£2.40 in this case), and break it down into key details by talking about the most expensive keywords and the ones that drove the most traffic.  But then tell the story:  there's a sequence of events between a user seeing your search term, clicking on your ad, visiting your site, and [hopefully] converting.  Break it down into chronological steps and tell the story!
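
As a toy illustration of "adding the words to the numbers", here's a tiny Python sketch that turns the paid-search figures above into a readable sentence; the variable names and wording are invented for the example.

# Toy example: turning the raw paid-search numbers above into a sentence.
visitors = 500
total_cost = 1200.00  # pounds
cost_per_visitor = total_cost / visitors  # £2.40

print(
    f"Paid search attracted {visitors} visitors at a total cost of "
    f"£{total_cost:,.0f}, i.e. £{cost_per_visitor:.2f} per visitor."
)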

There are various ways to ensure that you're telling the story; my favourites are to answer these types of questions:
"You say that metric X has increased by 5%.  Is that a lot?  Is that good?"
 "WHY has this metric gone up?"
"What happened to our key site performance indicators (profit, revenue, conversion) as a result?"
and my favourite:
"What should we do about it?"

There are, of course, various ways to hide the story, or to disguise results that are not good (i.e. that do not meet sales or revenue targets) - I did this in my anecdote at the start.  However, management tend to start by looking at the incomplete, obscure or irrelevant data, and go on to ask about the data that's "missing"... the truth will out, so it's better to show the data, tell the whole story, and highlight why things are below par.

It's our role to highlight when performance is down - we should be presenting the issues (nobody else has the tools to do so) and then going on to explain what needs to be done - this is where actionable insights become invaluable.  In the end, we present the results and the recommendations and then let the management make the decision - I blogged about this some time ago - web analytics: who holds the steering wheel?

In the case of Characters A, B, C and D, I suggest that Characters B and C buy a microwave oven, and improve their security to prevent Character A from breaking into their house and stealing their breakfast.  In the case of your site, you'll need to use the data to tell the story.

Other articles I've written on Website Analytics that you may find relevant:

Web Analytics - Gathering Requirements from Stakeholders

Thursday, 14 August 2014

I am a power-tool A/B skeptic

I have recently enjoyed reading Peter W Szabo's article entitled, "I am an A/B testing skeptical."  Sure, it's a controversial title (especially when he shared it in the LinkedIn Group for web analysts and optimisers), but it's thought-provoking nonetheless.

And reading it has made me realise:  I am a power-drill skeptic. I've often wondered what the benefit of having the latest Black and Decker power tool might actually be.  After all, there are plenty of hand drills out there that are capable of drilling holes in wood, brick (if you're careful) and even metal sheet. The way I see it, there are five key reasons why power drills are not as good as hand-drilling (and I'm not going to discuss the safety risks of holding a high-powered electrical device in your hand, or the risks of flying dust and debris).

5.  There's no consistency in the size of hole I drill.

I can use a hand drill and by watching how hard I press and how quickly I turn the handle, I can monitor the depth and width of the hole I'm drilling.  Not so with a power drill - sometimes it flies off by itself, sometimes it drills too slowly.  I have read about this online, and I've watched some YouTube videos.  I have seen some experienced users (or professionals, or gurus, or power users) drill a hole which is 0.25 ins diameter and 3 ins deep, but when I try to use the same equipment at home, I find that my hole is much wider (especially at the end) and often deeper.  Perhaps I'm drilling into wood and they're drilling into brick? Perhaps I'm not using the same metal bits in my power drill?  Who knows?



4.  Power drill bits wear out faster.

Again, in my experience, the drill bits I use wear out more quickly with a power drill.  Perhaps leaving them on the side isn't the best place for them, especially in a damp environment.  I have found that my hand drill works fine because I keep it in my toolbox and take care of it, but having several drill bits for my power tool means I don't have space or time to keep track of them all; what happens is that I often try to drill with a power-drill bit that's worn out and a little bit rusty, and the results aren't as good as when the drill bits were new.  The drill bits I buy at Easter are always worn out and rusty by Christmas.

The professionals always seem to be using shiny new tools and bits, but not me.  But, as I said, this hasn't been a problem previously because having one hand-drill with only a small selection of bits has made it easier to keep track of them.  That's a key reason why power tools aren't for me. 

3.  Most power drills are a waste of time.

Power drills are expensive, especially when compared to the hand tool version.  They cost a lot of money, and what's the most you can do with them?  Drill holes.  Or, with careful use, screw in screws.  No, they can't measure how deep the hole should be, or how wide.  Some models claim to be able to tell you how deep your hole is while you're drilling it, but that's still pretty limited.  When I want to put up a shelf, I end up with a load of holes in a wall that I don't want, but that's possibly because I didn't think about the size of the shelf, the height I wanted it or what size of plugs I need to put into the wall to get my shelf to stay up (and remain horizontal).  Maybe I should have measured the wall better first, or something.

2.  I always need more holes 

As I mentioned with power drills being a waste of time, I often find that compared to the professionals I have to drill a lot more holes than usual.  They seem to have this uncanny ability to drill the holes in exactly the right places (how do they do that?) and then put their bookshelves up perfectly.  They seem to understand the tools they're using - the drill, the bits, the screws, the plugs, the wall - and yet when I try to do this with one of their new-fangled power-drills, I end up with too many holes.  I keep missing what I'm aiming for; perhaps I need more practice.  As it is, when I've finished one hole, I can often see how I could make it better and what I need to do, and get closer and closer with each of the subsequent holes I drill.  Perhaps the drill is just defective?

1.  Power drills will give you holes, but they won't necessarily be the right size

 This pretty much sums up power drills for me, and the largest flaw that's totally inherent in power tools.  I've already said that they're only useful for drilling holes, and that the holes are often too wide, too short and in the wrong place. In some cases, when one of my team has identified that the holes are in the wrong place, they've been able to quickly suggest a better location - only to then find that that's also incorrect, and then have two wrong holes and still no way of completing my job.  It seems to me that drilling holes and putting up bookshelves (or display shelving, worse still) is something that's just best left to the professionals.  They can afford the best drill bits and the most expensive drills, and they also have the money available to make so many mistakes - it's clear to me that they must have some form of Jedi mind power, or foreknowledge of the kinds of holes they want to drill and where to drill them. 

In conclusion:

Okay, you got me, perhaps I am being a little unkind, but I genuinely believe that A/B testing and the tools to do it are extremely valuable.  There are a lot of web analytics and A/B professionals out there, but there is also a large number of amateurs who want to try their hand at online testing and who get upset or confused when it doesn't work out.  Like any skilled profession, if you want to do analytics and optimisation properly, you can be sure it's harder than it looks (especially if it looks easy).  It takes time and thought to run a good test (let alone build up a testing program) and to make sure that you're hitting the target you're aiming for.  Why are you running this test?  What are you trying to improve?  What are you trying to learn?  It takes more than just the ability to split traffic between two or more designs to run a test.

Yes, I've parodied Peter W Szabo's original article, but that's because it seemed to me the easiest way to highlight some of the misconceptions that he's identified, and which exist in the wider online optimisation community - especially the ideas that 'tests will teach you useful things', and the underlying misconception that 'testing is quick and easy'.  I will briefly mention that you need a reason to run a test (just as you need a reason to drill a hole) and you need to do some analytical thinking (using other tools, not just testing tools) in the same way as you would use a spirit level, a pencil and a ruler when drilling a hole.

Drilling the hole in the wall is only one step in the process of putting up a bookshelf; splitting traffic in a test should be just one step in the optimisation process, and should be preceded by some serious thought and design work, and followed up with careful review and analysis.  Otherwise, you'll never put your shelf up straight, and your tests will never tell you anything.