
Monday, 3 March 2014

Multi Variate Testing - Online Panacea?

I've discussed multi variate testing previously - outlining the theory, the ideas, the maths and the ways in which it can be done.  But, in my discussions with other web analytics and optimisation professionals, it seems that MVT isn't really being used all that widely.  This surprised me at first - after all, the number of tool vendors and suppliers who offer MVT is growing all the time, and I assumed from their sales material that it was the next level of A/B testing and the future of online optimisation.  Additionally, it's often marketed as an online panacea that will show the way forward for your ecommerce business and bring in double-digit growth (in whichever metric you'd care to measure).

However, out of a dozen or so online professionals that I've spoken to in EMEA, only one had tried it, and had obtained mixed results.  So, why isn't it being taken up and used as widely as I'd expected?  Here are some possibilities:

1.  It's difficult to code
2.  It's difficult to identify MVT opportunities
3.  It's quicker to do an A/B test
4.  It's difficult to explain to the Boss

Let's take a look at a simple example of MVT, which will hopefully address the first two challenges that online optimisation professionals face.  I say 'simple' because it's easier than most test ideas: it only involves making some straightforward changes to a web page - taking things away.

Our content pages; our product pages; our shopping and ecommerce pages are all full of the most important content we can produce for our visitors - glossy images; descriptive text; eye-catching call-to-action buttons; all working together to produce the perfect digital shopping experience.  Or perhaps they aren't.  Perhaps it's a huge mish-mash of competing elements, some of which are helping, and some of which are distracting users and putting them off.  So:  what's working, and what isn't?

Let's take an example from maplin.co.uk  - they sell a wide range of electronics and electrical items.  I've selected one at random, a keyring torch.  I've highlighted below various parts of the page which could be removed as part of a test (I should probably say at this point that this test will require access to the global template for product description pages - if this isn't going to work for you, read ahead to another example). 



 The product page is very similar to many other ecommerce pages (similar layouts are used on various sites to sell clothes, furniture, games, toys... you name it).  But what's the value of each component, and how do they work together?  I've covered interactions between elements in MVT previously.  The easiest way of working out the optimum combination of elements is to selectively remove them in a multi-variate test.

Here's the recipe definition for each of the various combinations that are possible:


Recipe   Reviews   Social   Tabs   Banner
A        Yes       Yes      Yes    Yes
B        Yes       Yes      Yes    No
C        Yes       Yes      No     Yes
D        Yes       Yes      No     No
E        Yes       No       Yes    Yes
F        Yes       No       Yes    No
G        Yes       No       No     Yes
H        Yes       No       No     No
I        No        Yes      Yes    Yes
J        No        Yes      Yes    No
K        No        Yes      No     Yes
L        No        Yes      No     No
M        No        No       Yes    Yes
N        No        No       Yes    No
O        No        No       No     Yes
P        No        No       No     No

Note that Recipe A is the control state (with all elements present) and Recipe P removes everything; in between are the various combinations of the four elements.  (If you're feeling mathematical, you can see how the pattern for the four elements counts down in a binary-type way - 1111, 1110, 1101, and so on - and how the table has certain symmetries.)  The number of recipes is the number of options for each element (yes or no means 2 options) raised to the power of the number of elements (four elements), so 2^4 = 2 x 2 x 2 x 2 = 16 recipes.
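If you want to sanity-check that recipe count, here's a minimal Python sketch (my own illustration - the element names are just labels for the four page components above) that enumerates the full set of combinations:

from itertools import product

elements = ["Reviews", "Social", "Tabs", "Banner"]

# product() with repeat=4 yields (True, True, True, True) first and
# (False, False, False, False) last - the same A-to-P order as the table.
recipes = list(product([True, False], repeat=len(elements)))

for i, combo in enumerate(recipes):
    name = chr(ord("A") + i)                      # A, B, C, ... P
    flags = "  ".join("Yes" if on else "No " for on in combo)
    print(name, flags)

print(len(recipes))                               # 2**4 = 16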


So:  sixteen recipes like this is simply not realistic for a normal A/B/C/D/n test.  The traffic requirements are far too high, and you'd probably be waiting six months for results.  However, because the elements are independent (you don't have to have the reviews included to have the social bar), we can carry out a multivariate test which has only a sample of these recipes, selected to ensure even coverage of the four elements, and which will (with the appropriate tools) enable you to work out the optimal combination, even if you didn't test it.
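As a hedged illustration of "a sample of these recipes, selected to ensure even coverage", here's one classic way to pick a balanced half-fraction (eight of the sixteen recipes): fix the fourth element's setting as the "product" of the other three.  This is a textbook 2^(4-1) fractional factorial, not necessarily what your particular MVT tool does under the hood:

from itertools import product

# Sketch of a 2^(4-1) half-fraction: keep the 8 recipes where the product of
# the +1/-1 codes is +1 (defining relation: Banner = Reviews x Social x Tabs).
elements = ["Reviews", "Social", "Tabs", "Banner"]
code = {True: 1, False: -1}                       # Yes -> +1, No -> -1

half_fraction = [
    combo for combo in product([True, False], repeat=4)
    if code[combo[0]] * code[combo[1]] * code[combo[2]] * code[combo[3]] == 1
]

for combo in half_fraction:
    print({e: ("Yes" if on else "No") for e, on in zip(elements, combo)})

# 8 recipes; each element is 'Yes' in exactly four of them, and every pair of
# elements appears in all four Yes/No combinations equally often.
print(len(half_fraction))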
This example was on a product information page, and as I mentioned above, if you want to test here, your coders will need access to the global template file so that you can run the test across all product information pages.  There are, however, single-page options that would work just as well:
- landing pages for online/offline marketing campaigns
- your home page
- checkout pages

In these cases, each page is (typically) built for a specific purpose and with specific content, so you have much more flexibility on what you can test.  For example, should you have a "Chat online" option and a telephone number on your landing page, as well as an option for online feedback?  Are all three really needed?

This testing has some key advantages: 

1.  You can test a large number of element changes on the page in one go.
2.  You can understand (with accurate analysis) the contribution each element makes to page performance.
3.  There's no new content required from the design or marketing teams - you're only working with existing content - so there's no reliance on them for new images or copy.
4.  It's usually easier to remove page elements with code than it is to insert them, so your code developers will be happier.
5.  It's relatively easy to explain what you've tested to the Boss.
6.  In this case, it's definitely quicker than A/B testing, and the more elements you choose to test, the larger the advantage becomes.

It also has some key requirements:

A.  You're going to need to be able to interpret the results.  This will require some careful analysis and an understanding of the maths behind multi-variate testing, in order to work out what each element is contributing (in a positive or negative way) - there's a rough sketch of the idea after these requirements.  Many of the tools that are available (here's a list of some of them) promise this kind of analysis, but I'm not aware of it being widely used, so it may be prudent to discuss your requirements with your account manager (I don't work for a tool provider).  You don't really want to get to the end of a test and discover that you have spent eight weeks collecting a mountain of data that you can't climb... that would really require some explaining to the boss.


B.  You're going to need more traffic than a typical A/B test, even if you're using a mathematical method (such as the Taguchi method) to reduce the recipe requirements, so be prepared to wait longer than usual for your results.
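To illustrate requirement A, here's a rough sketch of the kind of analysis I mean - my own simplification with made-up conversion figures, not the method any specific tool uses: compare the average conversion of the recipes where an element was shown against the recipes where it was removed.

# Sketch: estimate each element's contribution from per-recipe conversion.
# The recipes and conversion rates below are invented placeholders.
recipes = [
    ({"Reviews": True,  "Social": True,  "Tabs": True,  "Banner": True},  0.031),
    ({"Reviews": True,  "Social": False, "Tabs": False, "Banner": True},  0.034),
    ({"Reviews": False, "Social": True,  "Tabs": False, "Banner": False}, 0.029),
    ({"Reviews": False, "Social": False, "Tabs": True,  "Banner": False}, 0.027),
    # ... the rest of the recipes you actually ran ...
]

for element in ["Reviews", "Social", "Tabs", "Banner"]:
    with_it    = [cr for settings, cr in recipes if settings[element]]
    without_it = [cr for settings, cr in recipes if not settings[element]]
    effect = sum(with_it) / len(with_it) - sum(without_it) / len(without_it)
    print(f"{element}: average change in conversion when shown = {effect:+.4f}")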


I hope in this blog post I've been able to encourage you to think about using MVT, and shown you how to overcome some of the initial hurdles to getting an MVT idea together - and hopefully into execution.  Please do let me know (either in the comments, or by contacting me) how your efforts go!

Here's my series on Multi Variate Testing

Preview of Multi Variate testing
Web Analytics: Multi Variate testing 
Explaining complex interactions between variables in multi-variate testing
Is Multi Variate Testing an Online Panacea - or is it just very good?
Is Multi Variate Testing Really That Good - (that's this article)
Hands on:  How to set up a multi-variate test
And then: Three Factor Multi Variate Testing - three areas of content, three options for each!
 

Thursday, 9 January 2014

When Good Tests Fail

Seth Godin, the marketing author and blogger, recently put it simply: 'The answer to the question, "What if I fail?" is "You will."  The real question is, "What after I fail?"'

Despite rigorous analytics, careful usability studies and thoughtful designing, the results from your latest A/B test are bad.  Conversion worsened; average order value plummeted and people bounced off your home page like it was a trampoline.  Your test failed.  And, if you're taking it personally (and most online professionals do take it very personally), then you failed too.

But, before the boss slashes your optimisation budget, you have the opportunity to rescue the test, by reviewing all the data and understanding the full picture.  Your test failed - but why?  I've mentioned before that tests which fail draw far more attention than those which win - it's just human nature to explore why something went wrong, and we like to attribute blame or responsibility accordingly.  That's why I pull apart my Chess games to find out why I lost.  I want to improve my Chess (I'm not going to stop playing, or fire myself from playing Chess).

So, the boss asks the questions: Why did your test fail?  (And it's suddenly stopped being his test, or our test... it's yours.)  Where's the conversion uplift we expected?  And why aren't profits rising?

It's time to review the test plan, the hypothesis and the key questions. Which of these apply to your test?

Answer 1.  The hypothesis was not entirely valid.  I have said before that, "If I eat more chocolate, I'll be able to run faster because I will have more energy."  What I failed to consider is that I'm not training enough to burn off all that extra energy, so it's being stored as fat: the chocolate has made me heavier, and hence I'm actually running more slowly.

Or, in an online situation:  the idea was proved incorrect.  Somewhere, one of the assumptions that was made was wrong.  This is where the key test questions come in.  The analysis that comes from answering these key questions will help retrieve your test from 'total failure' to 'learning experience'.

Sometimes, in an online context, the change we made in the test had an unforeseen side-effect.  We thought we were driving more people from the product pages to the cart, but they just weren't properly prepared.  We had the button at the bottom of the page, and people who scrolled to the bottom of the page saw the full specs of the new super-toaster and how it needs an extra battery-pack for super-toasting.  We moved the button up the page, more people clicked on it, but realised only at the cart page that it needed the additional battery pack.  We upset more people than we helped, and overall conversion went down.

Unforeseen side-effects in testing leading to adverse performance: too much chocolate slows down 100 m run times due to increased body mass.
Answer 2.  The visual design of the test recipe didn't address the test hypothesis or the key test questions.  In any lab-based scientific experiment, you would expect to set up the apparatus and equipment and take specific measurements based on the experiment you were doing.  You would also set up the equipment to address the hypothesis - otherwise you're just messing about with lab equipment.  For example, if you wanted to measure the force of gravity and how it affects moving objects, you wouldn't design an experiment with a battery, a thermometer and a microphone. 

However, in an online environment, this sort of situation becomes possible, because different people possess the skills required to analyse data and the skills to design banners etc, and the skills to write the HTML or JavaScript code.  The analyst, the designer and the developer need to work closely together to make sure that the test design which hits the screen is going to answer the original hypothesis, and not something else that the designer believes will 'look nice' or that the developer finds easier to code.  Good collaboration between the key partners in the testing process is essential - if the original test idea doesn't meet brand guidelines, or is extremely difficult to code, then it's better to get everybody together and decide what can be done that will still help prove or disprove the hypothesis.


To give a final example from my chocolate-eating context, I wouldn't expect to prove that chocolate makes me run faster by eating crisps (potato chips) instead.  Unless they were chocolate-coated crisps?  Seriously.


Answer 3.  Sometimes, the test design and execution was perfect, and we measured the right metrics in the right way.  However, the test data shows that our hypothesis was completely wrong.  It's time to learn something new...!

My hypothesis said that chocolate would make me run faster; but it didn't.  Now, I apologise that I'm not a biology expert and this probably isn't correct, but let's assume it is, review the 'data' and find out why.  


For a start, I put on weight (because chocolate contains fat), but worse still, the sugar in chocolate was also converted to fat, and it wasn't converted back into sugar quickly enough for me to benefit from it while running the 100 metres.  Measurements of my speed show I got slower, and measurements of my blood sugar levels before and after the 100 metres showed that the blood sugar levels fell, because the fat in my body wasn't converted into glucose and transferred to my muscles quickly enough.  Additionally, my body mass rose 3% during the testing period, and further analysis showed this was fat, not muscle.  This increased mass also slowed me down.



Back to online:  you thought people would like it if your product pages looked more like Apple's.  But Apple sell a limited range of products - one phone, one MP3 player, one desktop PC, etc. while you sell 15-20 of each of those, and your test recipe showed only one of your products on the page (the rest were hidden behind a 'View More' link), when you get better financial performance from a range of products.  Or perhaps you thought that prompting users to chat online would help them go through checkout... but you irritated them and put them off.  Perhaps your data showed that people kept leaving your site to talk to you on the phone.  However, when you tested hiding the phone number, in order to get people to convert online, you found that sales through the phone line went down, as expected, but your online sales also fell because people were using the phone line for help completing the online purchase.  There are learnings in all cases that you can use to improve your site further - you didn't fail, you just didn't win ;-)

In conclusion: yes, sometimes test recipes lose.  Hypotheses were incorrect, assumptions were invalid, side-effects were missed, and sometimes the test just didn't ask the question it was meant to.  The difference between a test losing and a test failing is in the analysis, and that comes from planning - having a good hypothesis in the first place, and asking the right questions up front which will show why the test lost (or, let's not forget, why a different test won).  In the meantime, fail fast and learn quickly!





Tuesday, 7 January 2014

The Key Questions in Online Testing

As you begin the process of designing an online test, the first thing you'll need is a solid test hypothesis.  My previous post outlined this, looking at a hypothesis, HIPPOthesis and hippiethesis.  To start with a quick recap, I explained that a good hypothesis says something like, "IF we make this change to our website, THEN we expect to see this improvement in performance BECAUSE we will have made it easier for visitors to complete their task."  Often, we have a good idea about what the test should be - make something bigger, have text in red instead of black... whatever.  

Stating the hypothesis in a formal way will help to draw the ideas together and give the test a clear purpose.  The exact details of the changes you're making in the test, the performance change you expect, and the reasons for the expected changes will be specific to each test, and that's where your web analytics data or usability studies will support your test idea.  For example, if you're seeing a large drop in traffic between the cart page and the checkout pages, and your usability study shows people aren't finding the 'continue' button, then your hypothesis will reflect this.

In between the test hypothesis and the test execution are the key questions.  These are developed from your hypothesis, and they are the questions the test should answer.  They should tie very closely to the hypothesis, and they will direct the analysis of your test data; without them, your test data will lack focus and you'll struggle to tell the story of the test.  Think about what your test should show - what you'd like it to prove - and what you actually want to answer, in plain English.

Let's take my offline example from my previous post.  Here's my hypothesis:  "If I eat more chocolate, then I will be able to run faster because I will have more energy."

It's good - but only as a hypothesis (I'm not saying it's true, or accurate, but that's why we test!).  But before I start eating chocolate and then running, I need to confirm the exact details of how much chocolate, what distance and what times I can achieve at the moment.  If this was an ideal offline test, there would be two of me, one eating the chocolate, and one not.  And if it was ideal, I'd be the one eating the chocolate :-)

So, the key questions will start to drive the specifics of the test and the analysis.  In this case, the first key question is this:  "If I eat an additional 200 grams of chocolate each day, what will happen to my time for running the 100 metres sprint?"

It may be 200 grams or 300 grams; the 100m or the 200m, but in this case I've specified the mass of chocolate and the distance.  Demonstrating the 'will have more energy' will be a little harder to do.  In order to do this, I might add further questions, to help understand exactly what's happening during the test - perhaps questions around blood sugar levels, body mass, fat content, and so on.  Note at this stage that I haven't finalised the exact details - where I'll run the 100 metres, what form the chocolate will take (Snickers? Oreos? Mars?), and so on.  I could specify this information at this stage if I needed to, or I could write up a specific test execution plan as the next section of my test document.



In the online world I almost certainly will be looking at additional metrics - online measurements are rarely as straightforward as offline.  So let's take an online example and look at it in more detail.

"If I move the call-to-action button on the cart page to a position above the fold, then I will drive more people to start the checkout process because more people will see it and click on it."

And the key questions for my online test?

"How is the click-through rate for the CTA button affected by moving it above the fold?"
"How is overall cart-to-complete conversion affected by moving the button?"
"How are these two metrics affected if the button is near the top of the page or just above the fold?"


As you can see, the key questions specify exactly what's being changed - maybe not to the exact pixel, but they provide clear direction for the test execution.  They also make it clear what should be measured - in this case, there are two conversion rates (one at page level, one at visit level).  This is perhaps the key benefit of asking these core questions:  they drive you to the key metrics for the test.
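As a rough illustration of comparing those rates once the data is in - all the figures below are invented - a standard two-proportion z-test will tell you whether the difference in click-through rate between control and test is bigger than chance alone would explain:

from math import erf, sqrt

def compare_rates(clicks_a, views_a, clicks_b, views_b):
    # Two-proportion z-test on click-through rates (control = a, test = b).
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided
    return p_a, p_b, z, p_value

# Control: button below the fold.  Test: button above the fold.
p_a, p_b, z, p_value = compare_rates(410, 9800, 505, 9750)
print(f"CTR {p_a:.2%} -> {p_b:.2%}, z = {z:.2f}, p = {p_value:.3f}")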

"Yes, but we want to measure revenue and sales for our test."


Why?  Is your test meant to improve revenue and sales?  Or are you looking to reduce bounce rate on a landing page, or improve the consumption of learn content (whitepapers, articles, user reviews etc) on your site?  Of course, your site's reason-for-being is to generate sales and revenue.  Your test data may show a knock-on improvement in revenue and sales, and yes, you'll want to make sure that these vital site-wide metrics don't fall off a cliff while you're testing, but if your hypothesis says, "This change should improve home page bounce rate because..." then I propose that it makes sense to measure bounce rate as the primary metric for test success.  I also suspect that you can quickly tie bounce rate to a financial metric through some web analytics - after all, I doubt that anyone would think of trying to improve bounce rate without some view of how much a successful visitor generates.

So:  having written a valid hypothesis which is backed by analysis, usability or other data (and not just a go-test-this mentality from the boss), you are now ready to address the critical questions for the test.  These will typically be, "How much....?" and "How does XYZ change when...?" questions that will focus the analysis of the test results, and will also lead you very quickly to the key metrics for the test (which may or may not be money-related).

I am not proposing to pack away an extra 100 grams of chocolate per day and start running the 100 metres.  It's rained here every day since Christmas and I'm really not that dedicated to running.  I might, instead, start on an extra 100 grams of chocolate and measure my body mass, blood cholesterol and fat content.  All in the name of science, you understand. :-)

Monday, 6 January 2014

Chess: King's Gambit 1. e4 e5 2. f4

After my most recent post, where I played as Black against the Bird Opening 1. f4, in this post I'd like to cover another game where I again faced White playing an f4 opening - in this case, the King's Gambit.  I played this against one of my Kidsgrove Chess Club team mates, and this time, I lost.  Badly.  I had just suffered a difficult 32-move defeat against the team's top player, and I will make the excuse that I wasn't playing my best here.  I will cover that game in a later blog post.

I'm covering this game here, because interestingly, in the previous game where I played against 1. f4, I won by playing Qh4+ early on in the game.  If I'd been more aware, I might have seen it here, too.

David Johnson vs Dave Leese, 17 December 2013, Kidsgrove Chess Club (Friendly)

1.e4 e5
2.f4 Nc6
3.fxe5


It's unusual to have a critical point in a game so early, but here it is.  I played the natural recapture with Nxe5 and slowly got into all sorts of trouble.  I missed the Qh4+ move that I played (and won with) just two weeks earlier.

3. ... Nxe5 ( 3...Qh4+ )

Let's look at 3. ... Qh4+ before reviewing the actual game in full. There are two replies - to move the King to e2 or to block with a Pawn on g3.

3. ... Qh4+
4.  Ke2 Qxe4+
5.  Kf2 Bc5+



If then 6. Kg3, then Chessmaster 9000 (my preferred analysis tool) is already giving a mate in 8, starting with 6. ... h5.  The other option is 6. d4 and this is going to mean a significant loss of material - 6 ... Bxd4+ 7. Qxd4 Qxd4+ and White has delayed the inevitable at the cost of his Queen.

However, I completely missed this overwhelming attack, and instead went through a painful game where I fell into all sorts of trouble.  Let's resume after 3. ... Nxe5

4.d4 Ng6 (I could still have played ... Qh4+, or ... Bb4+ at this point).
5.Nf3 d6 (there are no more chances of ... Qh4+ now, and Chessmaster recommends ... d5).
6.Bc4 h6 (preventing Ng5 and an attack on f7 - in theory, anyway).
7.O-O Bg4?  (a big mistake, as we shall see.  Better was Nf6 or Be6... something - anything - to protect f7.  White's decision to castle was not just a natural move at this stage, it moved the Rook onto the semi-open and dangerous f-file).
The position after 7. ... Bg4 and immediately before Bxf7+!
8.Bxf7+! Ke7 ( not 8...Kxf7 which leads to Ng5++ and Nf7 forking Queen and Rook)
9.Bxg6 Nf6 (finally!)
10.Nc3 c6 (opening a diagonal for my Queen, and providing space for my King)
11.Qe1 Kd7 (taking advantage of the slow pace of the game to improve my King and Queen)
12.e5 Nd5
13.Nxd5 cxd5
White played Qg3? and missed exd6 with a large material gain.
14.Qg3? Be6 (my wayward Bishop finally gets a decent square, even if the Pawn on f7 is gone)
15.Nh4 Qb6 (developing, and attacking the newly-unprotected d4 Pawn)
16.c3 Be7 (perhaps ... dxe5 was better)
17.Nf5 Rhf8
18.Nxe7 Rxf1+
19.Kxf1 Rf8+ (No, I'm not sure why I threw this in.  I needed as many pieces as possible on the board, but at least I got rid of White's active Rook in exchange for my inactive one).
20.Kg1 Kxe7
21.exd6+ Kd7
22.Qe5 Qxd6 (offering an exchange, but also protecting the Rook on f8).
23.Qxg7+ Kc6
24.Qe5 Qxe5
25.dxe5 Rg8
( 25...Bh3 26.gxh3 Rg8 27.Bf4 Rxg6+ )
26.Bh5 Bh3
27.Bf3 Bf5
28.Bxh6 1-0


The final position.  I've run out of ideas, I'm a piece and three pawns down, and I've had enough!  Seeing afterwards that I missed several opportunities for a massive attack in the first few moves has made me more alert to my attacking options, and seeing how I missed my opponent's attack developing (in this game and the previous one) has made me even more aware of the need to defend accurately too.  Yes, I put up a fight, but really I was defending a lost cause due to some daft blunders.  On with the next game!


Friday, 3 January 2014

Chess: Bird Opening 1. f4 e6

After a long break from Chess writing, I'm returning with an analysis of a game I recently played face-to-face.  I've played almost exclusively online for a number of years, and recently decided it was time to start playing 'real' people.  I've joined Kidsgrove Chess Club, and signed up to the English Chess Federation in order to obtain a ranking through the games I play.

My first game in this new face-to-face era is a friendly - the rest of the club were involved in a match against Stafford Chess Club, and I was able to play against one of the Stafford team after he completed his game.

I played Black, and for possibly the first time ever, I faced the Bird Opening, 1 f4.  My reply was 1. ... e6, intending at some early point to get my Queen to h4 and deliver a potentially uncomfortable check. 

David Barker vs Dave Leese, 4 December 2013, Kidsgrove Chess Club (Home, Friendly)

1.f4 e6
2.b3 d5
3.Bb2 Nf6
4.e3 c5

If there is space in the centre to be claimed, I'll claim it.  My opponent seems to be playing for a very slow build-up.  I am resisting my urge to play my normal, attacking game and am playing cautiously - after all, I don't want to lose quickly in my very first game in front of my new team-mates.  Also, I am concerned about the white Bishop on b2 and the way it looks into my kingside - I brought my Knight out to f6 in order to liberate my Bishop from having to defend h7.  In time, I may play ... d4 and look to shut White's Bishop out of the game.

5.Nf3 Nc6
6.Be2 Bd6
7.d4 

In this position, I opted to play 7. ... O-O.  I don't want to capture on d4 - White could recapture with his Bishop or his Knight and start to develop a grip on the centre.  Additionally, I have a brief tactic developing where I begin to attack the now-backward pawn on e3.  Here's what I was thinking at the time:

7. ... O-O
8. dxc5  Bxc5 attacking e3
9. Nd4  Nxd4

Now after 10. exd4, White has weakened pawns, and I can attack them with 10. ... Bd6 11. O-O Qc7, and I have the option of moving my Knight on f6 to an even better square.

The position after my theoretical 11. ... Qc7
However, the game didn't proceed that way at all, and we resume the game after my move 7. ... O-O

8.  Ne5?  Ne4

I was very surprised by my opponent's decision to play Ne5.  This Knight is the only piece protecting the dark squares on the kingside and preventing me from getting in a Qh4+ and hopefully initiating a king-side attack at some point.  I moved my Knight to e4 in order to open the diagonal for my Queen, and also to take a look at f2, the weak spot in White's position.  It's also a great outpost for my Knight, as my opponent has played f4 and d4.

9.  Nd2  Qh4+  0-1

I suspect my opponent saw the threat of my Knight on e4, but missed the Queen check, and after a moment's thought, resigned the game.  The point is that after 10. g3 Nxg3, 11. Rg1 Qxh2 or 11. Ndf3 Qh5 wins a pawn and starts a moderate king-side attack (preventing White from castling king-side and causing longer term complications).  I was surprised at the early resignation, but pleased that my first face-to-face game in front of my new team-mates was a win.

The position after 9 ... Qh4+ and White resigned.

We took back White's ninth move, and instead White played 9. O-O, and play continued:

9.  O-O  f6?
10.Nxc6 bxc6

11.Nd2 Nxd2  (I'm not sure I should have given up my Knight on this great outpost, but I suspect my opponent would have swapped them off anyway).
12.Qxd2 cxd4

I took this pawn in order to straighten out my own doubled pawns.

13.Bxd4 c5
14.Bb2 Bb7
15.Rad1 Qb6?


I missed the gathering threat on the d-file.  I moved my Queen to an attacking position, planning to advance my c-pawn and expose an attack on the diagonal to the King, and missed the attack on my own d-pawn. Following this, I got into a potentially very messy position where I could have lost at least a pawn.

16.c4 Qc6
17.Bf3 Qa6
18.Bc3 Bc6
19. e4?

After a lot of shuffling around (which surprised me, I was sure my position was going to fall apart) my opponent played 19. e4.  I was lining up my Bishop and Queen so that I could re-capture on d4 with my Bishop before my Queen.  One benefit of 17... Qa6 was the attack on a2, which required a defence, but otherwise, I was scrambling around for acceptable moves until this point.  19. e4 gave me the chance I needed to reinforce my position and get out of trouble, and I played this move very quickly.

19.... d4
20.Ba1 Rab8
(moving onto the semi-open file)

21.Qe2? Bxf4
22.Bg4 Be3+
23.Kh1 Bxe4

My opponent later said he wasn't having a great night - he'd previously played another member of the Kidsgrove team and lost, and I guess he was becoming tired.  Or just having a bad day, but I had moved two pawns ahead through two blunders (although 22. ... Be3+ is one of my favourite moves of this game).  After 23 ... Bxe4 I was two pawns up and they were connected passed pawns on the d- and e- files.  There were a few exchanges made as I started to trade off pieces, then I began advancing my passed pawns and getting my rooks involved.  After move 29, my opponent resigned, as I completed my defence and started looking to push my passed pawns.

24.Bf3 Bxf3
25.Rxf3 e5
26.Rg3 Rfe8
27.Qc2 e4
28.Qe2 Bf4
29.Rh3 Rbd8 0-1


The final position, after 29. ... Rbd8.

Not a perfect game - I made a few blunders throughout - but I held it together and took my opportunities.  We discussed possible continuations, and Qh5 and Qxc5 look like a good start for White, but Black can reply with Qxa2, and Black's passed pawns present a continued threat.  All in all, an interesting game, and an enjoyable return to face-to-face Chess.

I have played some better games though, and I can recommend these:

Playing the English Defence
My first face-to-face club game
My earliest online Chess game
My very earliest Chess game (it was even earlier than I thought)
The Chess game I'm most proud of - where I made the situation too complicated for my opponent, causing him to lose a piece; I then found a fork and finished off with a piece sacrifice

Wednesday, 24 July 2013

The Science of A Good Hypothesis

Good testing requires many things:  good design, good execution, good planning.  Most important of all is a good idea - a good hypothesis - yet many people jump into testing without a good reason for doing so.  After all, testing is cool, it's capable of fixing all my online woes, and it'll produce huge improvements to my online sales, won't it?

I've talked before about good testing, and, "Let's test this and see if it works," is an example of poor test planning.  A good idea, backed up with evidence (data, or usability testing, or other valid evidence) is more likely to lead to a good result.  This is the basis of a hypothesis, and a good hypothesis is the basis of a good test.

What makes a good hypothesis?  What, and why.

According to Wiki Answers, a hypothesis is, "An educated guess about the cause of some observed (seen, noticed, or otherwise perceived) phenomena, and what seems most likely to happen and why. It is a more scientific method of guessing or assuming what is going to happen."

In simple testing terms, a hypothesis states what you are going to test (or change) on a page, what the effect of the change will be, and why the effect will occur.  To put it another way, a hypothesis is an "If ... then... because..." statement.  "If I eat lots of chocolate, then I will run more slowly because I will put on weight."  Or, alternatively, "If I eat lots of chocolate, then I will run faster because I will have more energy." (I wish).



However, not all online tests are born equal, and you could probably place the majority of them into one of three groups, based on the strength of the original theory.  These are tests with a hypothesis, tests with a HIPPOthesis and tests with a hippiethesis.

Tests with a hypothesis

These are arguably the hardest tests to set up.  A good hypothesis will rely on the test analyst sitting down with data, evidence and experience (or two out of three) and working out what the data is saying.  For example, the 'what' could be that you're seeing a 93% drop-off between the cart and the first checkout page.   Why?  Well, the data shows that people are going back to the home page, or the product description page.  Why?  Well, because the call-to-action button to start checkout is probably not clear enough.  Or we aren't confirming the total cost to the customer.  Or the button is below the fold.

So, you need to change the page - and let's take the button issue as an example for our hypothesis.  People are not progressing from cart to checkout very well (only 7% proceed).  [We believe that] if we make the call to action button from cart to checkout bigger and move it above the fold, then more people will click it because it will be more visible.

There are many benefits of having a good hypothesis, and the first one is that it will tell you what to measure as the outcome of the test.  Here, it is clear that we will be measuring how many people move from cart to checkout.  The hypothesis says so.  "More people will click it" - the CTA button - so you know you're going to measure clicks and traffic moving from cart to checkout.  A good hypothesis will state after the word 'then' what the measurable outcome should be.
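To make that concrete with the cart-to-checkout example (the visit counts here are invented placeholders), the measurement itself is simple arithmetic - the proportion of cart views that go on to start checkout, for each recipe:

# Sketch: the metric the hypothesis points at - cart-to-checkout progression.
control_cart_views, control_checkout_starts = 12_000, 840
test_cart_views,    test_checkout_starts    = 11_800, 920

control_rate = control_checkout_starts / control_cart_views   # 7.0%
test_rate    = test_checkout_starts / test_cart_views          # ~7.8%

print(f"Control: {control_rate:.1%}  Test: {test_rate:.1%}  "
      f"Relative uplift: {(test_rate / control_rate - 1):+.1%}")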

In my chocolate example above, it's clear that eating chocolate will make me either run faster or slower, so I'll be measuring my running speed.  Neither hypothesis (the cart or the chocolate) has specified how big the change is.  If I knew how big the change was going to be, I wouldn't test.  Also, I haven't said how much more chocolate I'm going to eat, or how much faster I'll run, or how much bigger the CTA buttons should be, or how much more traffic I'll convert.  That's the next step - the test execution.  For now, the hypothesis is general enough to allow for the details to be decided later, but it frames the idea clearly and supports it with a reason why.  Of course, the hypothesis may give some indication of the detailed measurements - I might be looking at increasing my consumption of chocolate by 100 g (about 4 oz) per day, and measuring my running speed over 100 metres (about 100 yds) every week.

Tests with a HIPPOthesis

The HIPPO, for reference, is the HIghest Paid Person's Opinion (or sometimes just the HIghest Paid PersOn).  The boss.  The management.  Those who hold the budget control, who decide what's actionable, and who say what gets done.  And sometimes, what they say is that, "You will test this".  There's virtually no rationale, no data, no evidence or anything.  Just a hunch (or even a whim) from the boss, who has a new idea that he likes.  Perhaps he saw it on Amazon, or read about it in a blog, or his golf partner mentioned it on the course over the weekend.  Whatever - here's the idea, and it's your job to go and test it.

These tests are likely to be completely variable in their design.  They could be good ideas, bad ideas, mixed-up ideas or even amazing ideas.  If you're going to run the test, however, you'll have to work out (or define for yourself) what the underlying hypothesis is.  You'll also need to ask the HIPPO - very carefully - what the success metrics are.  Be prepared to pitch this question somewhere between, "So, what are you trying to test?" and "Are you sure this is a productive use of the highly skilled people that you have working for you?"  Any which way, you'll need the HIPPO to determine the success criteria, or agree to yours - in advance.  If you don't, you'll end up with a disastrous recipe being declared a technical winner because it (1) increased time on page, (2) increased time on site or (3) drove more traffic to the Contact Us page, none of which were the intended success criteria for the test, or were agreed up-front, and which may not be good things anyway.

If you have to run a test with a HIPPOthesis, then write your own hypothesis and identify the metrics you're going to examine.  You may also want to try and add one of your own recipes which you think will solve the apparent problem.  But at the very least, nail down the metrics...

Tests with a hippiethesis
Hippie:  noun
a person, especially of the late 1960s, who rejected established institutions and values and sought spontaneity, etc., etc.  Also hippy

The final type of test idea is a hippiethesis - laid back, not too concerned with details, spontaneous and putting forward an idea because it looks good on paper.  "Let's test this because it's probably a good idea that will help improve site performance."  Not as bad as the "Test this!" that drives a HIPPOthesis, but not fully formed as a hypothesis, the hippiethesis is probably (and I'm guessing) the most common type of test.

Some examples of hippietheses:


"If we make the product images better, then we'll improve conversion."
"The data shows we need to fix our conversion funnel - let's make the buttons blue  instead of yellow."
"Let's copy Amazon because everybody knows they're the best online."

There's the basis of a good idea somewhere in there, but it's not quite finished.  A hippiethesis will tell you that the lack of a good idea is not a problem, buddy, let's just test it - testing is cool (groovy?), man!  The results will be awesome.  

There's a laid-back approach to the test (either deliberate or accidental), where the idea has not been thought through - either because "You don't need all that science stuff", or because the evidence to support a test is very flimsy or even non-existent.  Perhaps the test analyst didn't look for the evidence; perhaps he couldn't find any.  Maybe the evidence is mostly there somewhere because everybody knows about it, but isn't actually documented.  The danger here is that when you (or somebody else) start to analyse the results, you won't recall what you were testing for, what the main idea was or which metrics to look at.  You'll end up analysing without purpose, trying to prove that the test was a good idea (and you'll have to do that before you can work out what it was that you were actually trying to prove in the first place).

The main difference between a hypothesis and a hippiethesis is the WHY.  Online testing is a science, and scientists are curious people who ask why.  Web analyst Avinash Kaushik calls it the three levels of the "so what" test.  If you can't get to something meaningful and useful, or in this case, testable and measurable, within three iterations of "Why?" then you're on the wrong track.  Hippies don't bother with 'why' - that's too organised, formal and part of the system; instead, they'll test because they can, and because - as I said - testing is groovy.

A good hypothesis:  IF, THEN, BECAUSE.

To wrap up:  a good hypothesis needs three things:  If (I make this change to the site) Then (I will expect this metric to improve) because (of a change in visitor behaviour that is linked to the change I made, based on evidence).


When there's no if:  you aren't making a change to the site, you're just expecting things to happen by themselves.  Crazy!  If you reconsider my chocolate hypothesis, without the if, you're left with, "I will run faster and I will have more energy".  Alternatively, "More people will click and we'll sell more."  Not a very common attitude in testing, and more likely to be found in over-optimistic entrepreneurs :-)

When there's no then:  If I eat more chocolate, I will have more energy.  So what?  And how will I measure this increased energy?  There are no metrics here.  Am I going to measure my heart rate, blood pressure, blood sugar level or body temperature??  In an online environment:  will this improve conversion, revenue, bounce rate, exit rate, time on page, time on site or average number of pages per visit?  I could measure any one of these and 'prove' the hypothesis.  At its worst, a hypothesis without a 'then' would read as badly as, "If we make the CTA bigger, [then we will move more people to cart], [because] more people will click." which becomes "If we make the CTA bigger, more people will click."  That's not a hypothesis, that's starting to state the absurdly obvious.


When there's no because:  If I eat more chocolate, then I will run faster.  Why?  Why will I run faster?  Will I run slower?  How can I run even faster?  There are metrics here (speed) but there's no reason why.  The science is missing, and there's no way I can actually learn anything from this and improve.  I will execute a one-off experiment and get a result, but I will be none the wiser about how it happened.  Was it the sugar in the chocolate?  Or the caffeine?

And finally, I should reiterate that an idea for a test doesn't have to be detailed, but it must be backed up by data (some, even if it's not great).  The more evidence the better:  think of a sliding scale from no evidence (could be a terrible idea), through to some evidence (a usability review, or a survey response, prior test result or some click-path analysis), through to multiple sources of evidence all pointing the same way - not just one or two data points, but a comprehensive case for change.  You might even have enough evidence to make a go-do recommendation (and remember, it's a successful outcome if your evidence is strong enough to prompt the business to make a change without testing).

Wednesday, 3 July 2013

Getting an Online Testing Program Off The Ground

One of the unplanned topics from one of my xChange 2013 huddles was how to get an online testing program up and running, and how to build its momentum.  We were discussing online testing more broadly, and this subject came up.  Getting a test program up and running is not easy, but during our discussion a few useful hints and tips emerged, and I wanted to add to them here.
Sometimes, launching a test program is like defying gravity.

Selling plain web analytics isn't easy, but once you have a reporting and analytics program up and running, and you're providing recommendations which are supported by data and seeing improvements in your site's performance, then the next step will probably be to propose and develop a test.  Why test?


On the one hand, if your ideas and recommendations are being wholeheartedly received by the website's management team, then you may never need to resort to a test.  If you can show with data (and other sources, such as survey responses or other voice-of-customer sources) that there's an issue on your site; if you can use your reporting tools to show what the problem probably is; and if you can then get the site changed based on your recommendations and see an improvement - then you don't need to test.  Just implement!

However, you may find that you have a recommendation, backed by data, that doesn't quite get universal approval. How would the conversation go?

"The data shows that this page needs to be fixed - the issue is here, and the survey responses I've looked at show that the page needs a bigger/smaller product image."
"Hmm, I'm not convinced."

"Well, how about we try testing it then?  If it wins, we can implement; if not, we can switch it off."
"How does that work, then?"


The ideal 'we love testing' management meeting.

This is idealised, I know.  But you get the idea, and then you can go on to explain the advantages of testing compared to having to implement and then roll back (when the sales figures go south).


The discussions we had during xChange showed that most testing programs were being initiated by the web analytics team - there were very few (or no) cases where management started the discussion or wanted to run a test.  As web professionals, supporting a team with sales and performance targets, we need to be able to use all the online tools available to us - including testing - so it's important that we know how to sell testing to management, and get the resources that it needs.  From management's perspective, analytics requires very little support or maintenance (compared to testing) - you tag the site (once, with occasional maintenance) and then pay any subscriptions to the web analytics provider, and pay for the staff (whether that's one member of staff or a small team).  Then - that's it.  No additional resource needed - no design, no specific IT, no JavaScript developers (except for the occasional tag change, maybe).  And every week, the mysterious combination of analyst plus tags produces a report showing how sales and traffic figures went up, down or sideways.

By contrast, testing requires considerable resource.  The design team will need to provide imagery and graphics, guidance on page design and so on.  The JavaScript developers will need to put mboxes (or the test code) around the test area; the web content team will also need to understand the changes and make them as necessary.  And that's just for one test.  If you're planning to build up a test program (and you will be, in time) then you'll need to have the support teams available more frequently.  So - what are the benefits of testing?  And how do you sell them to management, when they're looking at the list of resources that you're asking for?

How to sell testing to management

1.  Testing provides the opportunity to try before you buy: test something that the business is already thinking of changing.  A change of banners?  A new page layout?  As an analyst, you'll need to be ahead of the change curve to do this, and aware of changes before they happen, but if you get the opportunity then propose to test a new design before it goes live.  This has the advantage that most of the resource overhead is already taken into account (you don't need to design the new banner/page), but it has one significant disadvantage: you're likely to find that there's a major bias towards the new design, and management may just go ahead and implement anyway, even if the test shows negative results for it.

2.  A good track record of analytics wins will support your case for testing.  You don't have to go back to prior analysis or recommendations and be as direct as, "I told you so," but something like, "The changes made following my analysis and recommendations on the checkout pages have led to an improvement in sales conversion of x%." is likely to be more persuasive.  And this brings me neatly on to my next suggestion.

3.  Your main aim in selling testing is to ensure you can  get the money for testing resources, and for implementation.  As I mentioned above, testing takes time, resource and expertise - or, to put it another way, money.  So you'll need to persuade the people who hold the money that testing is a worthwhile investment.  How?  By showing a potential return on that investment.

"My previous recommendation was implemented and achieved a £1k per week increase in revenue.  Additionally, if this test sees a 2% lift in conversion, that will be equal to £3k per week increase in revenue."

It's a bit of a gamble, as I've mentioned previously in discussing testing - you may not see a 2% lift in conversion, it may go flat or negative.  But the main focus for the web channel management is going to be money:  how can we use the site to make more money?  And the answer is: by improving the site.  And how do we know if we're improving the site? Because we're testing our ideas and showing that they're better than the previous version.
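The arithmetic behind that kind of pitch is straightforward - here's a back-of-the-envelope sketch with invented inputs (plug in your own traffic, conversion and order-value figures):

# Sketch: translate a hoped-for conversion lift into weekly revenue.
# All inputs are invented placeholders.
weekly_visits   = 50_000
conversion_rate = 0.03        # current visit-to-order conversion
avg_order_value = 100.0       # in whatever currency management speaks

relative_lift = 0.02          # the hoped-for 2% relative lift in conversion
extra_orders  = weekly_visits * conversion_rate * relative_lift
extra_revenue = extra_orders * avg_order_value

print(f"Roughly {extra_orders:.0f} extra orders and "
      f"{extra_revenue:,.0f} extra revenue per week")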

You do have the follow-up argument (if it does win): "If you don't implement this test win, it will cost..." - because there, you'll know exactly what the uplift is and you'll be able to present some useful financial data (assuming that yesterday's winner is not today's loser!).  Talk about £, $ or Euros... sometimes, it's the only language that management speak.


4.  Don't be afraid to carry out tests on the same part of a page.  I know I've covered this previously - but it reduces your testing overhead, and it also forces you to iterate.  It is possible to test the same part of a page without repeating yourself.  You will need a test program, because you'll be testing on the same part of a page and you'll need to consult your previous tests (winners, losers and flat results) to make sure you don't repeat them.  And along the way, you'll have the chance to look at why a test won, or didn't, and try to improve.  That is iteration, and iteration is a key step from just testing to having a test program.

5.  Don't be afraid to start by testing small areas of a page.  Testing full-page redesigns is lengthy, laborious and risky.  You can get plenty of testing mileage out of testing completely different designs for a small part of a page - a banner, an image, wording... remember that testing is a management expense for the time being, not an investment, and you'll need to keep your overheads low and have good potential returns (either financial, or learning, but remember that management's primary language is money).


6.  Document everything!  As much as possible - especially if you're only doing one or two tests at a time.  Ask the code developers to explain what they've done, what worked, what issues they faced and how they overcame them.  It may be all code to you, but in a few months' time, when you're talking to a different developer who is not familiar with testing and test code, your documentation may be the only thing that keeps your testing program moving.

Also - and I've mentioned this before - document your test designs and your results.  Even if you're the only test analyst in your company, you'll need a reference library to work from, and one day, you might have a colleague or two and you'll need to show them what you've done before.

So, to wrap up - remember - it's not a problem if somebody agrees to implement a proposed test.  "No, we won't test that, we'll implement it straight away."  You made a compelling case for a change - subsequently, you (representing the data) and management (representing gut feeling and intuition) agreed on a course of action.  Wins all round.

Setting up a testing program and getting management involvement requires some sales technique, not just data and analysis, so it's often outside the analyst's usual comfort zone. However, with the right approach to management (talk their language, show them the benefits) and a small but scalable approach to testing, you should - hopefully - be on the way to setting up a testing program, and then helping your testing program to gain momentum.


Similar posts I've written about online testing

How many of your tests win?
The Hierarchy of A/B Testing