Thursday, 9 January 2014

When Good Tests Fail

Seth Godin, online usability expert recently stated simply that,  'The answer to the question, "What if I fail?" is "You will."  The real question is, "What after I fail?"'

Despite rigorous analytics, careful usability studies and thoughtful designing, the results from your latest A/B test are bad.  Conversion worsened; average order value plummeted and people bounced off your home page like it was a trampoline.  Your test failed.  And, if you're taking it personally (and most online professionals do take it very personally), then you failed too.

But, before the boss slashes your optimisation budget, you have the opportunity to rescue the test, by reviewing all the data and understanding the full picture.  Your test failed - but why?  I've mentioned before that tests which fail draw far more attention than those which win - it's just human nature to explore why something went wrong, and we like to attribute blame or responsibility accordingly.  That's why I pull apart my Chess games to find out why I lost.  I want to improve my Chess (I'm not going to stop playing, or fire myself from playing Chess).

So, the boss asks the questions- Why did your test fail?  (And it's suddenly stopped being his test, or our test... it's yours).  Where's the conversion uplift we expected?  And why aren't profits rising?

It's time to review the test plan, the hypothesis and the key questions. Which of these apply to your test?

Answer 1.  The hypothesis was not entirely valid. I have said before that, "If I eat more chocolate, I'll be able to run faster because I will have more energy."  What I failed to consider is the build up of fat in my body, and that eating all that chocolate has made me heavier, and hence I'm actually running more slowly.  I'm not training enough to convert all that fat into movement, and the energy is being stored as fat.

Or, in an online situation:  the idea was proved incorrect.  Somewhere, one of the assumptions that was made was wrong.  This is where the key test questions come in.  The analysis that comes from answering these key questions will help retrieve your test from 'total failure' to 'learning experience'.

Sometimes, in an online context, the change we made in the test had an unforeseen side-effect.  We thought we were driving more people from the product pages to the cart, but they just weren't properly prepared.  We had the button at the bottom of the page, and people who scrolled to the bottom of the page saw the full specs of the new super-toaster and how it needs an extra battery-pack for super-toasting.  We moved the button up the page, more people clicked on it, but realised only at the cart page that it needed the additional battery pack.  We upset more people than we helped, and overall conversion went down.

Unforeseen side-effects in testing leading to adverse performance: 
too much chocolate slows down 100m run times due to increased body mass
Answer 2.  The visual design of the test recipe didn't address the test hypothesis or the key test questions.  In any lab-based scientific experiment, you would expect to set up the apparatus and equipment and take specific measurements based on the experiment you were doing.  You would also set up the equipment to address the hypothesis - otherwise you're just messing about with lab equipment.  For example, if you wanted to measure the force of gravity and how it affects moving objects, you wouldn't design an experiment with a battery, a thermometer and a microphone. 

However, in an online environment, this sort of situation becomes possible, because different people possess the skills required to analyse data and the skills to design banners etc, and the skills to write the HTML or JavaScript code.  The analyst, the designer and the developer need to work closely together to make sure that the test design which hits the screen is going to answer the original hypothesis, and not something else that the designer believes will 'look nice' or that the developer finds easier to code.  Good collaboration between the key partners in the testing process is essential - if the original test idea doesn't meet brand guidelines, or is extremely difficult to code, then it's better to get everybody together and decide what can be done that will still help prove or disprove the hypothesis.

To give a final example from my chocolate-eating context, I wouldn't expect to prove that chocolate makes me run faster by eating crisps (potato chips) instead.  Unless they were chocolate-coated crips?  Seriously.

Answer 3.  Sometimes, the test design and execution was perfect, and we measured the right metrics in the right way.  However, the test data shows that our hypothesis was completely wrong.  It's time to learn something new...!

My hypothesis said that chocolate would make me run faster; but it didn't.  Now, I apologise that I'm not a biology expert and this probably isn't correct, but let's assume it is, review the 'data' and find out why.  

For a start, I put on weight (because chocolate contains fat), but worse still, the sugar in chocolate was also converted to fat, and it wasn't converted back into sugar quickly enough for me to benefit from it while running the 100 metres.  Measurements of my speed show I got slower, and measurements of my blood sugar levels before and after the 100 metres showed that the blood sugar levels fell, because the fat in my body wasn't converted into glucose and transferred to my muscles quickly enough.  Additionally, my body mass rose 3% during the testing period, and further analysis showed this was fat, not muscle.  This increased mass also slowed me down.

Back to online:  you thought people would like it if your product pages looked more like Apple's.  But Apple sell a limited range of products - one phone, one MP3 player, one desktop PC, etc. while you sell 15-20 of each of those, and your test recipe showed only one of your products on the page (the rest were hidden behind a 'View More' link), when you get better financial performance from a range of products.  Or perhaps you thought that prompting users to chat online would help them go through checkout... but you irritated them and put them off.  Perhaps your data showed that people kept leaving your site to talk to you on the phone.  However, when you tested hiding the phone number, in order to get people to convert online, you found that sales through the phone line went down, as expected, but your online sales also fell because people were using the phone line for help completing the online purchase.  There are learnings in all cases that you can use to improve your site further - you didn't fail, you just didn't win ;-)

In conclusion Yes, sometimes test recipes lose.  Hypotheses were incorrect, assumptions were invalid, side-effects were missed and sometimes the test just didn't ask the question it was meant to.  The difference between a test losing and a test failing is in the analysis, and that comes from planning - having a good hypothesis in the first place, and asking the right questions up front which will show why the test lost (or, let's not forget, the reason why a different test won).  Until then, fail fast and learn quickly!

Tuesday, 7 January 2014

The Key Questions in Online Testing

As you begin the process of designing an online test, the first thing you'll need is a solid test hypothesis.  My previous post outlined this, looking at a hypothesis, HIPPOthesis and hippiethesis.  To start with a quick recap, I explained that a good hypothesis says something like, "IF we make this change to our website, THEN we expect to see this improvement in performance BECAUSE we will have made it easier for visitors to complete their task."  Often, we have a good idea about what the test should be - make something bigger, have text in red instead of black... whatever.  

Stating the hypothesis in a formal way will help to draw the ideas together and give the test a clear purpose.  The exact details of the changes you're making in the test, the performance change you expect, and the reasons for the expected changes will be specific to each test, and that's where your web analytics data or usability studies will support your test idea.  For example, if you're seeing a large drop in traffic between the cart page and the checkout pages, and your usability study shows people aren't finding the 'continue' button, then your hypothesis will reflect this.

In between the test hypothesis and the test execution are the key questions.  These are the key questions that you will develop from your hypothesis, and which the test should answer.  They should tie very closely to the hypothesis, and they will direct the analysis of your test data, otherwise you'll have test data that will lack a focus and you'll struggle to tell the story of the test.  Think about what your test should show - what you'd like it to prove - and what you actually want to answer, in plain English.

Let's take my offline example from my previous post.  Here's my hypothesis:  "If I eat more chocolate, then I will be able to run faster because I will have more energy."

It's good - but only as a hypothesis (I'm not saying it's true, or accurate, but that's why we test!).  But before I start eating chocolate and then running, I need to confirm the exact details of how much chocolate, what distance and what times I can achieve at the moment.  If this was an ideal offline test, there would be two of me, one eating the chocolate, and one not.  And if it was ideal, I'd be the one eating the chocolate :-)

So, the key questions will start to drive the specifics of the test and the analysis.  In this case, the first key question is this:  "If I eat an additional 200 grams of chocolate each day, what will happen to my time for running the 100 metres sprint?"

It may be 200 grams or 300 grams; the 100m or the 200m, but in this case I've specified the mass of chocolate and the distance.  Demonstrating the 'will have more energy' will be a little harder to do.  In order to do this, I might add further questions, to help understand exactly what's happening during the test - perhaps questions around blood sugar levels, body mass, fat content, and so on.  Note at this stage that I haven't finalised the exact details - where I'll run the 100 metres, what form the chocolate will take (Snickers? Oreos? Mars?), and so on.  I could specify this information at this stage if I needed to, or I could write up a specific test execution plan as the next section of my test document.

In the online world I almost certainly will be looking at additional metrics - online measurements are rarely as straightforward as offline.  So let's take an online example and look at it in more detail.

"If I move the call-to-action button on the cart page to a position above the fold, then I will drive more people to start the checkout process because more people will see it and click on it."

And the key questions for my online test?

"How is the click-through rate for the CTA button affected by moving it above the fold?"
"How is overall cart-to-complete conversion affected by moving the button?"
"How are these two metrics affected if the button is near the top of the page or just above the fold?"

As you can see, the key questions specify exactly what's being changed - maybe not to the exact pixel, but they provide clear direction for the test execution.  They also make it clear what should be measured - in this case, there are two conversion rates (one at page level, one at visit level).  This is perhaps the key benefit of asking these core questions:  they drive you to the key metrics for the test.

"Yes, but we want to measure revenue and sales for our test."

Why?  Is your test meant to improve revenue and sales?  Or are you looking to reduce bounce rate on a landing page, or improve the consumption of learn content (whitepapers, articles, user reviews etc) on your site?  Of course, your site's reason-for-being is to general sales and revenue.  Your test data may show a knock-on improvement on revenue and sales, and yes, you'll want to make sure that these vital site-wide metrics don't fall off a cliff while you're testing, but if your hypothesis says, "This change should improve home page bounce rate because..." then I propose that it makes sense to measure bounce rate as the primary metric for the test success.  I also suspect that you can quickly tie bounce rate to a financial metric through some web analytics - after all, I doubt that anyone would think of trying to improve bounce rate without some view of how much a successful visitor generates.

So:  having written a valid hypothesis which is backed by analysis, usability or other data (and not just a go-test-this mentality from the boss), you are now ready to address the critical questions for the test.  These will typically be, "How much....?" and "How does XYZ change when...?" questions that will focus the analysis of the test results, and will also lead you very quickly to the key metrics for the test (which may or may not be money-related).

I am not proposing to pack away an extra 100 grams of chocolate per day and start running the 100 metres.  It's rained here every day since Christmas and I'm really not that dedicated to running.  I might, instead, start on an extra 100 grams of chocolate and measure my body mass, blood cholesterol and fat content.  All in the name of science, you understand. :-)

Monday, 6 January 2014

Chess: King's Gambit 1. e4 e5 2. f4

After my most recent post, where I played as Black against the Bird Opening 1. f4, in this post I'd like to cover another game where I again faced White playing an f4 opening - in this case, the King's Gambit.  I played this against one of my Kidsgrove Chess Club team mates, and this time, I lost.  Badly.  I had just suffered a difficult 32-move defeat against the team's top player, and I will make the excuse that I wasn't playing my best here.  I will cover that game in a later blog post.

I'm covering this game here, because interestingly, in the previous game where I played against 1. f4, I won by playing Qh4+ early on in the game.  If I'd been more aware, I might have seen it here, too.

David Johnson vs Dave Leese, 17 December 2013 Kidsgrove Chess Club (Friendly)

1.e4 e5
2.f4 Nc6

It's unusual to have a critical point in a game so early, but here it is.  I played the natural recapture with Nxe5 and slowly got into all sorts of trouble.  I missed the Qh4+ move that I played (and won with) just two weeks earlier.

3. ... Nxe5 ( 3...Qh4+ )

Let's look at 3. ... Qh4+ before reviewing the actual game in full. There are two replies - to move the King to e2 or to block with a Pawn on g3.

3. ... Qh4+
4.  Ke2 Qxe4+
5.  Kf2 Bc5+

If then 6. Kg3, then Chessmaster 9000 (my preferred analysis tool) is already giving a mate in 8, starting with 6. ... h5.  The other option is 6. d4 and this is going to mean a significant loss of material - 6 ... Bxd4+ 7. Qxd4 Qxd4+ and White has delayed the inevitable at the cost of his Queen.

However, I completely missed this overwhelming attack, and instead went through a painful game where I fell into all sorts of trouble.  Let's resume after 3. ... Nxe5

4.d4 Ng6 (I could still have played ... Qh4+, or ... Bb4+ at this point).
5.Nf3 d6 (there are no more chances of ... Qh4+ now, and Chessmaster recommends ... d5).
6.Bc4 h6 (preventing Ng5 and an attack on f7 - in theory, anyway).
7.O-O Bg4?  (a big mistake, as we shall see.  Better was Nf6 or Be6... something - anything - to protect f7.  White's decision to castle was not just a natural move at this stage, it moved the Rook onto the semi-open and dangerous f-file).
The position after 7. ... Bg4 and immediately before Bxf7+!
8.Bxf7+! Ke7 ( not 8...Kxf7 which leads to Ng5++ and Nf7 forking Queen and Rook)
9.Bxg6 Nf6 (finally!)
10.Nc3 c6 (opening a diagonal for my Queen, and providing space for my King)
11.Qe1 Kd7 (taking advantage of the slow pace of the game to improve my King and Queen)
12.e5 Nd5
13.Nxd5 cxd5
White played Qg3? and missed exd6 with a large material gain.
14.Qg3? Be6 (my wayward Bishop finally gets a decent square, even if the Pawn on f7 is gone)
15.Nh4 Qb6 (developing, and attacking the newly-unprotected d4 Pawn)
16.c3 Be7 (perhaps ... dxe5 was better)
17.Nf5 Rhf8
18.Nxe7 Rxf1+
19.Kxf1 Rf8+ (No, I'm not sure why I threw this in.  I needed as many pieces as possible on the board, but at least I got rid of White's active Rook in exchange for my inactive one).
20.Kg1 Kxe7
21.exd6+ Kd7
22.Qe5 Qxd6 (offering an exchange, but also protecting the Rook on f8).
23.Qxg7+ Kc6
24.Qe5 Qxe5
25.dxe5 Rg8
( 25...Bh3 26.gxh3 Rg8 27.Bf4 Rxg6+ )
26.Bh5 Bh3
27.Bf3 Bf5
28.Bxh6 1-0

The final position.  I've run out of ideas, I'm a piece and three pawns down, and I've had enough!  Seeing afterwards that I missed several opportunities for a massive attack in the first few moves, has made me more confident in my attacking options, and how I missed my opponent's attack developing (in this game and the previous one) has made me even more aware of the need to defend accurately too.  Yes I put up a fight, but really I was defending a lost cause due to some daft blunders.  On with the next game!

Friday, 3 January 2014

Chess: Bird Opening 1 f4 2 e6

After a long break from Chess writing, I'm returning with an analysis of a game I recently played face-to-face.  I've played almost exclusively online for a number of years, and recently decided it was time to start playing 'real' people.  I've joined Kidsgrove Chess Club, and signed up to the English Chess Federation in order to obtain a ranking through the games I play.

My first game in this new face-to-face era is a friendly - the rest of the club were involved in a match against Stafford Chess Club, and I was able to play against one of the Stafford team after he completed his game.

I played Black, and for possibly the first time ever, I faced the Bird Opening, 1 f4.  My reply was 1. ... e6, intending at some early point to get my Queen to h4 and deliver a potentially uncomfortable check. 

David Barker vs Dave Leese, 4 December 2013, Kidsgrove Chess Club (Home, Friendly)

1.f4 e6
2.b3 d5
3.Bb2 Nf6
4.e3 c5

If there is space in the centre to be claimed, I'll claim it.  My opponent seems to be playing for a very slow build-up.  I am resisting my urge to play my normal, attacking game and am playing cautiously - after all, I don't want to lose quickly in my very first game in front of my new team-mates. Also, I am concerned about the white Bishop on b2 and the way it looks into my kingside - I brought out my Knight to f3 in order to liberate my Bishop from having to defend h7.  In time, I may play d4 and look to shut White's Bishop out of the game.

5.Nf3 Nc6
6.Be2 Bd6

 In this position, I opted to play 7. ... O-O.  I don't want to capture on d4 - White could recapture with his Bishop or his Knight and start to develop a grip on the centre.  Additionally, I also have a brief tactic developing where I begin to attack the now-backward pawn on e3.  Here's what I was thinking at that time:

7. ... O-O
8. dxc5  Bxc5 attacking e3
9. Nd4  Nxd4

Now after:  10. exd4 White has weakened pawns, and I can attack them with 10. ... Bd6 11.  O-O Qc7, and I have the options of moving my Knight on f6 to an even better square.

The position after my theoretical 11. ... Qc7
However, the game didn't proceed that way at all, and we resume the game after my move 7. ... O-O

8.  Ne5?  Ne4

I was very surprised by my opponent's decision to play Ne5.  This Knight is the only piece protecting the dark squares on the kingside and preventing me from getting in a Qh4+ and hopefully initiating a king-side attack at some point.  I moved my Knight to e4 in order to open the diagonal for my Queen, and also to take a look at f2, the weak spot in white's position.  It's also a great outpost for my Knight, as my opponent has played f4 and d4.

9.  Nd2  Qh4+  0-1

I suspect my opponent saw the threat of my Knight on e4, but missed the Queen check, and after a moment's thought, resigned the game.  The immediate threat is 10. g3 Nxg3 11. Rg1 Qxh2 or 11. Ndf3 Qh5 which wins a pawn and starts a moderate king-side attack (preventing White from castling king-side and causing longer term complications).  I was surprised at the early resignation, but pleased that my first face-to-face game in front of my new team-mates was a win.

The position after 9 ... Qh4+ and White resigned.

We took back White's ninth move, and instead White played 9. O-O, and play continued:

9.  O-O  f6?
10.Nxc6 bxc6

11.Nd2 Nxd2  (I'm not sure I should have given up my Knight on this great outpost, but I suspect my opponent would have swapped them off anyway).
12.Qxd2 cxd4

I took this pawn in order to straighten out my own doubled pawns.

13.Bxd4 c5
14.Bb2 Bb7
15.Rad1 Qb6?

I missed the gathering threat on the d-file.  I moved my Queen to an attacking position, planning to advance my c-pawn and expose an attack on the diagonal to the King, and missed the attack on my own d-pawn. Following this, I got into a potentially very messy position where I could have lost at least a pawn.

16.c4 Qc6
17.Bf3 Qa6
18.Bc3 Bc6
19. e4?

After a lot of shuffling around (which surprised me, I was sure my position was going to fall apart) my opponent played 19. e4.  I was lining up my Bishop and Queen so that I could re-capture on d4 with my Bishop before my Queen.  One benefit of 17... Qa6 was the attack on a2, which required a defence, but otherwise, I was scrambling around for acceptable moves until this point.  19. e4 gave me the chance I needed to reinforce my position and get out of trouble, and I played this move very quickly.

19.... d4
20.Ba1 Rab8
(moving onto the semi-open file)
21.Qe2? Bxf4
22.Bg4 Be3+
23.Kh1 Bxe4

My opponent later said he wasn't having a great night - he'd previously played another member of the Kidsgrove team and lost, and I guess he was becoming tired.  Or just having a bad day, but I had moved two pawns ahead through two blunders (although 22. ... Be3+ is one of my favourite moves of this game).  After 23 ... Bxe4 I was two pawns up and they were connected passed pawns on the d- and e- files.  There were a few exchanges made as I started to trade off pieces, then I began advancing my passed pawns and getting my rooks involved.  After move 29, my opponent resigned, as I completed my defence and started looking to push my passed pawns.

24.Bf3 Bxf3
25.Rxf3 e5
26.Rg3 Rfe8
27.Qc2 e4
28.Qe2 Bf4
29.Rh3 Rbd8 0-1

The final position, after 29 Rbd8.

Not a perfect game, and I made a few blunders throughout, but held together and took the opportunities.  We discussed possible continuations, and Qh5 and Qxc5 look like a good start for White, but Black can reply with Qxa2, and Black's passed pawns present a continued threat.  All in all, an interesting game, and an enjoyable return to face-to-face Chess.