
Tuesday, 7 January 2014

The Key Questions in Online Testing

As you begin the process of designing an online test, the first thing you'll need is a solid test hypothesis.  My previous post outlined this, looking at a hypothesis, HIPPOthesis and hippiethesis.  To start with a quick recap, I explained that a good hypothesis says something like, "IF we make this change to our website, THEN we expect to see this improvement in performance BECAUSE we will have made it easier for visitors to complete their task."  Often, we have a good idea about what the test should be - make something bigger, have text in red instead of black... whatever.  

Stating the hypothesis in a formal way will help to draw the ideas together and give the test a clear purpose.  The exact details of the changes you're making in the test, the performance change you expect, and the reasons for the expected changes will be specific to each test, and that's where your web analytics data or usability studies will support your test idea.  For example, if you're seeing a large drop in traffic between the cart page and the checkout pages, and your usability study shows people aren't finding the 'continue' button, then your hypothesis will reflect this.

In between the test hypothesis and the test execution are the key questions.  These are the questions that you develop from your hypothesis, and which the test should answer.  They should tie very closely to the hypothesis, and they will direct the analysis of your test data; otherwise you'll have data that lacks focus, and you'll struggle to tell the story of the test.  Think about what your test should show - what you'd like it to prove - and what you actually want to answer, in plain English.

Let's take my offline example from my previous post.  Here's my hypothesis:  "If I eat more chocolate, then I will be able to run faster because I will have more energy."

It's good - but only as a hypothesis (I'm not saying it's true, or accurate, but that's why we test!).  But before I start eating chocolate and then running, I need to confirm the exact details of how much chocolate, what distance and what times I can achieve at the moment.  If this was an ideal offline test, there would be two of me, one eating the chocolate, and one not.  And if it was ideal, I'd be the one eating the chocolate :-)

So, the key questions will start to drive the specifics of the test and the analysis.  In this case, the first key question is this:  "If I eat an additional 200 grams of chocolate each day, what will happen to my time for running the 100 metres sprint?"

It may be 200 grams or 300 grams; the 100m or the 200m, but in this case I've specified the mass of chocolate and the distance.  Demonstrating the 'will have more energy' will be a little harder to do.  In order to do this, I might add further questions, to help understand exactly what's happening during the test - perhaps questions around blood sugar levels, body mass, fat content, and so on.  Note at this stage that I haven't finalised the exact details - where I'll run the 100 metres, what form the chocolate will take (Snickers? Oreos? Mars?), and so on.  I could specify this information at this stage if I needed to, or I could write up a specific test execution plan as the next section of my test document.



In the online world I almost certainly will be looking at additional metrics - online measurements are rarely as straightforward as offline.  So let's take an online example and look at it in more detail.

"If I move the call-to-action button on the cart page to a position above the fold, then I will drive more people to start the checkout process because more people will see it and click on it."

And the key questions for my online test?

"How is the click-through rate for the CTA button affected by moving it above the fold?"
"How is overall cart-to-complete conversion affected by moving the button?"
"How are these two metrics affected if the button is near the top of the page or just above the fold?"


As you can see, the key questions specify exactly what's being changed - maybe not to the exact pixel, but they provide clear direction for the test execution.  They also make it clear what should be measured - in this case, there are two conversion rates (one at page level, one at visit level).  This is perhaps the key benefit of asking these core questions:  they drive you to the key metrics for the test.
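To make those two metrics concrete, here's a minimal sketch in Python of how they might be calculated from raw counts.  Every variable name and figure below is invented for illustration - none of it comes from a real test.

# Illustrative only: invented counts for one recipe of a cart-to-checkout test.
cart_page_views  = 12500   # views of the cart page
cta_clicks       = 4100    # clicks on the call-to-action button
cart_visits      = 11800   # visits that included the cart page
completed_orders = 950     # of those visits, how many completed checkout

# Page-level metric: click-through rate on the CTA button.
cta_click_through_rate = cta_clicks / cart_page_views

# Visit-level metric: cart-to-complete conversion.
cart_to_complete_rate = completed_orders / cart_visits

print(f"CTA click-through rate: {cta_click_through_rate:.1%}")
print(f"Cart-to-complete rate:  {cart_to_complete_rate:.1%}")

Run per recipe, these are the two figures that the key questions ask you to compare.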

"Yes, but we want to measure revenue and sales for our test."


Why?  Is your test meant to improve revenue and sales?  Or are you looking to reduce bounce rate on a landing page, or improve the consumption of learn content (whitepapers, articles, user reviews etc) on your site?  Of course, your site's reason-for-being is to generate sales and revenue.  Your test data may show a knock-on improvement in revenue and sales, and yes, you'll want to make sure that these vital site-wide metrics don't fall off a cliff while you're testing, but if your hypothesis says, "This change should improve home page bounce rate because..." then I propose that it makes sense to measure bounce rate as the primary metric for test success.  I also suspect that you can quickly tie bounce rate to a financial metric through some web analytics - after all, I doubt that anyone would think of trying to improve bounce rate without some view of how much a successful visitor generates.
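As a rough sketch of that bounce-rate-to-money link, the arithmetic might look like this in Python.  Every figure here is invented - you would substitute your own analytics data.

# Hypothetical back-of-the-envelope figures - substitute your own analytics data.
weekly_entries          = 50_000   # entries to the landing page per week
bounce_rate_before      = 0.62     # current bounce rate
bounce_rate_after       = 0.58     # bounce rate we hope the test recipe achieves
value_per_engaged_visit = 1.20     # average revenue per non-bounced visit, in £

extra_engaged_visits    = weekly_entries * (bounce_rate_before - bounce_rate_after)
estimated_weekly_uplift = extra_engaged_visits * value_per_engaged_visit

print(f"Extra engaged visits per week: {extra_engaged_visits:,.0f}")
print(f"Estimated revenue uplift: £{estimated_weekly_uplift:,.2f} per week")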

So:  having written a valid hypothesis which is backed by analysis, usability or other data (and not just a go-test-this mentality from the boss), you are now ready to address the critical questions for the test.  These will typically be, "How much....?" and "How does XYZ change when...?" questions that will focus the analysis of the test results, and will also lead you very quickly to the key metrics for the test (which may or may not be money-related).

I am not proposing to pack away an extra 100 grams of chocolate per day and start running the 100 metres.  It's rained here every day since Christmas and I'm really not that dedicated to running.  I might, instead, start on an extra 100 grams of chocolate and measure my body mass, blood cholesterol and fat content.  All in the name of science, you understand. :-)

Monday, 6 January 2014

Chess: King's Gambit 1. e4 e5 2. f4

After my most recent post, where I played as Black against the Bird Opening 1. f4, in this post I'd like to cover another game where I again faced White playing an f4 opening - in this case, the King's Gambit.  I played this against one of my Kidsgrove Chess Club team mates, and this time, I lost.  Badly.  I had just suffered a difficult 32-move defeat against the team's top player, and I will make the excuse that I wasn't playing my best here.  I will cover that game in a later blog post.

I'm covering this game here because, interestingly, in the previous game where I played against 1. f4, I won by playing Qh4+ early on in the game.  If I'd been more aware, I might have seen it here, too.

David Johnson vs Dave Leese, 17 December 2013 Kidsgrove Chess Club (Friendly)

1.e4 e5
2.f4 Nc6
3.fxe5


It's unusual to have a critical point in a game so early, but here it is.  I played the natural recapture with Nxe5 and slowly got into all sorts of trouble.  I missed the Qh4+ move that I played (and won with) just two weeks earlier.

3. ... Nxe5 ( 3...Qh4+ )

Let's look at 3. ... Qh4+ before reviewing the actual game in full. There are two replies - to move the King to e2 or to block with a Pawn on g3.

3. ... Qh4+
4.  Ke2 Qxe4+
5.  Kf2 Bc5+



If then 6. Kg3, then Chessmaster 9000 (my preferred analysis tool) is already giving a mate in 8, starting with 6. ... h5.  The other option is 6. d4 and this is going to mean a significant loss of material - 6 ... Bxd4+ 7. Qxd4 Qxd4+ and White has delayed the inevitable at the cost of his Queen.

However, I completely missed this overwhelming attack, and instead struggled through a painful game where I drifted into all sorts of trouble.  Let's resume after 3. ... Nxe5.

4.d4 Ng6 (I could still have played ... Qh4+, or ... Bb4+ at this point).
5.Nf3 d6 (there are no more chances of ... Qh4+ now, and Chessmaster recommends ... d5).
6.Bc4 h6 (preventing Ng5 and an attack on f7 - in theory, anyway).
7.O-O Bg4?  (a big mistake, as we shall see.  Better was Nf6 or Be6... something - anything - to protect f7.  White's decision to castle was not just a natural move at this stage, it moved the Rook onto the semi-open and dangerous f-file).
The position after 7. ... Bg4 and immediately before Bxf7+!
8.Bxf7+! Ke7 ( not 8...Kxf7 which leads to Ng5++ and Nf7 forking Queen and Rook)
9.Bxg6 Nf6 (finally!)
10.Nc3 c6 (opening a diagonal for my Queen, and providing space for my King)
11.Qe1 Kd7 (taking advantage of the slow pace of the game to improve my King and Queen)
12.e5 Nd5
13.Nxd5 cxd5
White played Qg3? and missed exd6 with a large material gain.
14.Qg3? Be6 (my wayward Bishop finally gets a decent square, even if the Pawn on f7 is gone)
15.Nh4 Qb6 (developing, and attacking the newly-unprotected d4 Pawn)
16.c3 Be7 (perhaps ... dxe5 was better)
17.Nf5 Rhf8
18.Nxe7 Rxf1+
19.Kxf1 Rf8+ (No, I'm not sure why I threw this in.  I needed as many pieces as possible on the board, but at least I got rid of White's active Rook in exchange for my inactive one).
20.Kg1 Kxe7
21.exd6+ Kd7
22.Qe5 Qxd6 (offering an exchange, but also protecting the Rook on f8).
23.Qxg7+ Kc6
24.Qe5 Qxe5
25.dxe5 Rg8
( 25...Bh3 26.gxh3 Rg8 27.Bf4 Rxg6+ )
26.Bh5 Bh3
27.Bf3 Bf5
28.Bxh6 1-0


The final position.  I've run out of ideas, I'm a piece and three pawns down, and I've had enough!  Seeing afterwards that I missed several opportunities for a massive attack in the first few moves has made me more confident in my attacking options, and seeing how I missed my opponent's attack developing (in this game and the previous one) has made me even more aware of the need to defend accurately too.  Yes, I put up a fight, but really I was defending a lost cause after some daft blunders.  On with the next game!


Friday, 3 January 2014

Chess: Bird Opening 1. f4 e6

After a long break from Chess writing, I'm returning with an analysis of a game I recently played face-to-face.  I've played almost exclusively online for a number of years, and recently decided it was time to start playing 'real' people.  I've joined Kidsgrove Chess Club, and signed up to the English Chess Federation in order to obtain a ranking through the games I play.

My first game in this new face-to-face era is a friendly - the rest of the club were involved in a match against Stafford Chess Club, and I was able to play against one of the Stafford team after he completed his game.

I played Black, and for possibly the first time ever, I faced the Bird Opening, 1 f4.  My reply was 1. ... e6, intending at some early point to get my Queen to h4 and deliver a potentially uncomfortable check. 

David Barker vs Dave Leese, 4 December 2013, Kidsgrove Chess Club (Home, Friendly)

1.f4 e6
2.b3 d5
3.Bb2 Nf6
4.e3 c5

If there is space in the centre to be claimed, I'll claim it.  My opponent seems to be playing for a very slow build-up.  I am resisting my urge to play my normal, attacking game and am playing cautiously - after all, I don't want to lose quickly in my very first game in front of my new team-mates.  Also, I am concerned about the white Bishop on b2 and the way it looks into my kingside - I brought out my Knight to f6 in order to liberate my Bishop from having to defend h7.  In time, I may play d4 and look to shut White's Bishop out of the game.

5.Nf3 Nc6
6.Be2 Bd6
7.d4 

In this position, I opted to play 7. ... O-O.  I don't want to capture on d4 - White could recapture with his Bishop or his Knight and start to develop a grip on the centre.  I also have a brief tactic developing where I begin to attack the now-backward pawn on e3.  Here's what I was thinking at the time:

7. ... O-O
8. dxc5  Bxc5 attacking e3
9. Nd4  Nxd4

Now after:  10. exd4 White has weakened pawns, and I can attack them with 10. ... Bd6 11.  O-O Qc7, and I have the option of moving my Knight on f6 to an even better square.

The position after my theoretical 11. ... Qc7
However, the game didn't proceed that way at all, and we resume the game after my move 7. ... O-O

8.  Ne5?  Ne4

I was very surprised by my opponent's decision to play Ne5.  This Knight is the only piece protecting the dark squares on the kingside and preventing me from getting in a Qh4+ and hopefully initiating a king-side attack at some point.  I moved my Knight to e4 in order to open the diagonal for my Queen, and also to take a look at f2, the weak spot in white's position.  It's also a great outpost for my Knight, as my opponent has played f4 and d4.

9.  Nd2  Qh4+  0-1

I suspect my opponent saw the threat from my Knight on e4, but missed the Queen check, and after a moment's thought, resigned the game.  The immediate continuation is 10. g3 Nxg3 11. Rg1 Qxh2, or 11. Ndf3 Qh5, which wins a pawn and starts a moderate king-side attack (preventing White from castling king-side and causing longer-term complications).  I was surprised at the early resignation, but pleased that my first face-to-face game in front of my new team-mates was a win.

The position after 9 ... Qh4+ and White resigned.

We took back White's ninth move, and instead White played 9. O-O, and play continued:

9.  O-O  f6?
10.Nxc6 bxc6

11.Nd2 Nxd2  (I'm not sure I should have given up my Knight on this great outpost, but I suspect my opponent would have swapped them off anyway).
12.Qxd2 cxd4

I took this pawn in order to straighten out my own doubled pawns.

13.Bxd4 c5
14.Bb2 Bb7
15.Rad1 Qb6?


I missed the gathering threat on the d-file.  I moved my Queen to an attacking position, planning to advance my c-pawn and expose an attack on the diagonal to the King, but overlooked the attack on my own d-pawn.  Following this, I got into a potentially very messy position where I could have lost at least a pawn.

16.c4 Qc6
17.Bf3 Qa6
18.Bc3 Bc6
19. e4?

After a lot of shuffling around (which surprised me, I was sure my position was going to fall apart) my opponent played 19. e4.  I was lining up my Bishop and Queen so that I could re-capture on d4 with my Bishop before my Queen.  One benefit of 17... Qa6 was the attack on a2, which required a defence, but otherwise, I was scrambling around for acceptable moves until this point.  19. e4 gave me the chance I needed to reinforce my position and get out of trouble, and I played this move very quickly.

19.... d4
20.Ba1 Rab8
(moving onto the semi-open file)

21.Qe2? Bxf4
22.Bg4 Be3+
23.Kh1 Bxe4

My opponent later said he wasn't having a great night - he'd previously played another member of the Kidsgrove team and lost, and I guess he was becoming tired.  Or perhaps he was just having a bad day; either way, through two blunders I had gone two pawns up (although 22. ... Be3+ is one of my favourite moves of this game), and after 23. ... Bxe4 they were connected passed pawns on the d- and e-files.  There were a few exchanges made as I started to trade off pieces, then I began advancing my passed pawns and getting my rooks involved.  After move 29, my opponent resigned, as I completed my defence and started looking to push my passed pawns.

24.Bf3 Bxf3
25.Rxf3 e5
26.Rg3 Rfe8
27.Qc2 e4
28.Qe2 Bf4
29.Rh3 Rbd8 0-1


The final position, after 29 Rbd8.

Not a perfect game, and I made a few blunders throughout, but I held it together and took the opportunities.  We discussed possible continuations, and Qh5 and Qxc5 look like a good start for White, but Black can reply with Qxa2, and Black's passed pawns present a continued threat.  All in all, an interesting game, and an enjoyable return to face-to-face Chess.

I have played some better games though, and I can recommend these:

Playing the English Defence
My first face-to-face club game
My earliest online Chess game
My very earliest Chess game (it was even earlier than I thought)
The Chess game I'm most proud of - where I made the situation too complicated for my opponent, causing him to lose a piece; I then found a fork and finished off with a piece sacrifice

Wednesday, 24 July 2013

The Science of A Good Hypothesis

Good testing requires many things:  good design, good execution, good planning.  Most important of all is a good idea - or rather, a good hypothesis - but many people jump into testing without a good reason for doing so.  After all, testing is cool, it's capable of fixing all my online woes, and it'll produce huge improvements to my online sales, won't it?

I've talked before about good testing, and, "Let's test this and see if it works," is an example of poor test planning.  A good idea, backed up with evidence (data, or usability testing, or other valid evidence) is more likely to lead to a good result.  This is the basis of a hypothesis, and a good hypothesis is the basis of a good test.

What makes a good hypothesis?  What, and why.

According to Wiki Answers, a hypothesis is, "An educated guess about the cause of some observed (seen, noticed, or otherwise perceived) phenomena, and what seems most likely to happen and why. It is a more scientific method of guessing or assuming what is going to happen."

In simple testing terms, a hypothesis states what you are going to test (or change) on a page, what the effect of the change will be, and why the effect will occur.  To put it another way, a hypothesis is an "If... then... because..." statement.  "If I eat lots of chocolate, then I will run more slowly because I will put on weight."  Or, alternatively, "If I eat lots of chocolate, then I will run faster because I will have more energy." (I wish).



However, not all online tests are born equal, and you could probably place the majority of them into one of three groups, based on the strength of the original theory.  These are tests with a hypothesis, tests with a HIPPOthesis and tests with a hippiethesis.

Tests with a hypothesis

These are arguably the hardest tests to set up.  A good hypothesis will rely on the test analyst sitting down with data, evidence and experience (or two out of three) and working out what the data is saying.  For example, the 'what' could be that you're seeing a 93% drop-off between the cart and the first checkout page.   Why?  Well, the data shows that people are going back to the home page, or the product description page.  Why?  Well, because the call-to-action button to start checkout is probably not clear enough.  Or we aren't confirming the total cost to the customer.  Or the button is below the fold.

So, you need to change the page - and let's take the button issue as an example for our hypothesis.  People are not progressing from cart to checkout very well (only 7% proceed).  [We believe that] if we make the call to action button from cart to checkout bigger and move it above the fold, then more people will click it because it will be more visible.

There are many benefits of having a good hypothesis, and the first one is that it will tell you what to measure as the outcome of the test.  Here, it is clear that we will be measuring how many people move from cart to checkout.  The hypothesis says so.  "More people will click it" - the CTA button - so you know you're going to measure clicks and traffic moving from cart to checkout.  A good hypothesis will state after the word 'then' what the measurable outcome should be.

In my chocolate example above, it's clear that eating chocolate will make me either run faster or slower, so I'll be measuring my running speed.  Neither hypothesis (the cart or the chocolate) has specified how big the change is.  If I knew how big the change was going to be, I wouldn't test.  Also, I haven't said how much more chocolate I'm going to eat, or how much faster I'll run, or how much bigger the CTA buttons should be, or how much more traffic I'll convert.  That's the next step - the test execution.  For now, the hypothesis is general enough to allow for the details to be decided later, but it frames the idea clearly and supports it with a reason why.  Of course, the hypothesis may give some indication of the detailed measurements - I might be looking at increasing my consumption of chocolate by 100 g (about 4 oz) per day, and measuring my running speed over 100 metres (about 100 yds) every week.

Tests with a HIPPOthesis

The HIPPO, for reference, is the HIghest Paid Person's Opinion (or sometimes just the HIghest Paid PersOn).  The boss.  The management.  Those who hold the budget control, who decide what's actionable, and who say what gets done.  And sometimes, what they say is that, "You will test this".  There's virtually no rationale, no data, no evidence or anything.  Just a hunch (or even a whim) from the boss, who has a new idea that he likes.  Perhaps he saw it on Amazon, or read about it in a blog, or his golf partner mentioned it on the course over the weekend.  Whatever - here's the idea, and it's your job to go and test it.

These tests are likely to be completely variable in their design.  They could be good ideas, bad ideas, mixed-up ideas or even amazing ideas.  If you're going to run the test, however, you'll have to work out (or define for yourself) what the underlying hypothesis is.  You'll also need to ask the HIPPO - very carefully - what the success metrics are.  Be prepared to pitch this question somewhere between, "So, what are you trying to test?" and "Are you sure this is a productive use of the highly skilled people that you have working for you?"  Any which way, you'll need the HIPPO to determine the success criteria, or agree to yours - in advance.  If you don't, you'll end up with a disastrous recipe being declared a technical winner because it (1) increased time on page, (2) increased time on site or (3) drove more traffic to the Contact Us page, none of which were the intended success criteria for the test, or were agreed up-front, and which may not be good things anyway.

If you have to run a test with a HIPPOthesis, then write your own hypothesis and identify the metrics you're going to examine.  You may also want to try to add one of your own recipes which you think will solve the apparent problem.  But at the very least, nail down the metrics...

Tests with a hippiethesis
Hippie:  noun
a person, especially of the late 1960s, who rejected established institutions and values and sought spontaneity, etc., etc.  Also hippy

The final type of test idea is a hippiethesis - laid back, not too concerned with details, spontaneous and putting forward an idea because it looks good on paper.  "Let's test this because it's probably a good idea that will help improve site performance."  Not as bad as the "Test this!" that drives a HIPPOthesis, but not fully formed as a hypothesis, the hippiethesis is probably (and I'm guessing) the most common type of test.

Some examples of hippietheses:


"If we make the product images better, then we'll improve conversion."
"The data shows we need to fix our conversion funnel - let's make the buttons blue  instead of yellow."
"Let's copy Amazon because everybody knows they're the best online."

There's the basis of a good idea somewhere in there, but it's not quite finished.  A hippiethesis will tell you that the lack of a good idea is not a problem, buddy, let's just test it - testing is cool (groovy?), man!  The results will be awesome.  

There's a laid-back approach to the test (either deliberate or accidental), where the idea has not been thought through - either because "You don't need all that science stuff", or because the evidence to support a test is very flimsy or even non-existent.  Perhaps the test analyst didn't look for the evidence; perhaps he couldn't find any.  Maybe the evidence is mostly there somewhere because everybody knows about it, but isn't actually documented.  The danger here is that when you (or somebody else) start to analyse the results, you won't recall what you were testing for, what the main idea was or which metrics to look at.  You'll end up analysing without purpose, trying to prove that the test was a good idea (and you'll have to do that before you can work out what it was that you were actually trying to prove in the first place).

The main difference between a hypothesis and a hippiethesis is the WHY.  Online testing is a science, and scientists are curious people who ask why.  Web analyst Avinash Kaushik calls it the 'three levels of so what' test.  If you can't get to something meaningful and useful - or in this case, testable and measurable - within three iterations of "Why?" then you're on the wrong track.  Hippies don't bother with 'why' - that's too organised, formal and part of the system; instead, they'll test because they can, and because - as I said - testing is groovy.

A good hypothesis:  IF, THEN, BECAUSE.

To wrap up:  a good hypothesis needs three things:  If (I make this change to the site) Then (I will expect this metric to improve) because (of a change in visitor behaviour that is linked to the change I made, based on evidence).


When there's no if:  you aren't making a change to the site, you're just expecting things to happen by themselves.  Crazy!  If you reconsider my chocolate hypothesis, without the if, you're left with, "I will run faster and I will have more energy".  Alternatively, "More people will click and we'll sell more."  Not a very common attitude in testing, and more likely to be found in over-optimistic entrepreneurs :-)

When there's no then:  If I eat more chocolate, I will have more energy.  So what?  And how will I measure this increased energy?  There are no metrics here.  Am I going to measure my heart rate, blood pressure, blood sugar level or body temperature??  In an online environment:  will this improve conversion, revenue, bounce rate, exit rate, time on page, time on site or average number of pages per visit?  I could measure any one of these and 'prove' the hypothesis.  At its worst, a hypothesis without a 'then' would read as badly as, "If we make the CTA bigger, [then we will move more people to cart], [because] more people will click." which becomes "If we make the CTA bigger, more people will click."  That's not a hypothesis, that's starting to state the absurdly obvious.


When there's no because:  If I eat more chocolate, then I will run faster.  Why?  Why will I run faster?  Will I run slower?  How can I run even faster?  There are metrics here (speed) but there's no reason why.  The science is missing, and there's no way I can actually learn anything from this and improve.  I will execute a one-off experiment and get a result, but I will be none the wiser about how it happened.  Was it the sugar in the chocolate?  Or the caffeine?

And finally, I should reiterate that an idea for a test doesn't have to be detailed, but it must be backed up by data (some, even if it's not great).  The more evidence the better:  think of a sliding scale from no evidence (could be a terrible idea), through to some evidence (a usability review, or a survey response, prior test result or some click-path analysis), through to multiple sources of evidence all pointing the same way - not just one or two data points, but a comprehensive case for change.  You might even have enough evidence to make a go-do recommendation (and remember, it's a successful outcome if your evidence is strong enough to prompt the business to make a change without testing).

Wednesday, 3 July 2013

Getting an Online Testing Program Off The Ground

One of the unplanned topics from one of my xChange 2013 huddles was how to get an online testing program up and running, and how to build its momentum.  We were discussing online testing more broadly, and this subject came up.  Getting a test program up and running is not easy, but during our discussion a few useful hints and tips emerged, and I wanted to add to them here.
Sometimes, launching a test program is like defying gravity.

Selling plain web analytics isn't easy, but once you have a reporting and analytics program up and running, and you're providing recommendations which are supported by data and seeing improvements in your site's performance, then the next step will probably be to propose and develop a test.  Why test?


On the one hand, if your ideas and recommendations are being wholeheartedly received by the website's management team, then you may never need to resort to a test.  If you can show with data (and other sources, such as survey responses or other voice-of-customer sources) that there's an issue on your site, and if you can use your reporting tools to show what the problem probably is - and then get the site changed based on your recommendations - and then see an improvement, then you don't need to test.  Just implement!

However, you may find that you have a recommendation, backed by data, that doesn't quite get universal approval. How would the conversation go?

"The data shows that this page needs to be fixed - the issue is here, and the survey responses I've looked at show that the page needs a bigger/smaller product image."
"Hmm, I'm not convinced."

"Well, how about we try testing it then?  If it wins, we can implement; if not, we can switch it off."
"How does that work, then?"


The ideal 'we love testing' management meeting.

This is idealised, I know.  But you get the idea, and then you can go on to explain the advantages of testing compared to having to implement and then roll back (when the sales figures go south).


The discussions we had during xChange showed that most testing programs were being initiated by the web analytics team - there were very few (or no) cases where management started the discussion or wanted to run a test.  As web professionals, supporting a team with sales and performance targets, we need to be able to use all the online tools available to us - including testing - so it's important that we know how to sell testing to management, and get the resources that it needs.  From management's perspective, analytics requires very little support or maintenance (compared to testing) - you tag the site (once, with occasional maintenance) and then pay any subscriptions to the web analytics provider, and pay for the staff (whether that's one member of staff or a small team).  Then - that's it.  No additional resource needed - no design, no specific IT, no JavaScript developers (except for the occasional tag change, maybe).  And every week, the mysterious combination of analyst plus tags produces a report showing how sales and traffic figures went up, down or sideways.

By contrast, testing requires considerable resource.  The design team will need to provide imagery and graphics, guidance on page design and so on.  The JavaScript developers will need to put mboxes (or the test code) around the test area; the web content team will also need to understand the changes and make them as necessary.  And that's just for one test.  If you're planning to build up a test program (and you will be, in time) then you'll need to have the support teams available more frequently.  So - what are the benefits of testing?  And how do you sell them to management, when they're looking at the list of resources that you're asking for?

How to sell testing to management

1.  Testing gives you the opportunity to test something that the business is already thinking of changing.  A change of banners?  A new page layout?  As an analyst, you'll need to be ahead of the change curve to do this, and aware of changes before they happen, but if you get the opportunity then propose to test a new design before it goes live.  This has the advantage that most of the resource overhead is already taken into account (you don't need to design the new banner/page), but it has one significant disadvantage:  you're likely to find that there's a major bias towards the new design, and management may just go ahead and implement anyway, even if the test shows negative results for it.

2.  A good track record of analytics wins will support your case for testing.  You don't have to go back to prior analysis or recommendations and be as direct as, "I told you so," but something like, "The changes made following my analysis and recommendations on the checkout pages have led to an improvement in sales conversion of x%." is likely to be more persuasive.  And this brings me neatly on to my next suggestion.

3.  Your main aim in selling testing is to ensure you can  get the money for testing resources, and for implementation.  As I mentioned above, testing takes time, resource and expertise - or, to put it another way, money.  So you'll need to persuade the people who hold the money that testing is a worthwhile investment.  How?  By showing a potential return on that investment.

"My previous recommendation was implemented and achieved a £1k per week increase in revenue.  Additionally, if this test sees a 2% lift in conversion, that will be equal to £3k per week increase in revenue."

It's a bit of a gamble, as I've mentioned previously in discussing testing - you may not see a 2% lift in conversion, it may go flat or negative.  But the main focus for the web channel management is going to be money:  how can we use the site to make more money?  And the answer is: by improving the site.  And how do we know if we're improving the site? Because we're testing our ideas and showing that they're better than the previous version.
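To show where a figure like "£3k per week" might come from, here's a sketch of the arithmetic in Python.  The baseline numbers are invented - the projection is only as good as the baselines and the assumed lift.

# Invented baseline figures for illustration only.
weekly_visits       = 100_000
baseline_conversion = 0.030     # 3.0% of visits place an order
average_order_value = 50.0      # £ per order
relative_lift       = 0.02      # the hoped-for 2% relative lift in conversion

baseline_revenue  = weekly_visits * baseline_conversion * average_order_value
projected_revenue = weekly_visits * baseline_conversion * (1 + relative_lift) * average_order_value

print(f"Baseline revenue:  £{baseline_revenue:,.0f} per week")
print(f"Projected uplift:  £{projected_revenue - baseline_revenue:,.0f} per week")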

You do have the follow-up argument (if the test does win) that, "If you don't implement this test win, it will cost..." - because there, you'll know exactly what the uplift is and you'll be able to present some useful financial data (assuming that yesterday's winner is not today's loser!).  Talk about £, $ or Euros... sometimes, it's the only language that management speak.


4.  Don't be afraid to carry out tests on the same part of a page.  I know I've covered this previously - but it reduces your testing overhead, and it also forces you to iterate.  It is possible to test the same part of a page without repeating yourself.  You will need to have a test program, because you'll be testing on the same part of a page, and you'll need to consult your previous tests (winners, losers and flat results) to make sure you don't repeat them.  And on the way, you'll have the chance to look at why a test won, or didn't, and try to improve.  That is iteration, and iteration is a key step from just testing to having a test program.

5.  Don't be afraid to start by testing small areas of a page.  Testing full-page redesigns is lengthy, laborious and risky.  You can get plenty of testing mileage out of testing completely different designs for a small part of a page - a banner, an image, wording... remember that testing is a management expense for the time being, not an investment, and you'll need to keep your overheads low and have good potential returns (either financial, or learning, but remember that management's primary language is money).


6.  Document everything!  As much as possible - especially if you're only doing one or two tests at a time.  Ask the code developers to explain what they've done, what worked, what issues they faced and how they overcame them.  It may be all code to you, but in a few months' time, when you're talking to a different developer who is not familiar with testing and test code, your documentation may be the only thing that keeps your testing program moving.

Also - and I've mentioned this before - document your test designs and your results.  Even if you're the only test analyst in your company, you'll need a reference library to work from, and one day, you might have a colleague or two and you'll need to show them what you've done before.

So, to wrap up - remember - it's not a problem if somebody agrees to implement a proposed test.  "No, we won't test that, we'll implement it straight away."  You made a compelling case for a change - subsequently, you (representing the data) and management (representing gut feeling and intuition) agreed on a course of action.  Wins all round.

Setting up a testing program and getting management involvement requires some sales technique, not just data and analysis, so it's often outside the analyst's usual comfort zone. However, with the right approach to management (talk their language, show them the benefits) and a small but scalable approach to testing, you should - hopefully - be on the way to setting up a testing program, and then helping your testing program to gain momentum.


Similar posts I've written about online testing

How many of your tests win?
The Hierarchy of A/B Testing

Monday, 24 June 2013

Iterating, Creating, Risk and Reward - Discussion

My second huddle at XChange 2013 Berlin looked at what to test, how to set up a testing program and how to get management buy-in.  We talked about the best way to get a test program set up, how to achieve critical mass and how to build momentum for an online testing program.  

I was intending to revisit some of the topics from my earlier post on creative versus iterative testing, but the discussion (as with my first huddle on yesterday's winner, today's loser) very quickly went off on a tangent and never looked back!

There are a number of issues in either starting or building a testing program - here are a few that we discussed:

Lack of management buy-in
Selling web analytics and reporting is not always easy, especially if you're working in (or with) a company that's largely focused on its high-street bricks-and-mortar presence, or if the company is historically a telephone or catalogue business.  Trying to sell the idea of online testing can be very tricky indeed.  "Why should we test - we know what's best anyway!" is a common response, but the truth is that intuition is rarely right 100% of the time; here are a few counter-arguments that you may (or may not) want to try:

"Would you like to submit your own design to include in the test?"

"Could you suggest some other ideas for improving this banner/button/page?"
"Do you think there is a different way we could improve the page and reach/exceed our sales target?"

Another way of getting management (and other staff, colleagues and stakeholders) to engage with the test is to ask them to guess which recipe or design will win - and put their names to it.  If you can market this well, then very quickly, people will start asking how the test is going, and whether their design is winning.  Better still, if their design is losing, they'll probably want to know why, and might even start (1) interrogating the data and (2) designing a follow-up test.

As we commented during our discussion, it's worth saying that you may need to distinguish between a bad recipe and a good manager.  "Yes, you are still a good analyst or manager or designer, it's just that people didn't like your design."


Lack of resource
This could be a lack of IT support, design support or JavaScript developer time.  Almost all tests are dependent on some sort of IT and design support (although I have heard of analysts and testers testing their own Photoshop creative).  It's difficult - as we'll see below - because without design support, you are restricted in what you can test.  However, there are a number of test areas that you can work on which are light on design, are light on code maintenance, and which could potentially show useful (and even positive) test results.

 - banner imagery - to include having people or no people; a picture of the product or no product
 - banner wording - buy-one-get-one-free, or two-for-one, or 50% off?  Or maybe even 'Half price'?  Wording will probably require even less design work than imagery, and you (as the tester, or analyst) may even be able to set this one up yourself.
 - calls to action - Continue?  Add to cart?  Add to basket?  Select?  Make payment?  This site has a huge gallery of continue shopping buttons (for when a customer has added an item to basket and you want to persuade them to keep shopping).  There are some suggestions on which may work best - and they don't even change the wording.  There are many other things to try - colour; arrow or no arrow; CAPITALS or Initial Capitals?

The advantage of these tests is that they can be carried out on the same area of the same page - once the test code has been inserted by the IT or JavaScript teams, you can set up a series of tests just by changing the creative that is being tried.  Many of those in the huddle said that once they had obtained a winner, they would then push that to 100% traffic through the testing software until the next test was ready - further reducing the dependency on IT support.

How to sell flat results

There is nothing worse for an analyst or tester than finding out that the test results are flat (there's no significant difference in the performance of the test recipes - all the results were the same).  The test has taken months to sell, weeks to design and code, and a few weeks to run, and the results say that there's no difference between the original version (which may have had management backing) and your new analytics-backed version.  And what do you get?  "You said that online testing would improve our performance by 2%, 5%, 7.5%..."

Actually, the results only appear to say there's no difference... so it's time to do some digging!

Firstly, was the difference between the two test recipes large enough and distinct enough?  One member of the huddle quoted the Eisenberg brothers: "If you ask people if they prefer green apples or red apples, you're unlikely to get a difference.  If you ask them if they prefer apples or chocolate, you'll see a result."



This is something to consider before the test - are the recipes different enough?  It's not always easy to say in advance (!) and there is a greater risk of the test recipe losing if the design is too different, but that's the point - iterating is 'safer' than creating, but does include the possibility that it may go flat.  How much risk you're prepared to take may depend on external factors such as how much design resource you can obtain and how important it is to get a non-zero result.
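As a rough illustration of why subtle differences often end up looking 'flat', here's a standard two-proportion sample-size approximation in Python (95% confidence, 80% power).  The conversion rates are invented, and this is only a ballpark sketch, not a substitute for your testing tool's own calculator.

# Rough sample size per recipe needed to detect a difference between two conversion rates.
z_alpha = 1.96   # 95% confidence, two-sided
z_beta  = 0.84   # 80% power

def sample_size_per_recipe(p1, p2):
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ((z_alpha + z_beta) ** 2) * variance / (p1 - p2) ** 2

print(f"Subtle change (3.0% vs 3.3%): {sample_size_per_recipe(0.030, 0.033):,.0f} visitors per recipe")
print(f"Bold change   (3.0% vs 4.5%): {sample_size_per_recipe(0.030, 0.045):,.0f} visitors per recipe")

The bolder the difference between recipes, the less traffic (and time) you need before a flat-looking result really does mean 'no difference'.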

Secondly:  analysing flat results will require some concerted data analysis.  Overall, the number of orders and the average order value for the two recipes were the same...

But how many people clicked on the new banner?  Or how many people bounced or exited from the test page?

Did you get more people to click on your new call-to-action button - and then those people left at the next page? Why?

Did the banner work better for higher-value customers, who then left on the next page because the item they were actually looking for wasn't featured?  Did all visitor segments behave in the same way?


Was there a disconnect between the call to action and the next page?  Was the next page really what people would have expected?  


Did you offer a 50%-off deal but then not make it clear in the checkout process?  It's human nature to study and review a test loss, to accept a win without too much study and to completely write off a flat result, but by applying the same level of rigour to a 'flat' result as to a loss, it's still possible to learn something valuable.
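A simple way to apply that rigour is to break the 'flat' overall result down by segment.  Here's a minimal Python sketch with invented numbers, where the overall conversion is identical for both recipes but the segments tell a very different story:

# Invented figures: (visits, orders) for control and test, split by segment.
results = {
    "new visitors":    {"control": (20_000, 500),   "test": (20_000, 650)},
    "repeat visitors": {"control": (30_000, 1_000), "test": (30_000, 850)},
}

for segment, recipes in results.items():
    c_visits, c_orders = recipes["control"]
    t_visits, t_orders = recipes["test"]
    print(f"{segment:15s}  control {c_orders / c_visits:.2%}   test {t_orders / t_visits:.2%}")

# Overall: both recipes convert 1,500 orders from 50,000 visits (3.00%) - a 'flat' result
# hiding one segment that improved and one that got worse.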

How do you set up a testing program?
We discussed how managers and clients generally prefer to start a testing program in the checkout process - it's a nice, easy, linear funnel with plenty of potential for optimisation, and it's very close to the money.  If you improve a checkout page, then the financial metrics will automatically improve as a result.

But how do you test on the product description pages, where visitors browse around before selecting an item?  We talked about page purpose:  what is the idea of a page?  What's the main action that you want a user to take after they have seen this page?  Is it to complete a lead generation form?  Is it to call the sales telephone line?  Is it to 'add to cart'?  The success metric for the page should be the key success metric for the test.  You'll need to keep an eye on the end-of-funnel metrics (conversion, order value, and so on), but providing those are flat or trending positively, you can use the page-purpose metrics to measure the success of your test.  If you're tracking an offline conversion (calls to the sales line, for example) then you'll need to do some extra preparatory work - for example, setting up one telephone line per recipe and arranging to track the volumes of telephone calls - but it'll make the test result more useful.

Tracking page-purpose success metrics will also enable you to run tests more quickly.  If you can see a definite, confident lift in a page-purpose metric, while the overall financial metrics are flat or positive, then you can call a winner before you reach confidence in the overall metrics.  The further you are from the checkout process (and the final order page), the longer it is likely to take for an uplift in page performance to filter through to the financial results (in terms of testing time), but you can be happy that you are improving your customers' experience.
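For the 'definite, confident lift' part, a two-proportion z-test is one common way to check a page-purpose metric.  Here's a minimal sketch in Python with invented counts - your testing tool will have its own (better) statistics, so treat this only as an illustration:

from math import erf, sqrt

def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Compare two rates (e.g. 'add to cart' clicks per visit) for recipes A and B."""
    p_a, p_b = successes_a / trials_a, successes_b / trials_b
    p_pool = (successes_a + successes_b) / (trials_a + trials_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / trials_a + 1 / trials_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided, normal approximation
    return z, p_value

z, p = two_proportion_z_test(1_800, 24_000, 1_980, 24_000)   # invented click counts
print(f"z = {z:.2f}, p = {p:.3f}")   # a small p-value suggests a real lift in the page metric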


Documentation

Another valuable way of helping to build a testing program, and enabling it to develop, is to document your tests.  When a test is completed, you'll probably be presenting to the management and the stakeholders - this is also a great opportunity to present to the people who contributed to your test: the designers, the developers, IT and so on.  This applies especially if the test is a winner!

When the presentation is completed, file the results deck on a network drive, or somewhere that is widely accessible.  Start to build up a list of test recipes, results and findings.  We discussed whether this is a worthwhile exercise - it's time-consuming, laborious, and if there's only one analyst working on the test program, it can seem unnecessary.

However, this has a number of benefits:

- you can start to iterate on previous tests (winners, losers and flat results), and this means that future tests are more likely to be successful ("We did this three weeks ago and the results were good, let's try to make them even better")

- you can avoid repeating tests, which is a waste of time, resource and energy ("We did this two months ago and the results were negative")

- you can start to understand your customers' behaviour and design new tests (based on the data) which are more likely to win.  ("This test showed our visitors preferred this... therefore I suspect they will also prefer this...")

It's also useful when and if the team starts to grow (which is a positive result of a growing testing program) as you can share all the previous learnings.  

These benefits will help the testing program gain momentum, so that you can start iterating and spend less time repeating yourself.  Hopefully, you'll find that you have fewer meetings where you have to sell the idea of testing - you can point back at prior wins and say to the management, "Look, this worked and achieved 3% lift," and, if you're feeling brave, "And look, you said this recipe would win and it was 5% below the control recipe!"

The discussion ran for 90 minutes, and we discussed even more than this... I just wish I'd been able to write it all down.  I'd like to thank all the huddle participants, who made this a very interesting and enjoyable huddle!

Wednesday, 19 June 2013

Why is yesterday's test winner today's loser?

This post comes out of the xChange Berlin huddle which I led on 11 June 2013.  xChange is very different from most web analytics conferences - most conferences have speakers and presentations, but xChange is focused around web analytics professionals meeting and discussing in small workshop groups.  As the xChange website describes it:
"Expressly designed for enterprise analytics managers and digital marketing and measurement practitioners, X Change brings together top professionals in the field in a no-sales, all business, peer-to-peer environment for deep-dives into cutting edge online measurement topics."

At xChange Berlin 2013, I led two huddle groups - this was the first, entitled, "Why is yesterday's test winner today's loser?".  I haven't attributed the content here to any particular participant - this is just a summary of our discussions.  I should say now that the discussion was not even close to what I'd anticipated, but was even more interesting as a result!


The discussion kicked off with a review of a test win.  
Let's suppose that you have run your A/B test, and you have a winner.  You ran it for long enough to achieve statistical significance and even achieved consistent trend lines.  But somehow, when you implemented it, your financial metrics didn't show the same level of improvement as your test results.  And now, the boss has come to your desk to ask if your test was really valid.  "What happened?  Why is yesterday's test winner today's loser?"

There are a number of reasons for this - let's take a look.

External factors
Yes, A/B tests split your traffic evenly between the test recipes, so that most external factors are accounted for.  But what happens if your test was running while you had a large-scale TV campaign, or a display or PPC campaign?  Yes, that traffic would have been split between your test recipes, so the effect is - apparently - mitigated.  But what if the advertising campaign resonated particularly with your test recipe, which went on to win?  During a non-campaign period, the control recipe might have performed better, or perhaps the results would have been more similar.  Consequently, the uplift that you saw during the test would not be achieved in normal conditions.

Customer Experience Changes
When we start a test, there is quite often a dip in performance for the test recipe.  It's new.  It's unfamiliar and users have to become accustomed to it.  It often takes a week or so for visitors to get used to it, and for accurate, meaningful and useful test results to develop.  In particular, frequent repeat visitors will take some time to adjust to the changes (how often repeat visitors return will depend on your site).  The same issue applies when you implement a winner - now, the whole population is seeing a new design, and it will take some time for them to adjust.

Visitor Segments
Perhaps the test recipe worked especially well with a particular visitor segment?  Maybe new visitors, or search visitors, or visitors from social media, and that was responsible for the uplift.  You have assumed (one way or another) that your population profile is fairly constant.  But if you identify that your test recipe won because one or two segments really engaged with it, then you may not see the uplift if your population profile changes.  What should you do instead?  Set up a targeting implementation: target specific visitors, based on your test results, who engaged more (or converted better) with the test recipe.  Show everybody else the same version of your site as usual, but for visitors who fit into a specific segment - show them the test recipe.  I'll discuss targeting again at a later date, but here's a post I wrote a few months ago about online personalisation.

Time lapse between test win and implementation
This varied among the members of the group - where a company has a test plan and there's a need to get the next test up and running, it may not be possible to implement a winner straight away.  It also depends on what's being tested - can the test recipe be implemented immediately through the site team or CMS, or will it require IT roadmap work?  Most of the group would use the testing software (for example, Test and Target, or Visual Website Optimiser) to immediately set a winning recipe to 100% of traffic (or 95%) until the change could be made permanently.  Setting a winning recipe to 95% instead of 100% in effect enables the test to run for longer - you can continue to show that the test recipe is winning.  It also means that visitors who were in the control group during the test (i.e. saw "Recipe A") will continue to see that recipe until the implementation is complete - a better customer experience for that group?  Something to think about!

My next post will be about the second huddle that I led, which was based on iterating vs creating.  The title came from my recent blog post on iterative testing, but the discussion went in a very different direction, and again, was better for it!