Wednesday 24 July 2013

The Science of A Good Hypothesis

Good testing requires many things:  good design, good execution, good planning.  Most important is a good idea - or a good hypothesis, but many people jump into testing without a good reason for testing.  After all, testing is cool, it's capable of fixing all my online woes, and it'll produce huge improvements to my online sales, won't it?

I've talked before about good testing, and, "Let's test this and see if it works," is an example of poor test planning.  A good idea, backed up with evidence (data, or usability testing, or other valid evidence) is more likely to lead to a good result.  This is the basis of a hypothesis, and a good hypothesis is the basis of a good test.

What makes a good hypothesis?  What, and why.

According to Wiki Answers, a hypothesis is, "An educated guess about the cause of some observed (seen, noticed, or otherwise perceived) phenomena, and what seems most likely to happen and why. It is a more scientific method of guessing or assuming what is going to happen."

In simple, testing terms, a hypothesis states what you are going to test (or change) on a page, what the effect of the change will be, and why the effect will occur.  To put it another way, a hypothesis is an "If ... then... because..." statement.  "If I eat lots of chocolate, then I will run more slowly because I will put on weight."  Or, alternatively, "If I eat lots of chocolate, then I will run faster because I will have more energy." (I wish).

However, not all online tests are born equal, and you could probably place the majority of them into one of three groups, based on the strength of the original theory.  These are tests with a hypothesis, tests with a HIPPOthesis and tests with a hippiethesis.

Tests with a hypothesis

These are arguably the hardest tests to set up.  A good hypothesis will rely on the test analyst sitting down with data, evidence and experience (or two out of three) and working out what the data is saying.  For example, the 'what' could be that you're seeing a 93% drop-off between the cart and the first checkout page.   Why?  Well, the data shows that people are going back to the home page, or the product description page.  Why?  Well, because the call-to-action button to start checkout is probably not clear enough.  Or we aren't confirming the total cost to the customer.  Or the button is below the fold.

So, you need to change the page - and let's take the button issue as an example for our hypothesis.  People are not progressing from cart to checkout very well (only 7% proceed).  [We believe that] if we make the call to action button from cart to checkout bigger and move it above the fold, then more people will click it because it will be more visible.

There are many benefits of having a good hypothesis, and the first one is that it will tell you what to measure as the outcome of the test.  Here, it is clear that we will be measuring how many people move from cart to checkout.  The hypothesis says so.  "More people will click it" - the CTA button - so you know you're going to measure clicks and traffic moving from cart to checkout.  A good hypothesis will state after the word 'then' what the measurable outcome should be.


In my chocolate example above, it's clear that eating chocolate will make me either run faster or slower, so I'll be measuring my running speed.  Neither hypothesis (the cart or the chocolate) has specified how big the change is.  If I knew how big the change was going to be, I wouldn't test.  Also, I haven't said how much more chocolate I'm going to eat, or how much faster I'll run, or how much bigger the CTA buttons should be, or how much more traffic I'll convert.  That's the next step - the test execution.  For now, the hypothesis is general enough to allow for the details to be decided later, but it frames the idea clearly and supports it with a reason why.  Of course, the hypothesis may give some indication of the detailed measurements - I might be looking at increasing my consumption of chocolate by 100 g (about 4 oz) per day, and measuring my running speed over 100 metres (about 100 yds) every week.

Tests with a HIPPOthesis

The HIPPO, for reference, is the HIghest Paid Person's Opinion (or sometimes just the HIghest Paid PersOn).  The boss.  The management.  Those who hold the budget control, who decide what's actionable, and who say what gets done.  And sometimes, what they say is that, "You will test this".  There's virtually no rationale, no data, no evidence or anything.  Just a hunch (or even a whim) from the boss, who has a new idea that he likes.  Perhaps he saw it on Amazon, or read about it in a blog, or his golf partner mentioned it on the course over the weekend.  Whatever - here's the idea, and it's your job to go and test it.

These tests are likely to be completely variable in their design.  They could be good ideas, bad ideas, mixed-up ideas or even amazing ideas.  If you're going to run the test, however, you'll have to work out (or define for yourself) what the underlying hypothesis is.  You'll also need to ask the HIPPO - very carefully - what the success metrics are.  Be prepared to pitch this question somewhere between, "So, what are you trying to test?" and "Are you sure this is a productive use of the highly skilled people that you have working for you?"  Any which way, you'll need the HIPPO to determine the success criteria, or agree to yours - in advance.  If you don't, you'll end up with a disastrous recipe being declared a technical winner because it (1) increased time on page, (2) increased time on site or (3) drove more traffic to the Contact Us page, none of which were the intended success criteria for the test, or were agreed up-front, and which may not be good things anyway.

If you have to run a test with a HIPPOthesis, then write your own hypothesis and identify the metrics you're going to examine.  You may also want to try and add one of your own recipes which you think will solve the apparent problem.  But at the very least, nail down the metrics...

Tests with a hippiethesis
Hippie:  noun
a person, especially of the late 1960s, who rejected established institutions and values and sought spontaneity, etc., etc.  Also hippy

The final type of test idea is a hippiethesis - laid back, not too concerned with details, spontaneous and putting forward an idea because it looks good on paper.  "Let's test this because it's probably a good idea that will help improve site performance."  Not as bad as the "Test this!" that drives a HIPPOthesis, but not fully formed as a hypothesis, the hippiethesis is probably (and I'm guessing) the most common type of test.

Some examples of hippietheses:


"If we make the product images better, then we'll improve conversion."
"The data shows we need to fix our conversion funnel - let's make the buttons blue instead of yellow."
"Let's copy Amazon because everybody knows they're the best online."

There's the basis of a good idea somewhere in there, but it's not quite finished.  A hippiethesis will tell you that the lack of a good idea is not a problem, buddy, let's just test it - testing is cool (groovy?), man!  The results will be awesome.  

There's a laid-back approach to the test (either deliberate or accidental), where the idea has not been thought through - either because "You don't need all that science stuff", or because the evidence to support a test is very flimsy or even non-existent.  Perhaps the test analyst didn't look for the evidence; perhaps he couldn't find any.  Maybe the evidence is mostly there somewhere because everybody knows about it, but isn't actually documented.  The danger here is that when you (or somebody else) start to analyse the results, you won't recall what you were testing for, what the main idea was or which metrics to look at.  You'll end up analysing without purpose, trying to prove that the test was a good idea (and you'll have to do that before you can work out what it was that you were actually trying to prove in the first place).

The main difference between a hypothesis and a hippiethesis is the WHY.  Online testing is a science, and scientists are curious people who ask why.  Web analyst Avinash Kaushik calls it the "three levels of so what" test.  If you can't get to something meaningful and useful, or in this case, testable and measurable, within three iterations of "Why?" then you're on the wrong track.  Hippies don't bother with 'why' - that's too organised, formal and part of the system; instead, they'll test because they can, and because - as I said - testing is groovy.

A good hypothesis:  IF, THEN, BECAUSE.

To wrap up:  a good hypothesis needs three things:  If (I make this change to the site) Then (I will expect this metric to improve) because (of a change in visitor behaviour that is linked to the change I made, based on evidence).


When there's no if:  you aren't making a change to the site, you're just expecting things to happen by themselves.  Crazy!  If you reconsider my chocolate hypothesis, without the if, you're left with, "I will run faster and I will have more energy".  Alternatively, "More people will click and we'll sell more."  Not a very common attitude in testing, and more likely to be found in over-optimistic entrepreneurs :-)

When there's no then:  If I eat more chocolate, I will have more energy.  So what?  And how will I measure this increased energy?  There are no metrics here.  Am I going to measure my heart rate, blood pressure, blood sugar level or body temperature??  In an online environment:  will this improve conversion, revenue, bounce rate, exit rate, time on page, time on site or average number of pages per visit?  I could measure any one of these and 'prove' the hypothesis.  At its worst, a hypothesis without a 'then' would read as badly as, "If we make the CTA bigger, [then we will move more people to cart], [because] more people will click." which becomes "If we make the CTA bigger, more people will click."  That's not a hypothesis, that's starting to state the absurdly obvious.


When there's no because:  If I eat more chocolate, then I will run faster.  Why?  Why will I run faster?  Will I run slower?  How can I run even faster?  There are metrics here (speed) but there's no reason why.  The science is missing, and there's no way I can actually learn anything from this and improve.  I will execute a one-off experiment and get a result, but I will be none the wiser about how it happened.  Was it the sugar in the chocolate?  Or the caffeine?
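
The three gaps described above can be sketched as a small data structure.  This is only an illustration: the class name Hypothesis and its fields (change, metric, reason) are names I have made up for this sketch, and the check only spots parts that were left blank - it cannot tell whether the evidence behind the 'because' is any good.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """An 'If ... then ... because ...' test idea (illustrative structure only)."""
    change: str  # IF: the change you will make to the site
    metric: str  # THEN: the measurable outcome you expect to move
    reason: str  # BECAUSE: the evidence-based reason for the effect

    def missing_parts(self) -> list[str]:
        """Return the names of any parts that were left blank."""
        parts = {"if": self.change, "then": self.metric, "because": self.reason}
        return [name for name, text in parts.items() if not text.strip()]

    def statement(self) -> str:
        """Write the idea out as a single sentence."""
        return f"If {self.change}, then {self.metric}, because {self.reason}."


# The cart example from earlier in this post.
cart_test = Hypothesis(
    change="we make the cart call-to-action button bigger and move it above the fold",
    metric="more people will click through from cart to checkout",
    reason="the button will be more visible",
)

# A hippiethesis: there is a change and a metric, but no reason why.
hippiethesis = Hypothesis(
    change="we make the product images better",
    metric="we'll improve conversion",
    reason="",
)

print(cart_test.statement())
print(hippiethesis.missing_parts())  # ['because']
```

Writing the idea down in this shape before the test starts also gives you something to agree with the HIPPO in advance.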

And finally, I should reiterate that an idea for a test doesn't have to be detailed, but it must be backed up by data (some, even if it's not great).  The more evidence the better:  think of a sliding scale from no evidence (could be a terrible idea), through to some evidence (a usability review, or a survey response, prior test result or some click-path analysis), through to multiple sources of evidence all pointing the same way - not just one or two data points, but a comprehensive case for change.  You might even have enough evidence to make a go-do recommendation (and remember, it's a successful outcome if your evidence is strong enough to prompt the business to make a change without testing).

Wednesday 3 July 2013

Getting an Online Testing Program Off The Ground

One of the unplanned topics from one of my xChange 2013 huddles was how to get an online testing program up and running, and how to build its momentum.  We were discussing online testing more broadly, and this subject came up.  Getting a test program up and running is not easy, but during our discussion a few useful hints and tips emerged, and I wanted to add to them here.
Sometimes, launching a test program is like defying gravity.

Selling plain web analytics isn't easy, but once you have a reporting and analytics program up and running, and you're providing recommendations which are supported by data and seeing improvements in your site's performance, then the next step will probably be to propose and develop a test.  Why test?


On the one hand, if your ideas and recommendations are being wholeheartedly received by the website's management team, then you may never need to resort to a test.  If you can show with data (and other sources, such as survey responses or other voice-of-customer sources) that there's an issue on your site, and if you can use your reporting tools to show what the problem probably is - and then get the site changed based on your recommendations - and then see an improvement, then you don't need to test.  Just implement!

However, you may find that you have a recommendation, backed by data, that doesn't quite get universal approval. How would the conversation go?

"The data shows that this page needs to be fixed - the issue is here, and the survey responses I've looked at show that the page needs a bigger/smaller product image."
"Hmm, I'm not convinced."

"Well, how about we try testing it then?  If it wins, we can implement; if not, we can switch it off."
"How does that work, then?"


The ideal 'we love testing' management meeting.
This is idealised, I know.  But you get the idea, and then you can go on to explain the advantages of testing compared to having to implement and then roll back (when the sales figures go south).

The discussions we had during xChange showed that most testing programs were being initiated by the web analytics team - there were very few (or no) cases where management started the discussion or wanted to run a test.  As web professionals, supporting a team with sales and performance targets, we need to be able to use all the online tools available to us - including testing - so it's important that we know how to sell testing to management, and get the resources that it needs.  From management's perspective, analytics requires very little support or maintenance (compared to testing) - you tag the site (once, with occasional maintenance) and then pay any subscriptions to the web analytics provider, and pay for the staff (whether that's one member of staff or a small team).  Then - that's it.  No additional resource needed - no design, no specific IT, no JavaScript developers (except for the occasional tag change, maybe).  And every week, the mysterious combination of analyst plus tags produces a report showing how sales and traffic figures went up, down or sideways.

By contrast, testing requires considerable resource.  The design team will need to provide imagery and graphics, guidance on page design and so on.  The JavaScript developers will need to put mboxes (or the test code) around the test area; the web content team will also need to understand the changes and make them as necessary.  And that's just for one test.  If you're planning to build up a test program (and you will be, in time) then you'll need to have the support teams available more frequently.  So - what are the benefits of testing?  And how do you sell them to management, when they're looking at the list of resources that you're asking for?

1.  Testing provides the opportunity to test something that the business is already thinking of changing.  A change of banners?  A new page layout?  As an analyst, you'll need to be ahead of the change curve to do this, and aware of changes before they happen, but if you get the opportunity then propose to test a new design before it goes live.  This has the advantage of having most of the resource overhead already taken into account (you don't need to design the new banner/page) but it has one significant disadvantage:  you're likely to find that there's a major bias towards the new design, and management may just go ahead and implement anyway, even if the test shows negative results for it.


2.  A good track record of analytics wins will support your case for testing.  You don't have to go back to prior analysis or recommendations and be as direct as, "I told you so," but something like, "The changes made following my analysis and recommendations on the checkout pages have led to an improvement in sales conversion of x%." is likely to be more persuasive.  And this brings me neatly on to my next suggestion.

3.  Your main aim in selling testing is to ensure you can get the money for testing resources, and for implementation.  As I mentioned above, testing takes time, resource and manpower - or, to put it another way, money.  So you'll need to persuade the people who hold the money that testing is a worthwhile investment.  How?  By showing a potential return on that investment.

"My previous recommendation was implemented and achieved a £1k per week increase in revenue.  Additionally, if this test sees a 2% lift in conversion, that will be equal to £3k per week increase in revenue."
It's a bit of a gamble, as I've mentioned previously in discussing testing - you may not see a 2% lift in conversion, it may go flat or negative.  But the main focus for the web channel management is going to be money:  how can we use the site to make more money.  And the answer is: by improving the site.  And how do we know if we're improving the site? Because we're testing our ideas and showing that they're better than the previous version.

You do have the follow-up argument (if it does win), that, "If you don't implement this test win it will cost..." because there, you'll know exactly what the uplift is and you'll be able to present some useful financial data (assuming that yesterday's winner is not today's loser!).  Talk about £, $ or Euros... sometimes, it's the only language that management speak.
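
The arithmetic behind a projection like the one quoted above can be sketched in a few lines.  The assumptions are mine: revenue is taken to scale in direct proportion to conversion (traffic and average order value stay the same), and the baseline of £150k per week is simply the figure implied by "a 2% lift equals £3k per week" under that assumption.  The function name weekly_revenue_uplift is made up for this sketch.

```python
def weekly_revenue_uplift(baseline_weekly_revenue: float, relative_lift: float) -> float:
    """Projected change in weekly revenue for a relative lift in conversion.

    Assumes revenue moves in proportion to conversion, i.e. traffic and
    average order value are unchanged by the test recipe.
    """
    return baseline_weekly_revenue * relative_lift


# Assumed baseline: 3,000 / 0.02 = 150,000 per week (implied, not measured).
baseline = 150_000.0

# The test is a gamble, so look at a loss and a flat result as well as the win.
for lift in (-0.01, 0.0, 0.02):
    change = weekly_revenue_uplift(baseline, lift)
    print(f"{lift:+.0%} conversion -> {change:+,.0f} GBP per week")
```

Showing the negative and flat cases alongside the 2% case keeps the pitch honest about the gamble, and the same function gives you the "cost of not implementing" figure once you have a real, measured uplift.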


4.  Don't be afraid to carry out tests on the same part of a page.  I know I've covered this previously - but it reduces your testing overhead, and it also forces you to iterate.  It is possible to test the same part of a page without repeating yourself.  You will need to have a test program, because you'll be testing on the same part of a page, and you'll need to consult your previous tests (winners, losers and flat results) to make sure you don't repeat them.  And on the way, you'll have the chance to look at why a test won, or didn't, and try to improve.  That is iteration, and iteration is a key step from just testing to having a test program.

5.  Don't be afraid to start by testing small areas of a page.  Testing full-page redesigns is lengthy, laborious and risky.  You can get plenty of testing mileage out of testing completely different designs for a small part of a page - a banner, an image, wording... remember that testing is a management expense for the time being, not an investment, and you'll need to keep your overheads low and have good potential returns (either financial, or learning, but remember that management's primary language is money).


6.  Document everything!  As much as possible - especially if you're only doing one or two tests at a time.  Ask the code developers to explain what they've done, what worked, what issues they faced and how they overcame them.  It may be all code to you, but in a few months' time, when you're talking to a different developer who is not familiar with testing and test code, your documentation may be the only thing that keeps your testing program moving.

Also - and I've mentioned this before - document your test designs and your results.  Even if you're the only test analyst in your company, you'll need a reference library to work from, and one day, you might have a colleague or two and you'll need to show them what you've done before.

So, to wrap up - remember - it's not a problem if somebody agrees to implement a proposed test.  "No, we won't test that, we'll implement it straight away."  You made a compelling case for a change - subsequently, you (representing the data) and management (representing gut feeling and intuition) agreed on a course of action.  Wins all round.

Setting up a testing program and getting management involvement requires some sales technique, not just data and analysis, so it's often outside the analyst's usual comfort zone.  However, with the right approach to management (talk their language, show them the benefits) and a small but scalable approach to testing, you should - hopefully - be on the way to setting up a testing program.




Monday 24 June 2013

Iterating, Creating, Risk and Reward - Discussion

My second huddle at XChange 2013 Berlin looked at what to test, how to set up a testing program and how to get management buy-in.  We talked about the best way to get a test program set up, how to achieve critical mass and how to build momentum for an online testing program.  

I was intending to revisit some of the topics from my earlier post on creative versus iterative testing, but the discussion (as with my first huddle on yesterday's winner, today's loser) very quickly went off on a tangent and never looked back!

There are a number of issues in either starting or building a testing program - here are a few that we discussed:

Lack of management buy-in
Selling web analytics and reporting is not always easy, especially if you're working in (or with) a company that's largely focused on high-street bricks-and-mortar presence, or if the company is historically telephone or catalogue.  Trying to sell the idea of online testing can be very tricky indeed.  "Why should we test - we know what's best anyway!" is a common response, but the truth is that intuition is rarely right 100% of the time; here are a few counter-arguments that you may (or may not) want to try:

"Would you like to submit your own design to include in the test?"

"Could you suggest some other ideas for improving this banner/button/page?"
"Do you think there is a different way we could improve the page and reach/exceed our sales target?"

Another way of getting management (and other staff, colleagues and stakeholders) to engage with the test is to ask them to guess which recipe or design will win - and put their names to it.  If you can market this well, then very quickly, people will start asking how the test is going, and whether their design is winning.  Better still, if their design is losing, they'll probably want to know why, and might even start (1) interrogating the data and (2) designing a follow-up test.

As we commented during our discussion, it's worth saying that you may need to distinguish between a bad recipe and a good manager.  "Yes, you are still a good analyst or manager or designer, it's just that people didn't like your design."


Lack of resource
This could be a lack of IT support, design support or JavaScript developer time.  Almost all tests are dependent on some sort of IT and design support (although I have heard of analysts and testers testing their own Photoshop creative).  It's difficult - as we'll see below - because without design support, you are restricted in what you can test.  However, there are a number of test areas that you can work on which are light on design, are light on code maintenance, and which could potentially show useful (and even positive) test results.

 - banner imagery - to include having people or no people; a picture of the product or no product
 - banner wording - buy-one-get-one-free, or two-for-one, or 50% off?  Or maybe even 'Half price'?  Wording will probably require even less design work than imagery, and you (as the tester, or analyst) may even be able to set this one up yourself.
 - calls to action - Continue?  Add to cart? Add to basket?  Select? Make payment?  This site has a huge gallery of continue shopping buttons, (when a customer has added an item to basket, and you want to persuade them to keep shopping).  There are some suggestions on which may work best - and they don't even change the wording.  There are many other things to try - colour; arrow or no arrow; CAPITALS or Initial Capitals? 

The advantage of these tests is that they can be carried out on the same area of the same page - once the test code has been inserted by the IT or JavaScript teams, you can set up a series of tests just by changing the creative that is being tried.  Many of those in the huddle said that once they had obtained a winner, they would then push that to 100% traffic through the testing software until the next test was ready - further reducing the dependency on IT support.

How to sell flat results

There is nothing worse for an analyst or tester than to find out that the test results are flat (there's no significant difference in the performance of the test recipes - all the results were the same).  The test has taken months to sell, weeks to design and code, and a few weeks to run, and the results say that there's no difference between the original version (which may have had management backing) and your new analytics-backed version.  And what do you get?  "You said that online testing would improve our performance by 2%, 5%, 7.5%..."

Actually, the results only appear to say there's no difference... so it's time to do some digging!

Firstly, was the difference between the two test recipes large enough and distinct enough?  One member of the huddle quoted the Eisenberg brothers: "If you ask people if they prefer green apples or red apples, you're unlikely to get a difference.  If you ask them if they prefer apples or chocolate, you'll see a result."



This is something to consider before the test - are the recipes different enough?  It's not always easy to say in advance (!) and there is a greater risk of the test recipe losing if the design is too different, but that's the point - iterating is 'safer' than creating, but does include the possibility that it may go flat.  How much risk you're prepared to take may depend on external factors such as how much design resource you can obtain and how important it is to get a non-zero result.

Secondly:  analysing flat results will require some concerted data analysis.  Overall, the number of orders for the two recipes, and the average order value were the same...

But how many people clicked on the new banner?  Or how many people bounced or exited from the test page?

Did you get more people to click on your new call-to-action button - and then those people left at the next page? Why?

Did the banner work better for higher-value customers, who then left on the next page because the item they were actually looking for wasn't featured?  Did all visitor segments behave in the same way?


Was there a disconnect between the call to action and the next page?  Was the next page really what people would have expected?  


Did you offer a 50%-off deal but then not make it clear in the checkout process?  It's human nature to study and review a test loss, to accept a win without too much study and to completely write off a flat result, but by applying the same level of rigour to a 'flat' result as to a loss, it's still possible to learn something valuable.
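
One way to start the digging is to break the same results down by visitor segment.  The sketch below uses made-up visit and order counts (not from any real test), with segment names of my own choosing, to show how two segments moving in opposite directions can cancel out into a flat headline number.

```python
# Hypothetical (visits, orders) per (segment, recipe); illustrative numbers only.
results = {
    ("new", "control"): (5000, 100),
    ("new", "test"): (5000, 125),
    ("returning", "control"): (5000, 200),
    ("returning", "test"): (5000, 175),
}


def rate(segment: str, recipe: str) -> float:
    """Conversion rate for one segment seeing one recipe."""
    visits, orders = results[(segment, recipe)]
    return orders / visits


def overall(recipe: str) -> float:
    """Conversion rate for a recipe across all segments combined."""
    visits = sum(v for (_, r), (v, _) in results.items() if r == recipe)
    orders = sum(o for (_, r), (_, o) in results.items() if r == recipe)
    return orders / visits


# The headline result looks flat...
print(f"overall: control {overall('control'):.1%}, test {overall('test'):.1%}")

# ...but the segments tell different stories.
for segment in ("new", "returning"):
    print(f"{segment}: control {rate(segment, 'control'):.1%}, "
          f"test {rate(segment, 'test'):.1%}")
```

A word of caution: the more segments you slice into, the smaller each group gets and the more likely it is that a difference appears by chance, so treat a segment finding as an idea for a follow-up test rather than a result in its own right.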

How do you set up a testing program?
We discussed how managers and clients generally prefer to start a testing program in the checkout process - it's a nice, easy, linear funnel with plenty of potential for optimisation, and it's very close to the money.  If you improve a checkout page, then the financial metrics will automatically improve as a result.

But how do you test in the product description pages, where visitors browse around before selecting an item?  We talked about page purpose:  what is the idea of a page?  What's the main action that you want a user to take after they have seen this page?  Is it to complete a lead generation form?  Is it to call the sales telephone line?  Is it to 'add to cart'?  The success metric for the page should be the key success metric for the test.  You'll need to keep an eye on the end-of-funnel metrics (conversion, order value, and so on) but providing those are flat or trending positively, then you can use the page-purpose metrics to measure the success of your test.  If you're tracking an offline conversion (call the sales line, for example) then you'll need to do some extra preparatory work, for example by setting up one telephone line per recipe and then arranging to track the volumes of telephone calls - but it'll make the test result more useful.

Tracking page-purpose success metrics will also enable you to run tests more quickly.  If you can see a definite, confident lift in a page-purpose metric, while the overall financial metrics are flat or positive, then you can call a winner before you reach confidence in the overall metrics.  The further you are from the checkout process (and the final order page), the longer it is likely to take for an uplift in page performance to filter through to the financial results (in terms of testing time), but you can be happy that you are improving your customers' experience.
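
As a rough sketch of what "a confident lift in the page-purpose metric while the financial metrics are flat or positive" might look like in numbers, here is a standard two-proportion z-test using only the Python standard library.  The visit, click and order counts are made up, the 95% confidence threshold is my assumption, and your testing tool may well calculate confidence differently.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value); a positive z means recipe B converted better than A.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 10,000 visits per recipe.
# Page-purpose metric: 'add to cart' clicks on the product page.
z_click, p_click = two_proportion_z_test(1000, 10_000, 1150, 10_000)
# End-of-funnel metric: orders.
z_order, p_order = two_proportion_z_test(300, 10_000, 310, 10_000)

ALPHA = 0.05  # assumed threshold, i.e. 95% confidence

page_metric_wins = z_click > 0 and p_click < ALPHA
# "Flat or positive": the test recipe is not significantly worse on orders.
funnel_not_worse = z_order >= 0 or p_order >= ALPHA

print(f"add to cart: z = {z_click:.2f}, p = {p_click:.4f}")
print(f"orders:      z = {z_order:.2f}, p = {p_order:.4f}")
print("call a winner" if page_metric_wins and funnel_not_worse else "keep testing")
```

With these made-up figures the add-to-cart lift is clearly significant while orders are slightly up but nowhere near significant, which is the situation described above.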


Documentation

Another valuable way of helping to build a testing program, and enabling it to develop, is to document your tests.  When a test is completed, you'll probably be presenting to the management and the stakeholders - this is also a great opportunity to present to the people who contributed to your test: the designers, the developers, IT and so on.  This applies especially if the test is a winner!

When the presentation is completed, file the results deck on a network drive, or somewhere which is widely accessible.  Start to build up a list of test recipes, results and findings.  We discussed if this is a worthwhile exercise - it's time-consuming, laborious and if there's only one analyst working on the test program, it seems unnecessary.

However, this has a number of benefits:

- you can start to iterate on previous tests (winners, losers and flat results), and this means that future tests are more likely to be successful ("We did this three weeks ago and the results were good, let's try to make them even better")

- you can avoid repeating tests, which is a waste of time, resource and energy ("We did this two months ago and the results were negative")

- you can start to understand your customers' behaviour and design new tests (based on the data) which are more likely to win.  ("This test showed our visitors preferred this... therefore I suspect they will also prefer this...")

It's also useful when and if the team starts to grow (which is a positive result of a growing testing program) as you can share all the previous learnings.  
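
A results deck on a network drive is a good start, but even a very simple structured log makes the "have we tested this before?" question quicker to answer.  The sketch below is one possible shape for such a log: the record fields, the example entries and the previous_tests helper are all hypothetical, so adapt them to whatever you actually record.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TestRecord:
    """One completed test (field names are illustrative, not a standard)."""
    name: str
    page_area: str   # where on the site the test ran
    finished: date
    outcome: str     # "win", "loss" or "flat"
    finding: str     # what you learned, in one sentence


# Hypothetical entries, for illustration only.
test_log = [
    TestRecord("Banner with people", "home banner", date(2013, 3, 4), "win",
               "Imagery showing people beat product-only imagery."),
    TestRecord("Half price wording", "home banner", date(2013, 4, 15), "flat",
               "No difference between '50% off' and 'Half price'."),
    TestRecord("Bigger checkout button", "cart CTA", date(2013, 5, 20), "loss",
               "A larger button pushed the order total below the fold."),
]


def previous_tests(log: list[TestRecord], page_area: str) -> list[TestRecord]:
    """All earlier tests on the same part of the page, oldest first."""
    matches = [record for record in log if record.page_area == page_area]
    return sorted(matches, key=lambda record: record.finished)


# Before proposing a new banner test, check what has already been tried.
for record in previous_tests(test_log, "home banner"):
    print(f"{record.finished} {record.name}: {record.outcome} - {record.finding}")
```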

These benefits will help the testing program gain momentum, so that you can start iterating and spend less time repeating yourself.  Hopefully, you'll find that you have fewer meetings where you have to sell the idea of testing - you can point back at prior wins and say to the management, "Look, this worked and achieved 3% lift," and, if you're feeling brave, "And look, you said this recipe would win and it was 5% below the control recipe!"

The discussion ran for 90 minutes, and we discussed even more than this... I just wish I'd been able to write it all down.  I'd like to thank all the huddle participants, who made this a very interesting and enjoyable huddle!






Wednesday 19 June 2013

Why is yesterday's test winner today's loser?

This post comes out of the xChange Berlin huddle which I led on 11 June 2013.  xChange is very different from most web analytics conferences - most conferences have speakers and presentations, but xChange is focused around web analytics professionals meeting and discussing in small workshop groups.  As the xChange website describes it:
"Expressly designed for enterprise analytics managers and digital marketing and measurement practitioners, X Change brings together top professionals in the field in a no-sales, all business, peer-to-peer environment for deep-dives into cutting edge online measurement topics."

At xChange Berlin 2013, I led two huddle groups - this was the first, entitled, "Why is yesterday's test winner today's loser?".  I haven't attributed the content here to any particular participant - this is just a summary of our discussions.  I should say now that the discussion was not even close to what I'd anticipated, but was even more interesting as a result!


The discussion kicked off with a review of a test win.  
Let's suppose that you have run your A/B test, and you have a winner.  You ran it for long enough to achieve statistical significance and even achieved consistent trend lines.  But somehow, when you implemented it, your financial metrics didn't show the same level of improvement as your test results.  And now, the boss has come to your desk to ask if your test was really valid.  "What happened?  Why is yesterday's test winner today's loser?"

There are a number of reasons for this - let's take a look.

External factors
Yes, A/B tests split your traffic evenly between the test recipes, so that most external factors are accounted for.  But what happens if your test was running while you had a large-scale TV campaign, or display or PPC campaign?  Yes, that traffic would have been split between your test recipes, so the effect is - apparently - mitigated.  But what if the advertising campaign resonated with your test recipe, which went on to win?  During the non-campaign period, the control recipe might have performed better, or perhaps the results would have been more similar.  Consequently, the uplift that you saw during the test would not be achieved in normal conditions.

Customer Experience Changes
When we start a test, there is quite often a dip in performance for the test recipe.  It's new.  It's unfamiliar and users have to become accustomed to it.  It often takes a week or so for visitors to get used to it, and for accurate, meaningful and useful test results to develop.  In particular, frequent repeat visitors will take some time to adjust to the changes (how often repeat visitors return will depend on your site).  The same issue applies when you implement a winner - now, the whole population is seeing a new design, and it will take some time for them to adjust.

Visitor Segments
Perhaps the test recipe worked especially well with a particular visitor segment?  Maybe new visitors, or search visitors, or visitors from social media, and that was responsible for the uplift.  You have assumed (one way or another) that your population profile is fairly constant.  But if you identify that your test recipe won because one or two segments really engaged with it, then you may not see the uplift if your population profile changes.  What should you do instead?  Set up a targeting implementation: target specific visitors, based on your test results, who engaged more (or converted better) with the test recipe.  Show everybody else the same version of your site as usual, but for visitors who fit into a specific segment - show them the test recipe.  I'll discuss targeting again at a later date, but here's a post I wrote a few months ago about online personalisation.
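The targeting idea can be sketched in a few lines.  This is only an illustration, not how any particular testing tool does it: the segment names and the choose_recipe function are hypothetical, and in practice the winning segments would come from your own test results.

```python
# Hypothetical segments that engaged more with the test recipe during the test.
WINNING_SEGMENTS = {"new_visitor", "social_media"}

def choose_recipe(visitor_segments):
    """Return 'test' if the visitor belongs to any winning segment, else 'control'."""
    if WINNING_SEGMENTS & set(visitor_segments):
        return "test"
    return "control"

# A new visitor from paid search gets the test recipe; a repeat visitor does not.
print(choose_recipe(["new_visitor", "paid_search"]))
print(choose_recipe(["repeat_visitor"]))
```

Everybody outside the winning segments keeps seeing the usual version of the site, so a change in your population profile only affects how many visitors receive the test recipe.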

Time lapse between test win and implementation
This varied among the members of the group - where a company has a test plan, and there's a need to get a test up and running, it may not be possible to implement straight away.  It also depends on what's being tested - can the test recipe be implemented immediately through the site team or CMS, or will it require IT roadmap work?  Most of the group would use the testing software (for example, Test and Target, or Visual Website Optimiser) to set a winning recipe to 100% traffic (or 95%) immediately, until the change could be made permanently.  Setting a winning recipe to 95% instead of 100% in effect enables the test to run for longer - you can continue to show that the test recipe is winning.  It also means that visitors who were in the control group during the test (i.e. saw "Recipe A") will continue to see that recipe until the implementation is complete - better customer experience for that group?  Something to think about!
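Here's a minimal sketch of how a 95/5 split could keep the remaining control visitors on the recipe they already know.  It assumes the allocation is done by hashing a visitor ID into one of 100 buckets; that is an assumption for illustration, not a description of how Test and Target or Visual Website Optimiser work internally, and the function names are made up.

```python
import hashlib

def bucket(visitor_id, buckets=100):
    """Map a visitor ID to a stable bucket number from 0 to buckets - 1."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def assign_recipe(visitor_id, control_percent):
    """Visitors in the lowest buckets see the control (Recipe A); everyone else sees Recipe B."""
    return "A" if bucket(visitor_id) < control_percent else "B"

visitor = "visitor-12345"  # hypothetical visitor ID
print(assign_recipe(visitor, control_percent=50))  # allocation during the 50/50 test
print(assign_recipe(visitor, control_percent=5))   # allocation after the win
```

Because the bucket for a visitor never changes, anybody still seeing Recipe A at 5% was also seeing Recipe A during the 50/50 test; the rest of the original control group moves over to the winner.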

My next post will be about the second huddle that I led, which was based on iterating vs creating.  The title came from my recent blog post on iterative testing, but the discussion went in a very different direction, and again, was better for it!



Friday 17 May 2013

A/B testing - how long to test for?


So, your test is up and running!  You've identified where to test and what to test, and you are now successfully splitting traffic between your test recipes.  How long do you keep the test running, and when do you call a winner?  You've heard about statistical significance and confidence, but what does it actually mean?

Anil Batra has recently posted on the subject of statistical significance, and I'll be coming to his article later, but for now, I'd like to begin with an analogy.




Let us suppose that two car manufacturers, Red-Top and Blue-Bottle have each been working on a new car design for the Formula 1 season, and each manufacturer believes that their car is the fastest at track racing.  The solution to this debate seems easy enough - put them against each other, side-by-side - one lap of a circuit, first one back wins.  However, neither team is particularly happy with this idea - there's discussion of optimum racing line, getting the apex of the bends right, and different acceleration profiles.  It's not going to be workable.


Some bright scientist suggests a time trial:  one lap, taken by each car (one after the other) and the quickest time wins.  This works, up to a point.  After all, the original question was, "Which car is the fastest for track racing?" and not, "Which car can go from a standing start to complete a lap quickest?" and there's a difference between the two.  Eventually, everybody comes to an agreement:  the cars will race and race until one of them has a clear overall lead - 10 seconds (for example), at the end of a lap.  For the sake of  this analogy, the cars can start at two different points on the circuit, to avoid any of the racing line issues that we mentioned before.  We're also going to ignore the need to stop for fuel or new tyres, and any difference in the drivers' ability - it's just about the cars.  The two cars will keep racing until there is a winner (a lead of 10 seconds) or until the adjudicators agree that neither car will accrue an advantage that large.


So, the two cars set off from their points on the circuit, and begin racing.  The Red-Top car accelerates very quickly from the standing start, and soon has a 1-second lead on the Blue-Bottle.  However, the Blue-Bottle has better brakes which enable it to corner better, and after 20 laps there's nothing in it.  The Blue-Bottle continues to show improved performance, and after 45 laps, the Blue-Bottle has built a lead of 6.0 seconds.  However, the weather changes from sunny to overcast and cloudy, and the Blue-Bottle is unable to extend its lead over the next 15 laps.  The adjudicators call it a day after 60 laps total.

So, who won?

There are various ways of analysing and presenting the data, but let's take a look at the data and work from there.  The raw data for this analysis is here:  Racing Car Statistical Significance Spreadsheet.


 This first graph shows the lap times for each of the 60 laps:


This first graph tells the same story as the paragraphs above:  laps 1-20 show no overall lead for either car; the blue car is faster from laps 20-45, then from laps 45-60 neither car gains a consistent advantage.  This second graph shows the cumulative difference between the performance of the two cars.  It's not one that's often shown in online testing tools, but it's a useful way of showing which car is winning.  If the red car is winning, then the time difference is negative; if the blue car is ahead, the time difference is positive, and the size of the lead is measured in seconds.
Graph 3, below, is a graph that you will often see (or produce) from online testing tools.  It's the cumulative average report - in this case, cumulative average lap time.  After each lap, the overall average lap time is calculated for all the laps that have been completed so far.  Sometimes called performance 'trend lines', these show at a glance a summary of which car has been winning, which car is winning now, and by how much.  Again, to go back to the original story, we can see how for the first 20 laps, the red car is winning; at 20 laps, the red and blue lines cross (indicating a change in the lead, from red to blue); from laps 20 to 45 we see the gap between the two lines widening, and then how they are broadly parallel from laps 45 to 60.
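If you'd like to build the cumulative difference and cumulative average views yourself, here is a rough sketch.  The lap times in it are made up for illustration (they are not the spreadsheet data), and the function names are my own.

```python
from itertools import accumulate

# Made-up lap times in seconds, for illustration only.
red = [102.1, 102.2, 102.4, 102.3]
blue = [102.3, 102.3, 102.2, 102.1]

def cumulative_difference(red_laps, blue_laps):
    """Running lead in seconds after each lap: positive means blue is ahead, negative means red."""
    return list(accumulate(r - b for r, b in zip(red_laps, blue_laps)))

def cumulative_average(laps):
    """Average lap time over all laps completed so far, recalculated after each lap."""
    return [total / n for n, total in enumerate(accumulate(laps), start=1)]

print(cumulative_difference(red, blue))
print(cumulative_average(red))
print(cumulative_average(blue))
```

The cumulative difference is the total time the red car has taken minus the total time the blue car has taken, which matches the sign convention described above.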
So far, so good.  Graph 4, below, shows the distribution of lap times for the two cars.  This is rarely seen in online testing tools, and looks better suited to the maths classroom.  With this graph, it's not possible to see who was winning, when, but it's possible to see who was winning at the end.  This graph, importantly, shows the difference in performance in a way which can be analysed mathematically to show not only which car was winning, but how confident we can be that it was a genuine win, and not a fluke.  We can do this by looking at the average (mean) lap time for each car, and also at the spread of lap times.
This isn't going to become a major mathematical treatment, because I'm saving that for next time :-)  However, you can see here that on the whole, the blue car's lap times are faster (the blue peak is to the left, indicating a larger number of shorter lap times) but are slightly more spread out - the blue car has both the fastest and slowest times.

The maths results are as follows:
Overall -
Red:
Average (mean) = 102.32 seconds
Standard deviation (measure of spread) = 0.21 seconds

Blue:
Average (mean) = 102.22 seconds (0.1 seconds faster per lap)
Standard deviation = 0.28 seconds (lap times are spread more widely)

Mathematically, if the average times for the cars are two or more standard deviations apart, then we can say with roughly 95% confidence that the results are significant (i.e. are not due to noise, fluke or random chance).  In this case, the results are only around half a standard deviation apart, so it's not possible to say that either car is really a winner.
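For anybody who does want to get the calculator out, the rule of thumb can be sketched like this.  It takes two lists of lap times, works out the mean and standard deviation of each, and checks whether the means are at least two standard deviations apart.  Comparing against the larger of the two standard deviations is my assumption (the more cautious choice), and a formal significance test would also take the number of laps into account.

```python
from statistics import mean, stdev

def summarise(laps):
    """Return the mean and the sample standard deviation of a list of lap times."""
    return mean(laps), stdev(laps)

def passes_rule_of_thumb(laps_a, laps_b, threshold=2.0):
    """True if the two means are at least `threshold` standard deviations apart.

    Assumption: the gap is compared against the larger of the two standard deviations.
    """
    mean_a, sd_a = summarise(laps_a)
    mean_b, sd_b = summarise(laps_b)
    gap = abs(mean_a - mean_b)
    return gap >= threshold * max(sd_a, sd_b)

# Made-up lap times in seconds, for illustration only.
print(passes_rule_of_thumb([102.3, 102.4, 102.2, 102.3], [102.0, 102.1, 101.9, 102.0]))
```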


But hang on, the blue car was definitely winning after 60 laps.  The reason for this is its performance between laps 20 and 45, when it was consistently building a lead over the red car (before the weather changed, in our story).  Let's take a look at the distribution of results for these 26 laps:

A very different story emerges.  The times for both cars have a much smaller spread, and the peak for the blue distribution is much sharper (in English, the blue car's performance was much more consistent from lap to lap).  Here are the stats for this section of the race:

Red:
Average (mean) = 102.31 seconds
Standard deviation (measure of spread) = 0.08 seconds

Blue:
Average (mean) = 102.08 seconds (0.23 seconds faster per lap)
Standard deviation = 0.11 seconds (lap times have a narrower distribution)

We can now see how the Blue car won; over the space of 26 laps, it was faster, and more consistently faster too.  The difference between the two averages = 102.31 - 102.08 = 0.23 seconds, and this is over twice the standard deviation for the blue car (0.11 x 2 = 0.22).  Fortunately, most online testing tools will give you a measure of the confidence in your data, so you won't have to get your spreadsheet or calculator out and start calculating standard deviations manually.
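As a quick check of the arithmetic, this sketch expresses the gap between the two averages as a multiple of the blue car's standard deviation, using the figures quoted above for the whole race and for laps 20 to 45.  The function name is my own.

```python
def sd_multiples(mean_a, mean_b, sd):
    """How many standard deviations apart the two means are."""
    return abs(mean_a - mean_b) / sd

# Figures quoted in this post, measured against the blue car's standard deviation.
whole_race = sd_multiples(102.32, 102.22, 0.28)
laps_20_to_45 = sd_multiples(102.31, 102.08, 0.11)

print(round(whole_race, 2))     # well below 2, so no clear winner
print(round(laps_20_to_45, 2))  # just above 2
```

Measured this way, the whole-race gap falls well short of two standard deviations, while the gap for laps 20 to 45 just clears it.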


Now, here's the question:  are you prepared to call the Blue car a clear winner, based on just part of the data?

Think about this in terms of the performance of an online test between two recipes, Blue and Red.  Would you have called the Red recipe a winner after 10-15 days/laps?  In the same way as a car and driver need time to settle down into a race (acceleration etc), your website visitors will certainly need time to adjust to a new design (especially if you have a high proportion of repeat visitors).  How long?  It depends :-)

In the story, the Red car had better acceleration from the start, but the Blue car had better brakes.  Maybe one of your test recipes is more appealing to first time visitors, but the other works better for repeat visitors, or another segment of your traffic.  Maybe you launched the test on a Monday, and one recipe works better on weekends?

So why did the cars perform differently between laps 20-45 and 45-60?  Laps 20-45 are 'normal' conditions, whereas after lap 45, something changed, and in the racing car story, it was due to the weather.  In the online environment, it could be a marketing campaign that you just launched, or your competitors launched.  Maybe a new product, or the start of a national holiday, school holiday, or similar?  From that point onward, the performance of the Blue recipe was comparable or identical to the Red.


So, who won?  The Blue car, since its performance in normal conditions was better.  It took time to settle down, but in a normal environment, it's 0.23 seconds faster per lap, with 99+% confidence.  Would you deploy the equivalent Blue recipe in an online environment, or do you think it's cheating to deploy a winner that is better only during normal conditions, and is just comparable to the Red recipe during campaign periods?  :-)

Let's take a look at Anil Batra's post on testing and significance.  It's a much briefer article than mine (I apologise for the length, and thank you for your patience), but it does explain that you shouldn't stop a test too early.  The question that many people ask is - how long do you let it run for?  And how do you know when you've got a winner (or whether everything is turning flat)?  The short article has a very valid point:  don't stop too soon!

Next time - a closer, mathematical look at standard deviations, means and distributions, and how they can help identify a winner with confidence!  In the meantime, if you're looking for a more mathematical treatment, I recommend this one from the Online Marketing Tests blog.

Tuesday 14 May 2013

Web Analytics and Testing: Summary so far

It's hard to believe that it's two years since I posted my first blog post on web analytics.  I'd decided to take the step of sharing a solution I'd found to a question I'd once been asked by a senior manager:  "Show me all the pages on our site which aren't getting any traffic."  It's a good question, but not one that's easy to answer, and as it happened, it was a real puzzler for me at the time, and I couldn't come up with the answer quickly enough.  Before I could devise the answer, we were already moving on to the next project.  But I did find an answer (although we never implemented it), and thought about how to share it.

Nevertheless, I decided to blog about my solution, and my first blog post was received kindly by the online community, and so I started writing more around web analytics - sporadically, to be sure - and covering online testing, which is my real area of interest.


Here's a summary of the web analytics and online testing posts that I've written over the last two years.

Pages with Zero Traffic

Here's where it all started, back in May 2011, with the problem I outlined above.  How can you identify which pages on your site aren't getting traffic, when the only tools you have are tag-based (or server-log-based), and which only fire when they are visited?

Web Analytics - Reporting, Forecasting, Testing and Analysing
What do these different terms mean in web analytics?  What's the difference between them - aren't they just the same thing?

Web Analytics - Experimenting to Test a Hypothesis
My first post dedicated entirely to testing - my main online interest.  It's okay to test - in fact, it's a great idea - but you need to know why you're testing, and what you hope to achieve from the test.  This is an introduction to testing, discussing what the point of testing should be.


Web Analytics - Who determines an actionable insight?
The drive in analytics is for actionable insights:  "The data shows this, this and this, so we should make this change on our site to improve performance."  The insight is what the data shows; the actionable part is the "we should make this change".  If you're the analyst, you may think you decide what's actionable or not, but do you?  This is a discussion around the limitations of actionability, and a reminder to focus your analysis on things that really can be actionable.

Web Analytics - What makes testing iterative?
What does iterative testing mean?  Can't you just test anything, and implement it if it wins?  Isn't all testing iterative?  This article looks at what iteration means, and how to become more successful at testing (or at least learn more) by thinking about testing as a consecutive series, not a large number of disconnected one-off events.

A/B testing - A Beginning
The basic principles of A/B testing - since I've been talking about it for some time, here's an explanation of what it does and how it works.  A convenient place to start from when going on to the next topic...


Intro To Multi Variate Testing
...and the differences between MVT and A/B.

Multi-Variate Testing
Multi Variate Testing - MVT  - is a more complicated but powerful way of optimising the online experience, by changing a multitude of variables in one go.  I use a few examples to explain how it works, and how multiple variables can be changed in one test, and still provide meaningful results.  I also discuss the range of tools available in the market at the moment, and the potential drawbacks of not doing MVT correctly.

Web Analytics:  Who holds the steering wheel?
This post was inspired by a video presentation from the Omniture (Adobe) EMEA Summit in 2011.  It showed how web analytics could power your website into the future, at high speed and with great performance, like a Formula 1 racing car.  My question in response was, "Who holds the steering wheel?" I discuss how it's possible to propose improvements to a site by looking at the data and demonstrating what the uplift could be, but how it all comes down to the driver, who provides the direction and, also importantly, has his foot on the brake.

Web Analytics:  A Medical Emergency

This post starts with a discussion about a medical emergency (based on the UK TV series 'Casualty') and looks at how we, as web analysts, provide stats and KPIs to our stakeholders and managers.  Do we provide a medical readout, where all the metrics are understood by both sides (blood pressure, temperature, pulse rate...) or are we constantly finding new and wonderful metrics which aren't clearly understood and are not actionable?  If you only had 10 seconds to provide the week's KPIs to your web manager, would you be able to do it?  Which would you select, and why?

Web Analytics:  Bounce Rate Issues
Bounce rate (the number of people who exit your site after loading just one page, divided by all the people who landed on that page) is a useful but dangerous measure of page performance.  What's the target bounce rate for a page?  Does it have one?  Does it vary by segment (where is the traffic coming from? Do you have the search term?  Is it paid search or natural?)?  Whose fault is it if the bounce rate gets worse?  Why?  It's a hotly debated topic, with marketing and web content teams pointing the finger at each other.  So, whose fault is it, and how can the situation be improved?

Why are your pages getting no traffic?

Having discussed a few months earlier how to identify which pages aren't getting any traffic, this is the follow-up - why aren't your pages getting traffic?  I look at potential reasons - on-site and off-site, and technical (did somebody forget to tag the new campaign page?).

A beginner's social media strategy

Not strictly web analytics or testing, but a one-off foray into social media strategy.  It's like testing - make sure you know what the plan is before you start, or you're unlikely to be successful!

The Emerging Role of the Analyst
A post I wrote specifically for another site - hosted on my blog, but with reciprocal links to a central site where other bloggers share their thoughts on how Web Analytics, and Web Analysts in particular, are becoming more important in e-commerce.

MVT:  A simplified explanation of complex interactions


Multi Variate Testing involves making changes to a number of parts of a page, and then testing the overall result.  Each part can have two or more different versions, and this makes the maths complicated.  An additional issue occurs when one version of one part of a page interacts (either supports or negates) with another part of the page.  Sometimes there's a positive reinforcement, where the two parts work together well, either by echoing the same sales sentiment or by both showing the same product, or whatever.  Sometimes, there's a disconnect between one part and another (e.g. a headline and a picture may not work well together).  This is called an interaction - where one variable reacts with another - and I explain this in more detail.
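A tiny worked example may help here.  The sketch below uses made-up conversion rates for a two-headline, two-image test (the names and numbers are hypothetical) and measures the interaction as the change in the headline's effect when the image changes.

```python
# Hypothetical conversion rates for each combination of headline and image.
rates = {
    ("headline_1", "image_1"): 0.020,
    ("headline_2", "image_1"): 0.024,
    ("headline_1", "image_2"): 0.023,
    ("headline_2", "image_2"): 0.021,
}

def headline_effect(rates, image):
    """Change in conversion rate from switching headline_1 to headline_2, for a given image."""
    return rates[("headline_2", image)] - rates[("headline_1", image)]

def interaction(rates):
    """How much the headline's effect changes between image_1 and image_2 (zero means none)."""
    return headline_effect(rates, "image_2") - headline_effect(rates, "image_1")

print(headline_effect(rates, "image_1"))  # headline_2 helps alongside image_1
print(headline_effect(rates, "image_2"))  # headline_2 hurts alongside image_2
print(interaction(rates))
```

In this made-up data the second headline works well with the first image but not with the second, which is the kind of disconnect described above.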


Too Big Data

Too big to be useful?  To be informative?  It's one thing to collect a user's name, address, blood type, inside leg measurement and eye colour, but what's the point?  It all comes back to one thing:  actionable insights.

Personalisation
The current online political topic:  how much information are web analysts and marketers allowed to collect and use?  I start with an offline parallel and then discuss whether we're becoming overly paranoid about online data collection.

What is Direct Traffic?

After a year of not blogging about web analytics (it was a busy year), I return with an article about a topic I have thought about for a long time.  Direct traffic is described by some people as some of the best traffic you can get, but my experiences have taught me that it can be very different from the 'success of offline or word-of-mouth marketing'.  In fact, it can totally ruin your analysis - here's my view.

Testing - Iterating or Creating?
Having mentioned iterative testing before, I write here about the difference between planned iterative testing, and planned creative testing.  I explain the potential risks and rewards of creative testing (trying something completely new) versus the smaller risks and rewards of iterative testing (improving on something you tested before).



And finally...

A/B testing - where to test
This will form part of a series - I've looked at why we test, and now this is where.  I'll also be looking at how long to test for, and what to test next!


It's been a very exciting two years... and I'm looking forward to learning and then writing more about testing and analytics in the future!

Monday 15 April 2013

A/B Testing: Where to Test?

You've bought the software, you've even read the manual and a few books or blogs about testing, and now you're ready to test.  Last time, I discussed how to design your test, and in this post, I'd like to look at where to test.  Which pages are you going to test on?  There's no denying that some tests are easier to build, develop and write the code for, and some pages will be trickier (especially if they're behind secure firewalls or if the page is largely hard-coded with little scope for inserting JavaScript), but there's definitely a group of pages that are good for testing.

Why?  Because an improvement in the financial performance of some of the key pages of your site will have a dramatic impact on the overall performance of your site.

Here are a few good examples of places where testing is likely to be financially productive:

1.  Landing pages with a high bounce rate


Bounce rate is defined as the number of people who land on your site and then click away without visiting any other pages, divided by the total number who landed.  More technically, it's the number of single-page-visits divided by the total number of entries.  Landing pages - especially your home page or a campaign landing page - are some of the most highly trafficked pages on your site.  For this reason, small improvements in bounce rate or in click-through rates on landing page calls to action will help to move your financials.  In particular, if your cost per acquisition is high, or the page has a high entrance rate combined with a high bounce rate, then improving page performance here will help improve your financial figures.
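As a rough sketch, the bounce rate calculation and a simple way of shortlisting landing pages might look like this.  The page names and figures are hypothetical, and ranking by the number of bounced visits is just one reasonable way of combining high entries with a high bounce rate.

```python
def bounce_rate(single_page_visits, entries):
    """Single-page visits divided by total entries, as a fraction between 0 and 1."""
    if entries == 0:
        return 0.0
    return single_page_visits / entries

# Hypothetical landing pages: (name, entries, single-page visits).
landing_pages = [
    ("home", 50000, 20000),
    ("spring-campaign", 12000, 9000),
    ("blog-article", 800, 700),
]

def rank_by_bounced_visits(pages):
    """Put the pages losing the most visitors (high entries and high bounce rate) first."""
    return sorted(pages, key=lambda page: page[2], reverse=True)

for name, entries, bounces in rank_by_bounced_visits(landing_pages):
    print(name, entries, round(bounce_rate(bounces, entries), 2))
```

Note that the blog article has the highest bounce rate in this made-up data, but it comes last in the ranking because so few visitors land there.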

2.  Leaky funnels 

If you have a linear payment process (and who doesn't?) then you can monitor page-to-page conversion in a linear way.  If one page is "leaking" - i.e. people are leaving when they reach that particular page, then that's a definite area to look at.  Revisit the page yourself, and generate some ideas to help improve the page's performance.  Why are people leaving?  What's missing?  What's getting in the way?  Where are they going - are they leaving the site or going back to another page on your site?  Which page?  WHY?
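Page-to-page conversion for a linear funnel can be sketched as follows.  The step names and visit counts are made up, and the functions are only an illustration of the idea.

```python
# Hypothetical linear funnel: (page, visits), in the order visitors move through it.
funnel = [
    ("cart", 1000),
    ("checkout", 400),
    ("payment", 300),
    ("confirmation", 270),
]

def step_conversion(funnel):
    """Return (from_page, to_page, rate) for each consecutive pair of funnel steps."""
    rates = []
    for (page, visits), (next_page, next_visits) in zip(funnel, funnel[1:]):
        rate = next_visits / visits if visits else 0.0
        rates.append((page, next_page, rate))
    return rates

def leakiest_step(funnel):
    """The step with the lowest page-to-page conversion rate."""
    return min(step_conversion(funnel), key=lambda step: step[2])

for page, next_page, rate in step_conversion(funnel):
    print(page, "->", next_page, round(rate, 2))
print("Leakiest step:", leakiest_step(funnel)[:2])
```

The step with the lowest rate is the first place to go looking, with all the "why?" questions above.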



3.  Pages with high exit rates
People have to leave your site - it's a matter of fact.  The question is - are they leaving at appropriate exit points, or are they leaving too early?  Some pages on your site are destination pages, and that's not just the 'thank you for your order' page.    There are other pages where visitors are able to identify product features, find out what they want to know, or download a PDF.  These are all acceptable exit pages, and a high exit rate on these pages is probably not a bad thing.  Just to explain - the exit rate is the number of exits from a page, divided by the number of page views for the page, typically expressed as a percentage.

However, other pages are navigation pages - section pages, category pages, header pages, hub pages, whatever you choose to call them.  The page purpose here is to get people deeper into the site, and if people are leaving on these pages, then visitors are not fulfilling their visit purpose because the pages aren't working properly.   This is similar to the leaky funnel for a non-linear path, but in the same way, it indicates that something on the page isn't optimal.
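To make the distinction concrete, here is a small sketch that calculates exit rate and flags only the navigation pages where it looks high.  The pages, the page types and the 30% threshold are all hypothetical; you would choose your own.

```python
def exit_rate(exits, page_views):
    """Exits divided by page views for the page, expressed as a percentage."""
    if page_views == 0:
        return 0.0
    return 100.0 * exits / page_views

# Hypothetical pages: (name, page type, exits, page views).
pages = [
    ("order-confirmation", "destination", 950, 1000),
    ("category-cameras", "navigation", 600, 1500),
    ("spec-sheet-pdf", "destination", 400, 500),
    ("home-hub", "navigation", 300, 3000),
]

def navigation_pages_to_review(pages, threshold=30.0):
    """Names of navigation pages whose exit rate is above the threshold (in percent)."""
    return [
        name
        for name, page_type, exits, views in pages
        if page_type == "navigation" and exit_rate(exits, views) > threshold
    ]

print(navigation_pages_to_review(pages))
```

The destination pages have the highest exit rates in this made-up data, but they're deliberately ignored, because leaving from those pages is acceptable.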


 
4.  In response to customer comments. 
If you have a survey or feedback mechanism on your site, then take time to read the comments that your visitors have left. Visitors won't necessarily answer your design questions, but their comments can support an existing test idea you have, shed light on an issue you've identified with your traffic analysis, or provide you with new test ideas. And they aren't usually hesitant about telling you where the weaknesses in your site are, so be prepared to face some fierce criticism about your site.


The anonymity of a customer survey often leads some visitors to tell you exactly what they think about your site - so don't take it personally! Comments will vary from 'Your site is great' through to 'your site is dreadful' but may take in, 'I can't find the link to track my order,' and 'I can't find spare batteries for my camera' which will help focus your testing efforts.

So, review your stats; check your campaign metrics and listen to what your customers are telling you - you're bound to find some ideas for improving your site, and for testing your own solutions to the problems you've found.  Would you agree?  Do you have other ways of generating test ideas?

In my next post in this series, I intend to look at how long to run a test for and explain statistical significance, confidence and when to call a winner.

Tuesday 26 March 2013

Chemistry Dictionary: Adrenaline (epinephrine)


Adrenaline (epinephrine)

Adrenaline is a hormone, which is a chemical messenger in the body.  When the body is panicked, adrenaline is released into the bloodstream, and it acts on many parts of the body.  It tells the liver to release glucose (sugar) into the bloodstream; it tells the heart to pump faster, and tells the airways to open to get more air into the lungs and more oxygen into the bloodstream.  This is called the ‘fight or flight’ response, as the body prepares to respond to a perceived threat.

The shape of the adrenaline molecule fits into specific ‘receptors’, called adrenergic receptors, found on the cells in the heart, liver and lungs (and many other organs too), and when the adrenaline molecule fits into one of these receptors, it activates the receptor and tells the organs (through further messages) to respond in their own specific way.

Adrenaline was first artificially synthesised in 1904, and since then has become a common treatment for anaphylactic shock. It can be quickly administered to people showing signs of severe allergic reactions, and some people with known severe allergies carry epinephrine auto-injectors in case of an emergency.  Adrenaline is also one of the main drugs used to treat patients who have a low cardiac output (the amount of blood the heart pumps) and cardiac arrest. It can stimulate the heart muscle and increase the person's heart rate.

It's also a useful starting point for many drugs, because it has a wide range of effects on the body.  For example, its effect on the lungs means that a variation on adrenaline can be used to treat asthma.  One particularly successful drug is salbutamol, and the salbutamol molecule has a lot in common with adrenaline.
Adrenaline
Salbutamol

The differences between salbutamol and adrenaline make salbutamol more "specific" - in other words, salbutamol is designed (or adapted) to target just the soft tissue in the lungs and wind-pipe, and affect the heart less strongly.  If you think of adrenaline as a super key that can open many doors, then salbutamol is an adapted key that's only able to open some doors.



You may recall diagrams such as these from school chemistry classes - chemicals and molecules being illustrated by a series of carbon, oxygen and hydrogen atoms joined together by little lines.  The manufacturers of pharmaceutical compounds pay very close attention to these diagrams.  After all, the difference between a successful drug and a dangerous, toxic or addictive one is often just a hydrogen atom here, a carbon atom there.  Any drug which is released and authorised for sale in the UK has gone through rigorous checking to ensure that it is effective and that any side effects are also known.  Adrenaline is an ideal starting point for drugs, given its widespread effect on the human body; however, it's possible to begin with other starting points, and look to achieve different effects.

Sadly, in the UK, there has recently been an explosion of compounds which mimic the effects of popular illegal drugs such as cocaine, ecstasy and cannabis, but are chemically different enough to avoid being illegal.  Keeping up with the new highs is difficult. Chemical compounds are effectively legal until they are banned, which means the UK Government has no choice but to be reactive once a chemical hits the market, and must move swiftly to determine whether it should remain legal.  A recent report from the European Monitoring Centre for Drugs and Drug Addiction stated that one new legal high was being “discovered” every week in 2011. Additionally, the number of online shops offering at least one psychoactive substance rose from 314 in 2011 to 690 in 2012.
