Web Optimisation, Maths and Puzzles: hypothesis


Showing posts with label hypothesis. Show all posts

Thursday, 21 December 2017

How did a Chemistry Graduate get into Online Testing?

When people examine my CV, they are often intrigued by how a graduate specialising in chemistry transferred into web analytics, and into online testing and optimisation.  Surely there's nothing in common between the two?

I am at a slight disadvantage - after all, I can't exactly say that I always wanted to go into website analysis when I was younger.  No; I was quite happy playing on my home computer, an Acorn Electron with its 32KB of RAM and 8-bit processor running at 1MHz, and the internet hadn't been invented yet.  You needed to buy an external interface just to connect it to a temperature gauge or control an electrical circuit - we certainly weren't talking about the 'internet of things'.  But at school, I was good at maths, and particularly good at science which was something I especially enjoyed.  I carried on my studies, specialising in maths, chemistry and physics, pursuing them further at university.  Along the way, I bought my first PC - a 286 with 640KB memory, then upgraded to a 486SX 25MHz with 2MB RAM, which was enough to support my scientific studies, and enabled me to start accessing the information superhighway.

Nearly twenty years later, I'm now an established web optimization professional, but I still have my interest in science, and in particular chemistry.  Earlier this week, I was reading through a chemistry textbook (yes, it's still that level of interest), and found this interesting passage on experimental method.  It may not seem immediately relevant, but substitute "online testing" or "online optimisation" for Chemistry, and read on.

Despite what some theoreticians would have us believe, chemistry is founded on experimental work.   An investigative sequence begins with a hypothesis which is tested by experiment and, on the basis of the observed results, is ratified, modified or discarded.   At every stage of this process, the accurate and unbiased recording of results is crucial to success.  However, whilst it is true that such rational analysis can lead the scientist towards his goal, this happy sequence of events occurs much less frequently than many would care to admit. 

I'm sure you can see how the practice and thought processes behind chemical experiments translate into care and planning for online testing.  I've been blogging about valid hypotheses and tests for years now - clearly the scientific thinking in me successfully made the journey from the lab to the website.  And the comment that the "happy sequence of experiment winners happens less frequently than many would care to admit" is particularly pertinent, and I would have to agree with it (although I wouldn't like to admit it).  Be honest: how many of your tests win?  After all, we're not doing experimental research purely for academic purposes - we're trying to make money, and our jobs are to get winners implemented and make money for our companies (while upholding our reputations as subject-matter experts).

The textbook continues...

Having made the all important experimental observations, transmitting this information clearly to other workers in the field is of equal importance.   The record of your observations must be made in such a manner that others as well as yourself can repeat the work at a later stage.   Omission of a small detail, such as the degree of purity of a particular reagent, can often render a procedure irreproducible, invalidating your claims and leaving you exposed to criticism.   The scientific community is rightly suspicious of results which can only be obtained in the hands of one particular worker!

The terminology is quite subject-specific here, but with a little translation, you can see how this also applies to online testing.  In the scientific world, there's a far greater emphasis on sharing results with peers - in industry, we tend to keep our major winners to ourselves, unless we're writing case studies (and ask yourself why we read case studies anyway) or presenting at conferences.  But when we do write or publish our results, it's important that we explain exactly how we achieved that massive 197% lift in conversion - otherwise we'll end up "invalidating our claims and leaving us exposed to criticism.  The scientific community [and the online community even more so] is rightly suspicious of results which can only be obtained in the hands of one particular worker!"  Isn't that the truth?

Having faced rigorous scrutiny and peer review of my work in a laboratory, I know how to address questions about the performance of my online tests.   Working with online traffic is far safer than handling hazardous chemicals, but the effects of publishing spurious or inaccurate results are equally damaging to an online marketer or a laboratory-based chemist.  Online and offline scientists alike have to be thoughtful in their experimental practice, rigorous in their analysis and transparent in their methodology and calculations.  


Excerpts taken from Experimental Organic Chemistry: Principles and Practice by L M Harwood and C J Moody, published by Blackwell Scientific Publications in 1989 and reprinted in 1990.

Wednesday, 11 February 2015

Pitfalls of Online Optimisation and Testing 2: Spot the Difference

The second pitfall in online optimisation that I would like to look at is why we obtain flat results - totally, completely flat results at all levels of the funnel.  All metrics show the same results - bounce rate, exit rate, cart additions, average order value, order conversion. There is nothing to choose between the two recipes, despite a solid hypothesis and analytics which support your idea.

The most likely cause is that the changes you made in your test recipe were just not dramatic enough.  There are different types of change you could test:
 
*  Visual change (the most obvious) 
*  Path change (where do you take users who click on a "Learn more" link?)
*  Interaction change (do you have a hover state? Is clicking different from hovering? How do you close a pop-up?)


Sometimes, the change could be dramatic but the problem is that it was made on an insignificant part of the site or page.  If you carried out an end-to-end customer journey through the control experience and then through the test experience, could you spot the difference?  Worse still, did you test on a page which has traffic but doesn't actively contribute to your overall sales (is its order participation virtually zero)?
Is your hypothesis wrong? Did you think the strap line was important? Have you in fact discovered that something you thought was important is being overlooked by visitors?
Are you being too cautious - is there too much at stake and you didn't risk enough? 

Is that part of the site getting traffic? And does that traffic convert? Or is it just a traffic backwater or a pathing dead end?  It could be that you have unintentionally uncovered an area of your site which is not contributing to site performance.

Do your success metrics match your hypothesis? Are you optimising for orders on your Customer Support pages? Are you trying to drive down telephone sales?
Some areas of the site are critical, and small changes make big differences. Other parts are like background noise that users filter out (which is a shame when we spend so much time and effort selecting a typeface, colour scheme and imagery that supports our brand!). We agonise over the photos we use on our sites and select the best images and icons... and they're just hygiene factors that users barely glance at.  By contrast, some elements really are critical - persuasive copy, clear calls to action, product information and specifications.  What we need to know, and can find out through our testing, is what matters and what doesn't.

Another possibility is that you made two counteracting changes - one improved conversion, and the other worsened it, so that the net change is close to zero. For example, did you make it easier for users to compare products by making the comparison link larger, but put it higher on the page, which pushed other important information to a lower position where it wasn't seen?  I've mentioned this before in the context of landing page bounce rate - it's possible to improve the click-through rate on an email or advert by promising huge discounts and low prices... but if the landing page doesn't reflect those offers, then people will bounce off it alarmingly quickly.  This should show up in funnel metrics, so make sure you're analysing each step in the funnel, not just cart conversion (user added an item to cart) and order conversion (user completed a purchase).
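As a minimal sketch of that step-by-step check (the funnel stages and every number below are invented for illustration), comparing each step separately can expose two opposing changes that cancel out in the end-to-end rate:

```python
# Hypothetical funnel counts for a control and a test recipe (all numbers invented).
funnel = {
    "control": {"product page": 10000, "cart": 1800, "checkout": 900, "order": 450},
    "test":    {"product page": 10000, "cart": 2100, "checkout": 880, "order": 452},
}
steps = ["product page", "cart", "checkout", "order"]

for recipe, counts in funnel.items():
    print(recipe)
    # Conversion at each individual step of the funnel
    for upper, lower in zip(steps, steps[1:]):
        print(f"  {upper} -> {lower}: {counts[lower] / counts[upper]:.1%}")
    # The overall rate looks flat even though two steps moved in opposite directions
    print(f"  overall: {counts['order'] / counts['product page']:.2%}")
```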


Alternatively:  did you help some users, but deter others?  Segment your data - new vs returning, traffic source, order value...  did everybody from all segments perform exactly as they did previously, or did the new visitors benefit from the test recipe, while returning visitors found the change unhelpful?
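Here's a minimal sketch of that segmentation, with invented numbers deliberately chosen so the overall result looks flat while the segments move in opposite directions:

```python
# Invented example: overall conversion is flat, but the segments disagree.
segments = {
    # segment: (control visits, control orders, test visits, test orders)
    "new visitors":       (5000, 150, 5000, 180),
    "returning visitors": (5000, 250, 5000, 222),
}

totals = {"control": [0, 0], "test": [0, 0]}
for name, (c_visits, c_orders, t_visits, t_orders) in segments.items():
    c_rate, t_rate = c_orders / c_visits, t_orders / t_visits
    print(f"{name}: control {c_rate:.2%}, test {t_rate:.2%}, lift {(t_rate - c_rate) / c_rate:+.1%}")
    totals["control"][0] += c_visits
    totals["control"][1] += c_orders
    totals["test"][0] += t_visits
    totals["test"][1] += t_orders

# The blended figures hide what each segment is doing
for recipe, (visits, orders) in totals.items():
    print(f"{recipe} overall: {orders / visits:.2%}")
```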

In conclusion, if your results are showing you that your performance is flat, that's not necessarily the same as 'nothing happened'.  If it's true that nothing happened, then you've proved something different - that your visitors are more resilient (or perhaps resistant) to the type of change you're making.  You've shown that the area you've tested, and the way you've tested it, don't matter to your visitors.  Drill down as far as possible to understand if you've genuinely got flat results, and if you have, you can either test much bigger changes on this part of the site, or stop testing here completely, and move on.

The articles in the Pitfalls of Online Optimisation and Testing series:

Article 1:  Are your results really flat?
Article 2: So your results really are flat - why?  
Article 3: Discontinuous Testing

Thursday, 14 August 2014

I am a power-tool A/B skeptic

I have recently enjoyed reading Peter W Szabo's article entitled, "I am an A/B testing skeptical."  Sure, it's a controversial title (especially when he shared it in the LinkedIn Group for web analysts and optimisers), but it's thought-provoking nonetheless.

And reading it has made me realise:  I am a power-drill skeptic. I've often wondered what the benefit of having the latest Black and Decker power tool might actually be.  After all, there are plenty of hand drills out there that are capable of drilling holes in wood, brick (if you're careful) and even metal sheet. The way I see it, there are five key reasons why power drills are not as good as hand-drilling (and I'm not going to discuss the safety risks of holding a high-powered electrical device in your hand, or the risks of flying dust and debris).

5.  There's no consistency in the size of hole I drill.

I can use a hand drill and by watching how hard I press and how quickly I turn the handle, I can monitor the depth and width of the hole I'm drilling.  Not so with a power drill - sometimes it flies off by itself, sometimes it drills too slowly.  I have read about this online, and I've watched some YouTube videos.  I have seen some experienced users (or professionals, or gurus, or power users) drill a hole which is 0.25 ins diameter and 3 ins deep, but when I try to use the same equipment at home, I find that my hole is much wider (especially at the end) and often deeper.  Perhaps I'm drilling into wood and they're drilling into brick? Perhaps I'm not using the same metal bits in my power drill?  Who knows?



4.  Power drill bits wear out faster.

Again, in my experience, the drill bits I use wear out more quickly with a power drill.  Perhaps leaving them on the side isn't the best place for them, especially in a damp environment.  I have found that my hand drill works fine because I keep it in my toolbox and take care of it, but having several drill bits for my power tool means I don't have space or time to keep track of them all; what happens is that I often try to drill with a power-drill bit that's worn out and a little bit rusty, and the results aren't as good as when the drill bits were new.  The drill bits I buy at Easter are always worn out and rusty by Christmas.

The professionals always seem to be using shiny new tools and bits, but not me.  But, as I said, this hasn't been a problem previously because having one hand-drill with only a small selection of bits has made it easier to keep track of them.  That's a key reason why power tools aren't for me. 

3.  Most power drills are a waste of time.

Power drills are expensive, especially when compared to the hand tool version.  They cost a lot of money, and what's the most you can do with them?  Drill holes.  Or, with careful use, screw in screws.  No, they can't measure how deep the hole should be, or how wide.  Some models claim to be able to tell you how deep your hole is while you're drilling it, but that's still pretty limited.  When I want to put up a shelf, I end up with a load of holes in a wall that I don't want, but that's possibly because I didn't think about the size of the shelf, the height I wanted it or what size of plugs I need to put into the wall to get my shelf to stay up (and remain horizontal).  Maybe I should have measured the wall better first, or something.

2.  I always need more holes 

As I mentioned with power drills being a waste of time, I often find that compared to the professionals I have to drill a lot more holes than usual.  They seem to have this uncanny ability to drill the holes in exactly the right places (how do they do that?) and then put their bookshelves up perfectly.  They seem to understand the tools they're using - the drill, the bits, the screws, the plugs, the wall - and yet when I try to do this with one of their new-fangled power-drills, I end up with too many holes.  I keep missing what I'm aiming for; perhaps I need more practice.  As it is, when I've finished one hole, I can often see how I could make it better and what I need to do, and get closer and closer with each of the subsequent holes I drill.  Perhaps the drill is just defective?

1.  Power drills will give you holes, but they won't necessarily be the right size

 This pretty much sums up power drills for me, and the largest flaw that's totally inherent in power tools.  I've already said that they're only useful for drilling holes, and that the holes are often too wide, too short and in the wrong place. In some cases, when one of my team has identified that the holes are in the wrong place, they've been able to quickly suggest a better location - only to then find that that's also incorrect, and then have two wrong holes and still no way of completing my job.  It seems to me that drilling holes and putting up bookshelves (or display shelving, worse still) is something that's just best left to the professionals.  They can afford the best drill bits and the most expensive drills, and they also have the money available to make so many mistakes - it's clear to me that they must have some form of Jedi mind power, or foreknowledge of the kinds of holes they want to drill and where to drill them. 

In conclusion:

Okay, you got me, perhaps I am being a little unkind, but I genuinely believe that A/B testing and the tools to do it are extremely valuable.  There are a lot of web analytics and A/B professionals out there, but there is also a large number of amateurs who want to try their hand at online testing and who get upset or confused when it doesn't work out.  Like any skilled profession, if you want to do analytics and optimisation properly, you can be sure it's harder than it looks (especially if it looks easy).  It takes time and thought to run a good test (let alone build up a testing program) and to make sure that you're hitting the target you're aiming for.  Why are you running this test?  What are you trying to improve?  What are you trying to learn?  It takes more than just the ability to split traffic between two or more designs to run a test.

Yes, I've parodied Peter W Szabo's original article, but that's because it seemed to me the easiest way to highlight some of the misconceptions that he's identified, and which exist in the wider online optimisation community - especially the ideas that 'tests will teach you useful things', and the underlying misconception that 'testing is quick and easy'.  I will briefly mention that you need a reason to run a test (just as you need a reason to drill a hole) and you need to do some analytical thinking (using other tools, not just testing tools) in the same way as you would use a spirit level, a pencil and a ruler when drilling a hole.

Drilling the hole in the wall is only one step in the process of putting up a bookshelf; splitting traffic in a test should be just one step in the optimisation process, and should be preceded by some serious thought and design work, and followed up with careful review and analysis.  Otherwise, you'll never put your shelf up straight, and your tests will never tell you anything.

Wednesday, 16 July 2014

When to Quit Iterative Testing: Snakes and Ladders



I have blogged a few times about iterative testing, the process of using one test result to design a better test and then repeating the cycle of reviewing test data and improving the next test.  But there are instances when it's time to abandon iterative testing, and play analytical snakes and ladders instead.  Surely not?  Well, there are some situations where iterative testing is not the best tool (or not a suitable tool) to use in online optimisation, and it's time to look at other options. 

Three situations where iterative testing is totally unsuitable:

1.  You have optimised an area of the page so well that you're now seeing the law of diminishing returns - your online testing is showing smaller and smaller gains with each test and you're reaching the top of the ladder.
2.  The business teams have identified another part of the page or site that is a higher priority than the area you're testing on.
3.  The design teams want to test something game-changing, which is completely new and innovative.

This is no bad thing.

After all, iterative testing is not the be-all-and-end-all of online optimization.  There are other avenues that you need to explore, and I've mentioned previously the difference between iterative testing and creative testing.  I've also commented that fresh ideas from outside the testing program (typically from site managers who have sales targets to hit) are extremely valuable.  All you need to work out is how to integrate these new ideas into your overall testing strategy.  Perhaps your testing strategy is entirely focused on future-state (it's unlikely, but not impossible). Sometimes, it seems, iterative testing is less about science and hypotheses, and more like a game of snakes and ladders.

Three reasons I've identified for stopping iterative testing.

1.  It's quite possible that you reach the optimal size, colour or design for a component of the page.  You've followed your analysis step by step, as you would follow a trail of clues or footsteps, and it's led you to the top of a ladder (or a dead end) and you really can't imagine any way in which the page component could be any better.  You've tested banners, and you know that a picture of a man performs better than a woman, that text should be green, the call to action button should be orange and that the best wording is "Find out more."  But perhaps you've only tested having people in your banner - you've never tried having just your product, and it's time to abandon iterative testing and leap into the unknown.  It's time to try a different ladder, even if it means sliding down a few snakes first.

2.  The business want to change focus.  They have sales performance data, or sales targets, which focus on a particular part of the catalogue:  men's running shoes; ladies' evening shoes, or high-performance digital cameras.  Business requests can change far more quickly than test strategies, and you may find yourself playing catch-up if there's a new priority for the business.  Don't forget that it's the sales team who have to maintain the site, meet the targets and maximise their performance on a daily basis, and they will be looking for you to support their team as much as plan for future state.  Where possible, transfer the lessons and general principles you've learned from previous tests to give yourself a head start in this new direction - it would be tragic if you have to slide down the snake and start right at the bottom of a new ladder.

3.  On occasions, the UX and design teams will want to try something futuristic, that exploits the capabilities of new technology (such as Scene 7 integration, AJAX, a new API, XHTML... whatever).  If the executive in charge of online sales, design or marketing has identified or sponsored a brand new online technology that will probably revolutionise your site's performance, and he or she wants to test it, then it'll probably get fast-tracked through the testing process.  However, it's still essential to carry out due diligence in the testing process, to make sure you have a proper hypothesis and not a HIPPOthesis.  When you test the new functionality, you'll want to be able to demonstrate whether or not it's helped your website, and how and why.  You'll need to have a good hypothesis and the right KPIs in place.  Most importantly - if it doesn't do well, then everybody will want to know why, and they'll be looking to you for the answers.  If you're tracking the wrong metrics, you won't be able to answer the difficult questions.

As an example, Nike have an online sports shoe customisation option - you can choose the colour and design for your sports shoes, using an online palette and so on.  I'm guessing that it went through various forms of testing (possibly even A/B testing) and that it was approved before launch.  But which metrics would they have monitored?  Number of visitors who tried it?  Number of shoes configured?  Or possibly the most important one - how many shoes were purchased?  Is it reasonable to assume that because it's worked for Nike, that it will work for you, when you're looking to encourage users to select car trim colours, wheel style, interior material and so on?  Or are you creating something that's adding to a user's workload and making it less likely that they will actually complete the purchase?

So, be aware:  there are times when you're climbing the ladder of iterative testing that it may be more profitable to stop climbing, and try something completely different - even if it means landing on a snake!

Wednesday, 9 July 2014

Why Test Recipe KPIs are Vital

Imagine a straightforward A/B test, between a "red" recipe and a "yellow" recipe.  There are different nuances and aspects to the test recipes, but for the sake of simplicity the design team and the testing team just codenamed them "red" and "yellow".  The two test recipes were run against each other, and the results came back.  The data was partially analysed, and a long list of metrics was produced.  Which one is the most important?  Was it bounce rate? Exit rate? Time on page?  Does it really  matter?

Let's take a look at the data, comparing the "yellow" recipe (on the left) and the "red" recipe (on the right).

[Table image: match statistics comparing the "yellow" and "red" teams]

As I said, there's a large number of metrics.  And if you consider most of them, it looks like it's a fairly close-run affair.  

The yellow team on the left had:
*  28% more shots
*  8.3% more shots on target
*  22% fewer fouls (a good result)
*  similar possession (4% more, probably with moderate statistical confidence)
*  40% more corners
*  less than half the number of saves (it's debatable whether more or fewer saves is better, especially if you look at the alternative to a save)
*  more offsides and more yellow cards (1 vs 0)

So, by most of these metrics, the yellow team (or the yellow recipe) had a good result.  You might even say they had the better game.

However, the main KPI for this test is not how many shots, or shots on target.  The main KPI is goals scored, and if you look at this one metric, you'll see a different picture.  The 'red' team (or recipe) achieved seven goals, compared to just one for the yellow team.

In A/B testing, it's absolutely vital to understand in advance what the KPI is.  Key Performance Indicators are exactly that:  key.  Critical.  Imperative.  There should be no more than two or three KPIs, and they should match closely to the test plan which, in turn, should come from the original hypothesis.  If your test recipe is designed to reduce bounce rate, there is little point in measuring successful leads generated.  If you're aiming for improved conversion, why should you look at time on page?  These other metrics are not-key performance indicators for your test.
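As a small sketch of declaring the KPI up front (the hypothesis, metric names and numbers below are all invented), the test is judged on its primary KPI while the guardrail metrics are only watched:

```python
# Illustrative sketch: declare the primary KPI before the results arrive,
# and judge the test on it alone.  All names and numbers are invented.
test_plan = {
    "hypothesis": ("If we simplify the cart page, then order conversion will rise "
                   "because fewer visitors will be distracted before checkout."),
    "primary_kpi": "order_conversion",
    "guardrails": ["average_order_value", "bounce_rate"],  # watched, but they don't pick the winner
}

results = {  # metric: (control, test)
    "order_conversion":    (0.040, 0.046),
    "average_order_value": (52.00, 51.40),
    "bounce_rate":         (0.35, 0.34),
    "time_on_page":        (64.0, 71.0),   # a 'not-key' performance indicator for this test
}

control, test = results[test_plan["primary_kpi"]]
print(f"Primary KPI ({test_plan['primary_kpi']}): {(test - control) / control:+.1%} lift")
for metric in test_plan["guardrails"]:
    c, t = results[metric]
    print(f"Guardrail {metric}: control {c}, test {t}")
```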

Sadly, Brazil's data on the night was not sufficient for them to win - even though many of their metrics from the game were good, they weren't the key metrics.  Maybe a different recipe is needed.

Wednesday, 7 May 2014

Building Testing Program Momentum

I have written previously about getting a testing program off the ground, and selling the idea of testing to management.  It's not easy, but hopefully you'll be able to start making progress and getting a few quick wins under your belt.  Alternatively, you may have some seemingly disastrous tests where everything goes negative, and you wonder if you'll ever get a winner.  I hope that either way, your testing program is starting to provide some business intelligence for you and your company, and that you're demonstrating the value of testing.  Providing positive direction for the future is nice, providing negative direction ("don't ever implement this") is less pleasant but still useful for business.

In this article, I'd like to suggest ways of building testing momentum - i.e. starting to develop from a few ad-hoc tests into a more systematic way of testing.  I've talked about iterative testing a few times now (I'm a big believer) but I'd like to offer practical advice on starting to scale up your testing efforts.

Firstly, you'll find that you need to prioritise your testing efforts.  Which tests are - potentially - going to give you the best return?  It's not easy to say; after all, if you knew the answer you wouldn't have to test.  But look at the high traffic pages, the high entry pages (lots of traffic landing) and the major leaking points in your funnel.  Fixing these pages will certainly help the business.  You'll need to look at potential monetary losses for not fixing the pages (and remember that management typically pays more attention to £ and $ than they do to % uplift).
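As a rough sketch of that kind of prioritisation (the pages, traffic figures, exit rates and order values below are all invented, as is the assumption that a fix recovers 10% of the visitors who currently leave), you can rank the opportunities in currency rather than percentages:

```python
# Rough prioritisation sketch: estimate the monthly revenue at stake for each leaky page.
pages = [
    # (page, monthly visits, exit rate, conversion of visitors who continue, average order value)
    ("home page",    400_000, 0.55, 0.030, 80.0),
    ("product page", 250_000, 0.45, 0.060, 80.0),
    ("cart page",     60_000, 0.35, 0.250, 80.0),
]

def revenue_at_stake(visits, exit_rate, conversion, aov, assumed_fix=0.10):
    """Value of recovering 10% of the visitors currently exiting at this page."""
    recovered = visits * exit_rate * assumed_fix
    return recovered * conversion * aov

for page, visits, exit_rate, conversion, aov in pages:
    print(f"{page}: ~£{revenue_at_stake(visits, exit_rate, conversion, aov):,.0f} per month")
```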

Secondly - consider the capacity of your testing team.  Is your testing team made up of you, a visual designer and a single Javascript developer, or perhaps a share of a development team's time when they can spare some capacity?  There's still plenty of potential there, but plan accordingly.  I've mentioned previously that there's plenty of testing opportunity available in the wording, position and colour of CTA buttons, and that you don't always need to have major design changes to see big improvements in site performance.


So many ideas, but which is best? One way to find out: run a test!

Thirdly - it's possible to dramatically increase the speed (and therefore capacity) of your testing program by working in two different areas or directions at the same time - not running two tests simultaneously, but developing and running them in parallel.  For example, let's suppose you want to test the call to action buttons on your product pages, and you also want to test how you show discounted prices.  These should be relatively easy to design and develop - it's mostly text and colour changes that you're focusing on.  Do you show the new price in green, and the original price in red? Do you add a strikethrough on the original price?  What do you call the new price - "offer" or "reduced"?  There's plenty to think about, and it seems everybody does it differently.  And for the call-to-action button - there's wording, shape (rounded or square corners), border, arrow...  the list goes on.

Now; if you want to test just call-to-action buttons, you have to develop the test (two weeks), run the test (two weeks), analyse the results (two weeks) and then develop the next test (two weeks more).  This is a simplified timeline, but it shows you that you'll only be testing on your site for two weeks out of six (the other four are spent analysing and developing).  Similarly, your development resource is only going to be working for two out of six weeks, and if there's capacity available, then it makes sense to use it.

I have read a little on critical path analysis (and that's it - nothing more), but it occurred to me that you could double the speed of your testing program by running two mini-programs alongside each other; let's call them Track A and Track B.  While Track A is testing, Track B could be in development, and then, when the test in Track A is complete, you can switch it off and launch the test in Track B.  It's a little oversimplified, so here's a more plausible timeline:





Start with Track A first, and design the hypothesis.  Then, submit it to the development team to write the code, and when it's ready, launch the test - Test A1.  While the test is running, begin on the design and hypothesis for the first test in Track B - Test B1.  Then, when it's time to switch off Test A1, you can swap over and launch Test B1.  That test will run, accumulating data and then, when it's complete, you can switch it off.  While test B1 is running, you can review the data in test A1, work out what went well, what went badly - review the hypothesis and improve, then design the next iteration.
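As a loose sketch of that interleaving (the stages below simply restate the sequence described above, and stage lengths are deliberately left vague - real tests run long or sit pending, as the next paragraphs discuss):

```python
# A loose sketch of the two-track interleaving described above; each entry pairs
# what Track A and Track B are doing at the same stage.  Purely illustrative.
stages = [
    ("design and develop A1",                   "idle"),
    ("RUN A1",                                  "design and develop B1"),
    ("analyse A1, design and develop A2",       "RUN B1"),
    ("RUN A2 (launched as B1 is switched off)", "analyse B1, design and develop B2"),
    ("analyse A2, design and develop A3",       "RUN B2"),
]
for number, (track_a, track_b) in enumerate(stages, start=1):
    print(f"Stage {number}: Track A -> {track_a:<42} | Track B -> {track_b}")
```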




If everything works perfectly, you'll reach point X on my diagram and Test A2 will be ready to launch when Test B1 is switched off.


However, we live in the real world, and test A2 isn't quite as successful as it was meant to be.  It takes quite some time to obtain useful data, and the conversion uplift that you anticipated has not happened - it's taking time to reach statistical significance, and so you have to keep it running for longer.  Meanwhile, Test B2 is ready - you've done the analysis, submitted the new design for development, and the developers have completed the work.  This means that test B2 is now pending.  Not a problem - you're still utilising all your site traffic for testing, and that's surely an improvement on the 33% usage (two weeks testing, four weeks other activity) you had before.

Eventually, at point Y, test A2 is complete, you switch it off and launch Test B2, which has been pending for a few days/weeks.  However, Test B2 is a disaster and conversion goes down very quickly; there's no option to keep it running.  (If it was trending positively, then you could keep it running).  Even though the next Track A test is still in development, you have got to pull the test - it's clearly hurting site performance and you need to switch it off as soon as possible.

I'm sure parallel processing has been applied in a wide range of other business projects, but this idea translates really well into the world of testing, especially if you're planning to start increasing the speed and capacity of your testing program.  I will give some thought to other ways of increasing test program capacity, and - hopefully - write about this in the near future.





Thursday, 9 January 2014

When Good Tests Fail

Seth Godin, the marketing and online business expert, recently stated simply that, 'The answer to the question, "What if I fail?" is "You will."  The real question is, "What after I fail?"'

Despite rigorous analytics, careful usability studies and thoughtful designing, the results from your latest A/B test are bad.  Conversion worsened; average order value plummeted and people bounced off your home page like it was a trampoline.  Your test failed.  And, if you're taking it personally (and most online professionals do take it very personally), then you failed too.

But, before the boss slashes your optimisation budget, you have the opportunity to rescue the test, by reviewing all the data and understanding the full picture.  Your test failed - but why?  I've mentioned before that tests which fail draw far more attention than those which win - it's just human nature to explore why something went wrong, and we like to attribute blame or responsibility accordingly.  That's why I pull apart my Chess games to find out why I lost.  I want to improve my Chess (I'm not going to stop playing, or fire myself from playing Chess).

So, the boss asks the questions - why did your test fail?  (And it's suddenly stopped being his test, or our test... it's yours.)  Where's the conversion uplift we expected?  And why aren't profits rising?

It's time to review the test plan, the hypothesis and the key questions. Which of these apply to your test?

Answer 1.  The hypothesis was not entirely valid. I have said before that, "If I eat more chocolate, I'll be able to run faster because I will have more energy."  What I failed to consider is the build up of fat in my body, and that eating all that chocolate has made me heavier, and hence I'm actually running more slowly.  I'm not training enough to convert all that fat into movement, and the energy is being stored as fat.

Or, in an online situation:  the idea was proved incorrect.  Somewhere, one of the assumptions that was made was wrong.  This is where the key test questions come in.  The analysis that comes from answering these key questions will help retrieve your test from 'total failure' to 'learning experience'.

Sometimes, in an online context, the change we made in the test had an unforeseen side-effect.  We thought we were driving more people from the product pages to the cart, but they just weren't properly prepared.  We had the button at the bottom of the page, and people who scrolled to the bottom of the page saw the full specs of the new super-toaster and how it needs an extra battery-pack for super-toasting.  We moved the button up the page, more people clicked on it, but realised only at the cart page that it needed the additional battery pack.  We upset more people than we helped, and overall conversion went down.

Unforeseen side-effects in testing leading to adverse performance: too much chocolate slows down 100m run times due to increased body mass.
Answer 2.  The visual design of the test recipe didn't address the test hypothesis or the key test questions.  In any lab-based scientific experiment, you would expect to set up the apparatus and equipment and take specific measurements based on the experiment you were doing.  You would also set up the equipment to address the hypothesis - otherwise you're just messing about with lab equipment.  For example, if you wanted to measure the force of gravity and how it affects moving objects, you wouldn't design an experiment with a battery, a thermometer and a microphone. 

However, in an online environment, this sort of situation becomes possible, because the skills needed to analyse the data, design the banners and write the HTML or JavaScript code usually belong to different people.  The analyst, the designer and the developer need to work closely together to make sure that the test design which hits the screen is going to answer the original hypothesis, and not something else that the designer believes will 'look nice' or that the developer finds easier to code.  Good collaboration between the key partners in the testing process is essential - if the original test idea doesn't meet brand guidelines, or is extremely difficult to code, then it's better to get everybody together and decide what can be done that will still help prove or disprove the hypothesis.


To give a final example from my chocolate-eating context, I wouldn't expect to prove that chocolate makes me run faster by eating crisps (potato chips) instead.  Unless they were chocolate-coated crisps?  Seriously.


Answer 3.  Sometimes, the test design and execution was perfect, and we measured the right metrics in the right way.  However, the test data shows that our hypothesis was completely wrong.  It's time to learn something new...!

My hypothesis said that chocolate would make me run faster; but it didn't.  Now, I apologise that I'm not a biology expert and this probably isn't correct, but let's assume it is, review the 'data' and find out why.  


For a start, I put on weight (because chocolate contains fat), but worse still, the sugar in chocolate was also converted to fat, and it wasn't converted back into sugar quickly enough for me to benefit from it while running the 100 metres.  Measurements of my speed show I got slower, and measurements of my blood sugar levels before and after the 100 metres showed that the blood sugar levels fell, because the fat in my body wasn't converted into glucose and transferred to my muscles quickly enough.  Additionally, my body mass rose 3% during the testing period, and further analysis showed this was fat, not muscle.  This increased mass also slowed me down.



Back to online:  you thought people would like it if your product pages looked more like Apple's.  But Apple sell a limited range of products - one phone, one MP3 player, one desktop PC, etc. while you sell 15-20 of each of those, and your test recipe showed only one of your products on the page (the rest were hidden behind a 'View More' link), when you get better financial performance from a range of products.  Or perhaps you thought that prompting users to chat online would help them go through checkout... but you irritated them and put them off.  Perhaps your data showed that people kept leaving your site to talk to you on the phone.  However, when you tested hiding the phone number, in order to get people to convert online, you found that sales through the phone line went down, as expected, but your online sales also fell because people were using the phone line for help completing the online purchase.  There are learnings in all cases that you can use to improve your site further - you didn't fail, you just didn't win ;-)

In conclusion: yes, sometimes test recipes lose.  Hypotheses were incorrect, assumptions were invalid, side-effects were missed and sometimes the test just didn't ask the question it was meant to.  The difference between a test losing and a test failing is in the analysis, and that comes from planning - having a good hypothesis in the first place, and asking the right questions up front which will show why the test lost (or, let's not forget, the reason why a different test won).  So: fail fast and learn quickly!





Tuesday, 7 January 2014

The Key Questions in Online Testing

As you begin the process of designing an online test, the first thing you'll need is a solid test hypothesis.  My previous post outlined this, looking at a hypothesis, HIPPOthesis and hippiethesis.  To start with a quick recap, I explained that a good hypothesis says something like, "IF we make this change to our website, THEN we expect to see this improvement in performance BECAUSE we will have made it easier for visitors to complete their task."  Often, we have a good idea about what the test should be - make something bigger, have text in red instead of black... whatever.  

Stating the hypothesis in a formal way will help to draw the ideas together and give the test a clear purpose.  The exact details of the changes you're making in the test, the performance change you expect, and the reasons for the expected changes will be specific to each test, and that's where your web analytics data or usability studies will support your test idea.  For example, if you're seeing a large drop in traffic between the cart page and the checkout pages, and your usability study shows people aren't finding the 'continue' button, then your hypothesis will reflect this.

In between the test hypothesis and the test execution are the key questions.  These are the key questions that you will develop from your hypothesis, and which the test should answer.  They should tie very closely to the hypothesis, and they will direct the analysis of your test data, otherwise you'll have test data that will lack a focus and you'll struggle to tell the story of the test.  Think about what your test should show - what you'd like it to prove - and what you actually want to answer, in plain English.

Let's take my offline example from my previous post.  Here's my hypothesis:  "If I eat more chocolate, then I will be able to run faster because I will have more energy."

It's good - but only as a hypothesis (I'm not saying it's true, or accurate, but that's why we test!).  But before I start eating chocolate and then running, I need to confirm the exact details of how much chocolate, what distance and what times I can achieve at the moment.  If this was an ideal offline test, there would be two of me, one eating the chocolate, and one not.  And if it was ideal, I'd be the one eating the chocolate :-)

So, the key questions will start to drive the specifics of the test and the analysis.  In this case, the first key question is this:  "If I eat an additional 200 grams of chocolate each day, what will happen to my time for running the 100 metres sprint?"

It may be 200 grams or 300 grams; the 100m or the 200m, but in this case I've specified the mass of chocolate and the distance.  Demonstrating the 'will have more energy' will be a little harder to do.  In order to do this, I might add further questions, to help understand exactly what's happening during the test - perhaps questions around blood sugar levels, body mass, fat content, and so on.  Note at this stage that I haven't finalised the exact details - where I'll run the 100 metres, what form the chocolate will take (Snickers? Oreos? Mars?), and so on.  I could specify this information at this stage if I needed to, or I could write up a specific test execution plan as the next section of my test document.



In the online world I almost certainly will be looking at additional metrics - online measurements are rarely as straightforward as offline.  So let's take an online example and look at it in more detail.

"If I move the call-to-action button on the cart page to a position above the fold, then I will drive more people to start the checkout process because more people will see it and click on it."

And the key questions for my online test?

"How is the click-through rate for the CTA button affected by moving it above the fold?"
"How is overall cart-to-complete conversion affected by moving the button?"
"How are these two metrics affected if the button is near the top of the page or just above the fold?"


As you can see, the key questions specify exactly what's being changed - maybe not to the exact pixel, but they provide clear direction for the test execution.  They also make it clear what should be measured - in this case, there are two conversion rates (one at page level, one at visit level).  This is perhaps the key benefit of asking these core questions:  they drive you to the key metrics for the test.
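A minimal sketch of those two measurements, with invented counts (page views and clicks for the page-level rate, visits for the visit-level rate):

```python
# Invented counts illustrating the two metrics the key questions point to.
recipes = {
    # recipe: (cart page views, CTA clicks, visits reaching the cart, visits completing checkout)
    "control (button below fold)": (20000, 3000, 18000, 1260),
    "test (button above fold)":    (20000, 4100, 18000, 1350),
}

for name, (page_views, cta_clicks, cart_visits, completed_visits) in recipes.items():
    click_through = cta_clicks / page_views             # page-level conversion
    cart_to_complete = completed_visits / cart_visits   # visit-level conversion
    print(f"{name}: CTA click-through {click_through:.1%}, cart-to-complete {cart_to_complete:.1%}")
```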

"Yes, but we want to measure revenue and sales for our test."


Why?  Is your test meant to improve revenue and sales?  Or are you looking to reduce bounce rate on a landing page, or improve the consumption of 'learn' content (whitepapers, articles, user reviews etc.) on your site?  Of course, your site's reason for being is to generate sales and revenue.  Your test data may show a knock-on improvement in revenue and sales, and yes, you'll want to make sure that these vital site-wide metrics don't fall off a cliff while you're testing, but if your hypothesis says, "This change should improve home page bounce rate because..." then I propose that it makes sense to measure bounce rate as the primary metric for the test's success.  I also suspect that you can quickly tie bounce rate to a financial metric through some web analytics - after all, I doubt that anyone would think of trying to improve bounce rate without some view of how much a successful visitor generates.
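One rough way to make that link (this is an assumption for illustration, not a standard formula, and every figure below is invented) is to value the extra visitors who no longer bounce, using the conversion rate and order value of the visitors who stay:

```python
# Rough sketch: put a monetary estimate on a bounce-rate improvement.  All figures invented.
monthly_entries = 200_000                 # visits landing on the page
bounce_control, bounce_test = 0.48, 0.44  # bounce rates for each recipe
conversion_of_non_bounced = 0.035         # of the visitors who stay, how many eventually order
average_order_value = 75.0

extra_stayers = monthly_entries * (bounce_control - bounce_test)
estimated_revenue = extra_stayers * conversion_of_non_bounced * average_order_value
print(f"Extra non-bounced visits per month: {extra_stayers:,.0f}")
print(f"Rough revenue estimate: £{estimated_revenue:,.0f} per month")
```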

So:  having written a valid hypothesis which is backed by analysis, usability or other data (and not just a go-test-this mentality from the boss), you are now ready to address the critical questions for the test.  These will typically be, "How much....?" and "How does XYZ change when...?" questions that will focus the analysis of the test results, and will also lead you very quickly to the key metrics for the test (which may or may not be money-related).

I am not proposing to pack away an extra 100 grams of chocolate per day and start running the 100 metres.  It's rained here every day since Christmas and I'm really not that dedicated to running.  I might, instead, start on an extra 100 grams of chocolate and measure my body mass, blood cholesterol and fat content.  All in the name of science, you understand. :-)

Wednesday, 24 July 2013

The Science of A Good Hypothesis

Good testing requires many things:  good design, good execution, good planning.  Most important is a good idea - or a good hypothesis, but many people jump into testing without a good reason for testing.  After all, testing is cool, it's capable of fixing all my online woes, and it'll produce huge improvements to my online sales, won't it?

I've talked before about good testing, and, "Let's test this and see if it works," is an example of poor test planning.  A good idea, backed up with evidence (data, or usability testing, or other valid evidence) is more likely to lead to a good result.  This is the basis of a hypothesis, and a good hypothesis is the basis of a good test.

What makes a good hypothesis?  What, and why.

According to Wiki Answers, a hypothesis is, "An educated guess about the cause of some observed (seen, noticed, or otherwise perceived) phenomena, and what seems most likely to happen and why. It is a more scientific method of guessing or assuming what is going to happen."

In simple testing terms, a hypothesis states what you are going to test (or change) on a page, what the effect of the change will be, and why the effect will occur.  To put it another way, a hypothesis is an "If... then... because..." statement.  "If I eat lots of chocolate, then I will run more slowly because I will put on weight."  Or, alternatively, "If I eat lots of chocolate, then I will run faster because I will have more energy." (I wish).



However, not all online tests are born equal, and you could probably place the majority of them into one of three groups, based on the strength of the original theory.  These are tests with a hypothesis, tests with a HIPPOthesis and tests with a hippiethesis.

Tests with a hypothesis

These are arguably the hardest tests to set up.  A good hypothesis will rely on the test analyst sitting down with data, evidence and experience (or two out of three) and working out what the data is saying.  For example, the 'what' could be that you're seeing a 93% drop-off between the cart and the first checkout page.   Why?  Well, the data shows that people are going back to the home page, or the product description page.  Why?  Well, because the call-to-action button to start checkout is probably not clear enough.  Or we aren't confirming the total cost to the customer.  Or the button is below the fold.

So, you need to change the page - and let's take the button issue as an example for our hypothesis.  People are not progressing from cart to checkout very well (only 7% proceed).  [We believe that] if we make the call to action button from cart to checkout bigger and move it above the fold, then more people will click it because it will be more visible.

There are many benefits of having a good hypothesis, and the first one is that it will tell you what to measure as the outcome of the test.  Here, it is clear that we will be measuring how many people move from cart to checkout.  The hypothesis says so.  "More people will click it" - the CTA button - so you know you're going to measure clicks and traffic moving from cart to checkout.  A good hypothesis will state after the word 'then' what the measurable outcome should be.

In my chocolate example above, it's clear that eating chocolate will make me either run faster or slower, so I'll be measuring my running speed.  Neither hypothesis (the cart or the chocolate) has specified how big the change is.  If I knew how big the change was going to be, I wouldn't test.  Also, I haven't said how much more chocolate I'm going to eat, or how much faster I'll run, or how much bigger the CTA buttons should be, or how much more traffic I'll convert.  That's the next step - the test execution.  For now, the hypothesis is general enough to allow for the details to be decided later, but it frames the idea clearly and supports it with a reason why.  Of course, the hypothesis may give some indication of the detailed measurements - I might be looking at increasing my consumption of chocolate by 100 g (about 4 oz) per day, and measuring my running speed over 100 metres (about 100 yds) every week.

Tests with a HIPPOthesis

The HIPPO, for reference, is the HIghest Paid Person's Opinion (or sometimes just the HIghest Paid PersOn).  The boss.  The management.  Those who hold the budget control, who decide what's actionable, and who say what gets done.  And sometimes, what they say is that, "You will test this".  There's virtually no rationale, no data, no evidence or anything.  Just a hunch (or even a whim) from the boss, who has a new idea that he likes.  Perhaps he saw it on Amazon, or read about it in a blog, or his golf partner mentioned it on the course over the weekend.  Whatever - here's the idea, and it's your job to go and test it.

These tests are likely to be completely variable in their design.  They could be good ideas, bad ideas, mixed-up ideas or even amazing ideas.  If you're going to run the test, however, you'll have to work out (or define for yourself) what the underlying hypothesis is.  You'll also need to ask the HIPPO - very carefully - what the success metrics are.  Be prepared to pitch this question somewhere between, "So, what are you trying to test?" and "Are you sure this is a productive use of the highly skilled people that you have working for you?"  Any which way, you'll need the HIPPO to determine the success criteria, or agree to yours - in advance.  If you don't, you'll end up with a disastrous recipe being declared a technical winner because it (1) increased time on page, (2) increased time on site or (3) drove more traffic to the Contact Us page, none of which were the intended success criteria for the test, or were agreed up-front, and which may not be good things anyway.

If you have to run a test with a HIPPOthesis, then write your own hypothesis and identify the metrics you're going to examine.  You may also want to try and add one of your own recipes which you think will solve the apparent problem.  But at the very least, nail down the metrics...

Tests with a hippiethesis
Hippie:  noun
a person, especially of the late 1960s, who rejected established institutions and values and sought spontaneity, etc., etc.  Also hippy

The final type of test idea is a hippiethesis - laid back, not too concerned with details, spontaneous and putting forward an idea because it looks good on paper.  "Let's test this because it's probably a good idea that will help improve site performance."  Not as bad as the "Test this!" that drives a HIPPOthesis, but not as fully formed as a hypothesis, the hippiethesis is probably (and I'm guessing) the most common type of test.

Some examples of hippietheses:


"If we make the product images better, then we'll improve conversion."
"The data shows we need to fix our conversion funnel - let's make the buttons blue  instead of yellow."
"Let's copy Amazon because everybody knows they're the best online."

There's the basis of a good idea somewhere in there, but it's not quite finished.  A hippiethesis will tell you that the lack of a good idea is not a problem, buddy, let's just test it - testing is cool (groovy?), man!  The results will be awesome.  

There's a laid-back approach to the test (either deliberate or accidental), where the idea has not been thought through - either because "You don't need all that science stuff", or because the evidence to support a test is very flimsy or even non-existent.  Perhaps the test analyst didn't look for the evidence; perhaps he couldn't find any.  Maybe the evidence is mostly there somewhere because everybody knows about it, but isn't actually documented.  The danger here is that when you (or somebody else) start to analyse the results, you won't recall what you were testing for, what the main idea was or which metrics to look at.  You'll end up analysing without purpose, trying to prove that the test was a good idea (and you'll have to do that before you can work out what it was that you were actually trying to prove in the first place).

The main difference between a hypothesis and a hippiethesis is the WHY.  Online testing is a science, and scientists are curious people who ask why.  Web analyst Avinash Kaushik calls it the three levels of the "so what" test.  If you can't get to something meaningful and useful, or in this case, testable and measurable, within three iterations of "Why?" then you're on the wrong track.  Hippies don't bother with 'why' - that's too organised, formal and part of the system; instead, they'll test because they can, and because - as I said - testing is groovy.

A good hypothesis:  IF, THEN, BECAUSE.

To wrap up:  a good hypothesis needs three things:  If (I make this change to the site) Then (I will expect this metric to improve) because (of a change in visitor behaviour that is linked to the change I made, based on evidence).
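One way to keep those three parts explicit - this is just a sketch, and the field names are my own rather than any standard template - is to hold the hypothesis as a structured record that also commits you to a primary metric:

```python
# A sketch of keeping the three parts of a hypothesis explicit so none can be quietly skipped.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    change: str           # IF      - the change we will make to the site
    expected_effect: str  # THEN    - the measurable outcome we expect
    reason: str           # BECAUSE - the change in visitor behaviour we think drives it
    primary_metric: str   # the metric the THEN clause commits us to measuring

    def statement(self) -> str:
        return f"If {self.change}, then {self.expected_effect} because {self.reason}."

cart_test = Hypothesis(
    change="we make the cart-to-checkout button bigger and move it above the fold",
    expected_effect="more visitors will progress from cart to checkout",
    reason="the call to action will be more visible",
    primary_metric="cart-to-checkout progression rate",
)
print(cart_test.statement())
```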


When there's no if:  you aren't making a change to the site, you're just expecting things to happen by themselves.  Crazy!  If you reconsider my chocolate hypothesis, without the if, you're left with, "I will run faster and I will have more energy".  Alternatively, "More people will click and we'll sell more."  Not a very common attitude in testing, and more likely to be found in over-optimistic entrepreneurs :-)

When there's no then:  If I eat more chocolate, I will have more energy.  So what?  And how will I measure this increased energy?  There are no metrics here.  Am I going to measure my heart rate, blood pressure, blood sugar level or body temperature?  In an online environment:  will this improve conversion, revenue, bounce rate, exit rate, time on page, time on site or average number of pages per visit?  I could measure any one of these and 'prove' the hypothesis.  At its worst, a hypothesis without a 'then' would read as badly as, "If we make the CTA bigger, [then we will move more people to cart], [because] more people will click." which becomes "If we make the CTA bigger, more people will click."  That's not a hypothesis, that's starting to state the absurdly obvious.


When there's no because:  If I eat more chocolate, then I will run faster.  Why?  Why will I run faster?  Will I run slower?  How can I run even faster?  There are metrics here (speed) but there's no reason why.  The science is missing, and there's no way I can actually learn anything from this and improve.  I will execute a one-off experiment and get a result, but I will be none the wiser about how it happened.  Was it the sugar in the chocolate?  Or the caffeine?

And finally, I should reiterate that an idea for a test doesn't have to be detailed, but it must be backed up by data (some, even if it's not great).  The more evidence the better:  think of a sliding scale from no evidence (could be a terrible idea), through to some evidence (a usability review, or a survey response, prior test result or some click-path analysis), through to multiple sources of evidence all pointing the same way - not just one or two data points, but a comprehensive case for change.  You might even have enough evidence to make a go-do recommendation (and remember, it's a successful outcome if your evidence is strong enough to prompt the business to make a change without testing).

Monday, 16 May 2011

Web Analytics: Experimenting to Test a Hypothesis

Experimenting to Test a Hypothesis

After my previous post on reporting, analysing, forecasting and testing, I thought I'd look in more detail at testing.  Not the how-to-do-it, although I'll probably cover that in a later post, but how to take a test and a set of test results and use them to drive recommendations for action.  The action might be 'do this to improve results' or it might be 'test this next'.
As I've mentioned before, I have a scientific background, so I have a strong desire to do tests scientifically, logically and in an ordered way.  This is how science develops - with repeatable tests that drive theories, hypotheses and understanding.  However, in science (by which I mean physics, chemistry and biology), most of  the experiments are with quantitative measurements, while in an online environment (on a website, for example), most of the variables are qualitative.  This may make it harder to develop theories and conclusions, but it's not impossible - it just requires more thought before the testing begins!

Quantitative Data

Quantitative data is data that comes in quantities - 100 grams, 30 centimetres, 25 degrees Celsius, 23 seconds, 100 visitors, 23 page views, and so on.  Qualitative data is data that describes the quality of a variable - what colour is it, what shape is it, is it a picture of a man or a woman, is the text in capitals, is the text bold?  Qualitative data is usually described with words, instead of numbers.  This doesn't make the tests any less scientific (by which I mean testable and repeatable); it just means that interpreting the data and developing theories and conclusions is a little trickier.

For example, experiments with a simple pendulum will produce a series of results.  Varying the length of the pendulum string leads to a change in the time it takes to complete a full swing.  One conclusion from this test would be:  "As the string gets longer, the pendulum takes longer to complete each swing."  And a hypothesis would add, "Because the pendulum has to travel further per swing."

Online, however, the variables we test are more likely to be qualitative.  In my previous post, I explained how my test results were as follows:

Red Triangle  = 500 points per day
Green Circle  = 300 points per day
Blue Square = 200 points per day

There's no trending possible here - circles don't have a quantity connected to them, nor a measurable quantity that can be compared to squares or triangles.  This doesn't mean that they can't be compared - they certainly can.  As I said, though, they do need to be compared with care!  In my test, I've combined two qualitative variables - colour and shape - and this has clouded the results completely and made it very difficult to draw any useful conclusions.  I need to be more methodical in my tests, and start to isolate one of the variables (either shape or colour) to determine which combination is better.  Then I can develop a hypothesis - why is this better than that - and move from testing to optimising and improving performance.
Producing a table of the results from the online experiments shows the gaps that need to be filled by testing - it's possible that not all the gaps will need to be filled in, but certainly more of them do!

Numerical results are points per day


SHAPE \ COLOUR    Red     Green    Blue    Yellow
Triangle          500       -        -       -
Circle             -       300       -       -
Square             -        -       200      -

Now there's currently no trend, but by carrying out tests to fill in some of the gaps, it becomes possible to identify trends, and then theories.

Numerical results are points per day


SHAPE \ COLOUR    Red     Green    Blue    Yellow
Triangle          500       -       399      -
Circle            409      300       -      553
Square             -       204      200      -


Having carried out four further tests, it now becomes possible to draw the following conclusions:

1.  Triangle is the best shape for Red and Blue, and based on the results it appears that Triangle is better than Circle is better than Square.
2.  For the colours, it looks as if Red and Yellow are the best.
3.  The results show that for Circle, Yellow did better than Red and Green, and further testing with Yellow triangles is recommended.

I know this is extremely over-simplified, but it demonstrates how results and theories can be obtained from qualitative testing.  Put another way, it is possible to compare apples and oranges, providing you test them in a logical and ordered way.  The trickier bit comes from developing theories as to why the results are the way they are.  For example, do Triangles do better because visitors like the pointed shape?  Does it match with the website's general branding?  Why does the Square perform worse than the other shapes?  Does its shape fit into the page too comfortably and not stand out?  You'll have to translate this into the language of your website, and again, this translation into real life will be trickier too.  You'll really need to take care to make sure that your tests are aiming to fill gaps in the results table, rather than just being random guesses.  Better still, look at the results and target the areas which are most likely to give improvements.
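As a small sketch of that gap-filling approach (using the points-per-day figures from the tables above, and a naive average of the tested cells as the "where to look next" heuristic), you could hold the results in a grid keyed by shape and colour and list the untested combinations:

```python
# The points-per-day results from the tables above, keyed by (shape, colour).
from itertools import product

results = {
    ("Triangle", "Red"): 500, ("Triangle", "Blue"): 399,
    ("Circle", "Red"): 409, ("Circle", "Green"): 300, ("Circle", "Yellow"): 553,
    ("Square", "Green"): 204, ("Square", "Blue"): 200,
}
shapes = ["Triangle", "Circle", "Square"]
colours = ["Red", "Green", "Blue", "Yellow"]

# Gaps in the grid are the candidate tests still to run (the Yellow Triangle among them).
untested = [cell for cell in product(shapes, colours) if cell not in results]
print("Untested combinations:", untested)

# A naive average of the tested cells hints at which shapes and colours look strongest.
for shape in shapes:
    scores = [v for (s, _), v in results.items() if s == shape]
    print(f"{shape}: average {sum(scores) / len(scores):.0f} points per day")
for colour in colours:
    scores = [v for (_, c), v in results.items() if c == colour]
    print(f"{colour}: average {sum(scores) / len(scores):.0f} points per day")
```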

It's hard, for sure:  with quantitative data, if the results show that making the pendulum longer increases the time it takes for one swing, then yes, making the pendulum even longer will make the time for one swing even longer too.  However, changing from Green to Red might increase the results by 100 points per day, but that doesn't lead to any immediate recommendation, unless you include, "Make it more red."  

If you started with a hypothesis, "Colours that contrast with our general background colours will do better" and your results support this, then yes, an even more clashing colour might do even better, and that's an avenue for further testing.  This is where testing becomes optimising - not just 'what were the results?', but 'what do the results tell us about what was best, and how can we improve even further?'.

In my next posts in this series, I go on to write about how long to run a test for and explain statistical significance, confidence and when to call a test winner.