Tuesday, 31 May 2011

Web Analytics: What makes testing iterative?

What makes testing iterative?


When I was about eight or nine years old, my dad began to teach me the fundamentals of BASIC programming. He'd been on a course, and learned the basics, and I was eager to learn - especially how to write games. One of the first programs he demonstrated was a simple game called Go-Karts. The screen loads up: "You are on a go-kart and the steering isn't working. You must find the right letter to operate the brakes before you crash. You have five goes. Enter letter?" You then enter a letter, and the program tells you whether your guess is correct, or whether the right letter comes before or after the one you've entered.

"J"
"After J"
"P"
"Before P"
"L"

"L is correct - you have stopped your go-kart without crashing! Another game (Y/N)?"

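For anyone who wants to try it, here is a minimal Python sketch of how a game like Go-Karts might work. It is not my dad's original BASIC listing: the function names, the handling of invalid input and the "You crashed!" message are assumptions made for illustration.

```python
import random
import string


def check_guess(guess, answer):
    """Return the game's reply to a single-letter guess."""
    guess, answer = guess.upper(), answer.upper()
    if guess == answer:
        return "correct"
    # "After J" means the right letter comes later in the alphabet than J.
    return f"After {guess}" if answer > guess else f"Before {guess}"


def play(answer=None, goes=5, ask=input):
    """Play one round; return True if the go-kart stops in time."""
    answer = (answer or random.choice(string.ascii_uppercase)).upper()
    for _ in range(goes):
        guess = ask("Enter letter? ").strip().upper()[:1]
        if not guess.isalpha():
            # Assumption: an invalid entry still uses up one of the goes.
            print("Please enter a single letter.")
            continue
        reply = check_guess(guess, answer)
        if reply == "correct":
            print(f"{guess} is correct - you have stopped your go-kart without crashing!")
            return True
        print(reply)
    print(f"You crashed! The letter was {answer}.")
    return False
```

Calling `play()` with no arguments picks a random letter and prompts at the keyboard. Each reply narrows the range of possible letters, which is the property that makes the game a useful picture of iterative testing.
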
I was reminded of this simple game during the Omniture Summit EMEA 2011 last week, when one of the breakout presenters started talking about testing, and in particular ITERATIVE testing. Iterative testing should be the natural development from testing, and I've alluded to it in my previous posts about testing. At its simplest, basic testing involves just comparing one version of a page, creative, banner, text or call-to-action (or whatever) against another and seeing which one works best. Iterative testing is similar but more advanced, and works much like my dad's Go-Karts game: start with an answer which is close to the best, then use what that answer tells you to develop something better still. I've talked about coloured shapes as simplified versions of page elements in my previous posts on testing, so I guess it's time to develop a different example!

Suppose I've tested the five following page headlines, and achieved the following points scores (per day), running each one for a week, so that the total test lasted five weeks.

"Cheap hi-quality widgets on sale now" - 135 points
"Discounted quality widgets available now" - 180 points
"Cheap widgets reduced now" - 110 points
"Advanced widgets available now" - 210 points
 "Exclusive advanced widgets on sale now" - 200 points

What would you test next?

This question is the kind of open question which will help you to see if you're doing iterative testing, or just basic testing. What can we learn from the five tests that we've run so far? Anything? Or nothing? Do we have to make another random guess, or can we use these results to guide us towards something that should do well?

Looking at the results from these preliminary tests, the best headline, "Advanced widgets available now" scored almost twice as many points per day as "Cheap widgets reduced now".  At the very worst, we should run with this high-performing headline, which is doing marginally better than the most recent attempt, "Exclusive advanced widgets on sale now."  This shouldn't pose a problem for a web development team - after all, the creative has already been designed and just needs to be re-published.  All that's needed is to admit that the latest version isn't as good as an earlier idea, and to go backwards in order to go forwards.

Anyway:  we can see that "Advanced..." got the best score, and is the best place to start from.  We can also see that the two lowest performing headlines include the word "Cheap" so this looks like a word to avoid.  From this, it looks like "Advanced widgets on sale now" and "Exclusive advanced widgets available now" are places to start from - we've eliminated the word 'cheap' and now we can look at how 'available now' compares to 'on sale now'.  This is the time for developing test variations on these ideas - following the general principles that have been established by the first round of testing. This is not the time for trying a whole set of new ideas; this would mean ignoring all the potential learning and starting to make sub-optimal decisions (as they're sometimes known).
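One rough way to check which words are associated with high or low scores is to average the points of every headline that contains each word. The sketch below uses the five results quoted above; the function name is hypothetical, and with only five headlines the averages are suggestive rather than conclusive.

```python
from collections import defaultdict
from statistics import mean

# Points per day for each headline, as listed in the test results above.
results = {
    "Cheap hi-quality widgets on sale now": 135,
    "Discounted quality widgets available now": 180,
    "Cheap widgets reduced now": 110,
    "Advanced widgets available now": 210,
    "Exclusive advanced widgets on sale now": 200,
}


def average_score_by_word(results):
    """Average the score of all headlines that contain each word."""
    scores = defaultdict(list)
    for headline, points in results.items():
        # set() so a word repeated within one headline is counted once.
        for word in set(headline.lower().split()):
            scores[word].append(points)
    return {word: mean(points) for word, points in scores.items()}


averages = average_score_by_word(results)
for word, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{word:12s} {avg:6.1f}")
```

Headlines containing "cheap" average 122.5 points per day, while those containing "advanced" average 205, which matches the reading of the results above. Words that appear in every headline (such as "widgets" and "now") carry no information.
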

Referring back to my earlier post, this is the time in the process for making hypotheses on your data.  I have to disagree with the speaker at the Omniture EMEA Summit, when she gave an example hypothesis as, "We believe that changing the page headline will drive more engagement with the site, and therefore better conversion."  This is just a theory.  A hypothesis says all that, and then adds, "because visitors read the page headline first when they see a page, and use that as a primary influencer to decide if the page meets their needs."

So, here's a hypothesis on the data:  "Including the word 'cheap' in our headline puts visitors off because they're after premium products, not inexpensive ones.  We need to focus on premium-type words because these are more attractive to our visitors."  In fact - as you can see, I've even added a recommendation after my hypothesis (I couldn't resist).

And that's the foundation of iterative testing - using what's gone before and refining and improving it. Yes, it's possible that a later iteration might throw up results that are completely unexpected - and worse than before - but then that's the time to improve and refine your hypothesis. Interestingly, the shallower hypothesis, "We believe that changing the page headline will drive more engagement with the site, and therefore better conversion," will still hold true, as it isn't specific enough to be disproved.

Anyway, that's enough on iterative testing for now; I'm off to go and play my dad's second iteration of the Go-Karts game, which went something like, "You are floating down a river, and the crocodiles are gathering.  You must guess how many crocodiles there are in order to escape.  How many (1-50)?"


Tuesday, 24 May 2011

Web Analytics: Who determines an actionable insight?

Who determines what an actionable insight is? 
I ask the question, because I carry out a range of analysis in my current role, analysing click paths and click streams; conversion rates and  attrition; segmenting and forecasting, with the aim of producing actionable insights for the developers in my team, so that we can work  towards improving our website.  But what makes an insight actionable?  I've discovered that it's not just crunching the numbers, asking the right questions and segmenting the  data until you've found something useful, and based a recommendation on it - the recommendation is usually a sentence or two in English, with a few numbers to support it.  

However, even a recommendation of this sort may not become actionable.  

For example, you might recommend  changing a call to action to include the words 'bonus', 'exclusive' or somesuch.  You might have carried out your testing and determined that  the call to action needs to be a red triangle or a green circle.  Unfortunately, if the main focus of the sales and marketing teams is not to sell  green circles, and there's a cross-channel push to sell blue squares, then you'll have to optimise your own work to determine how best to sell  blue squares.  

Sometimes, actionable insights have to include a wider view of the business you're in. In a situation where you're recommending how to achieve a goal, and the goalposts have moved, then the position of the new goalposts has to become a factor in your analysis. It's true that your proposed course of action would score goals in the old goal, but if the target has moved, then you need to readjust. Use your existing analysis to help you - don't throw it away. For example, you might do some keyword analysis, and find that 'budget shapes' converts at a better rate than 'cheap triangles' and 'coloured shapes'. So, using conversion rates (and, if you can get them, costs per click and so on) you write a recommendation that says, "The conversion rate for 'budget shapes' is much better than 'cheap triangles' and I can confirm this with statistical confidence, and I therefore propose that we change our spending accordingly." However, if paid search isn't on your marketing team's list of priorities (or they've already reached target for shapes sales this year) because they're focusing on the next display campaign, then you'll need to readjust. Take account of the learning you've made - keep a note of it - and in particular how you reached your recommendation so that you can use the tools again next time, and move to the next target.
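As an illustration of what "confirming with statistical confidence" might involve, here is a minimal sketch of a two-proportion z-test on keyword conversion rates. The click and conversion counts are invented for the example, the function name is hypothetical, and a z-test is just one reasonable choice of method.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a, visits_a, conv_b, visits_b):
    """Compare two conversion rates; return (z, two-sided p-value)."""
    rate_a = conv_a / visits_a
    rate_b = conv_b / visits_b
    # Pooled rate under the null hypothesis that both keywords convert equally.
    pooled = (conv_a + conv_b) / (visits_a + visits_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b))
    z = (rate_a - rate_b) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the normal distribution
    return z, p_value


# Hypothetical figures: 'budget shapes' vs 'cheap triangles'.
z, p = two_proportion_z_test(conv_a=120, visits_a=2000, conv_b=80, visits_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference in conversion rate is unlikely to be down to chance alone. Whether it is worth acting on still depends on the business priorities discussed above.
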

On the other hand, you might be presented with a request to analyse a particular campaign.  Perhaps the marketing team want to understand  how their display campaign is performing, or the web content team want to know which shape to promote on the home page.  This is your  opportunity to go out and hit an actionable insight.  It helps, in these terms, to know what's possible - what can be changed in the campaign, or  on the home page, or wherever.  If the promotions team has decided that they want to sell green triangles, then work within those constraints.  If the message on the home page needs to say, "Exclusive shapes for sale here," then make sure this is included in your recommendations.  It  might not be the optimal solution - there may be better options available, and certainly include these in your recommendations - but if it's better than the present version of the site, then it's certainly a valid recommendation, and an actionable one too!  

It's rare that a colleague will come to you with a blank slate and ask what the data shows is the best answer; he or she is more likely to ask for  your input into a decision that's already being made, but in any case, do your best to show what's possible and what's better.  By working  within the constraints that you're set, and with your colleague's agenda already in place, you're much more likely to achieve an actionable  insight that will actually result in action being taken.  This leads to a positive result for you, and for your colleague.  

I liken the situation to batting in a game of cricket. Sometimes, a batsman will get to take a large stride towards the ball, and play off the front foot in an expansive style, hitting out and scoring big runs. Given a clear definition of the area for research on a website, and the ability to test ideas, make larger changes and follow the data where it leads, it's possible to really hit some big wins.

At other times, the batsman has to stand up straight, bat in front of the body, and play off the back foot - in a more defensive way, still hitting the ball but working to the bowler's agenda, and almost having the ball hit the bat, rather than the other way round. Asked how much traffic a website has had in a given week, day or month, there are few ways of responding to the question without giving the short, direct answer. It's still possible to play big, expansive strokes off the back foot - the big, bat-swinging strokes that score big runs, when the batsman adjusts his agenda to the bowler's, and reacts in the most positive way possible. It's not always possible, and the defensive shots are often easier to make. In other ways, it often comes down to Mark Twain's remark that, "Most people use statistics the way a drunk uses a lamp post; more for support than illumination."

Monday, 16 May 2011

Web Analytics: Experimenting to Test a Hypothesis

Experimenting to Test a Hypothesis

After my previous post on reporting, analysing, forecasting and testing, I thought I'd look in more detail at testing.  Not the how-to-do-it, although I'll probably cover that in a later post, but how to take a test and a set of test results and use them to drive recommendations for action.  The action might be 'do this to improve results' or it might be 'test this next'.
As I've mentioned before, I have a scientific background, so I have a strong desire to do tests scientifically, logically and in an ordered way.  This is how science develops - with repeatable tests that drive theories, hypotheses and understanding.  However, in science (by which I mean physics, chemistry and biology), most of  the experiments are with quantitative measurements, while in an online environment (on a website, for example), most of the variables are qualitative.  This may make it harder to develop theories and conclusions, but it's not impossible - it just requires more thought before the testing begins!

Quantitative data is data that comes in quantities - 100 grams, 30 centimetres, 25 degrees Celsius, 23 seconds, 100 visitors, 23 page views, and so on.  Qualitative data is data that describes the quality of a variable - what colour is it, what shape is it, is it a picture of a man or a woman, is the text in capitals, is the text bold?  Qualitative data is usually described with words, instead of numbers.  This doesn't make the tests any less scientific (by which I mean testable and repeatable) it just means that interpreting the data and developing theories and conclusions is a little trickier.

For example, experiments with a simple pendulum will produce a series of results. Varying the length of the pendulum string leads to a change in the time it takes to complete a full swing. One conclusion from this test would be: "As the string gets longer, the pendulum takes longer to complete a swing." And a hypothesis would add, "Because the pendulum has to travel further per swing."
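To show what a quantitative trend looks like, the sketch below uses the standard small-angle formula for a simple pendulum, T = 2π√(L/g). The formula is textbook physics rather than something derived in this post, and the string lengths are arbitrary examples.

```python
from math import pi, sqrt

G = 9.81  # gravitational acceleration in m/s^2 (assumed standard value)


def pendulum_period(length_m, g=G):
    """Period in seconds of a simple pendulum (small-angle approximation)."""
    return 2 * pi * sqrt(length_m / g)


for length in (0.25, 0.5, 1.0, 2.0):
    print(f"{length:4.2f} m -> {pendulum_period(length):.2f} s")
```

Because the variable is a number, the trend can be extended: a string four times as long gives a period twice as long. That kind of extrapolation is what qualitative variables such as colour and shape don't offer.
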

Online, however, the variables being tested are more likely to be qualitative. In my previous post, I explained how my test results were as follows:

Red Triangle  = 500 points per day
Green Circle  = 300 points per day
Blue Square = 200 points per day

There's no trending possible here - circles don't have a quantity connected to them, nor a measurable quantity that can be compared to squares or triangles. This doesn't mean that they can't be compared - they certainly can. As I said, though, they do need to be compared with care! In my test, I've combined two qualitative variables - colour and shape - and this has clouded the results completely and made it very difficult to draw any useful conclusions. I need to be more methodical in my tests, and start to isolate one of the variables (either shape or colour) to determine which combination is better. Then I can develop a hypothesis (why is this better than that?) and move from testing to optimising and improving performance.
Producing a table of the results from the online experiments shows the gaps that need to be filled by testing - it's possible that not all the gaps will need to be filled in, but certainly more of them do!

Numerical results are points per day (- = not yet tested)

SHAPE \ COLOUR   Red   Green   Blue   Yellow
Triangle         500   -       -      -
Circle           -     300     -      -
Square           -     -       200    -

Now there's currently no trend, but by carrying out tests to fill in some of the gaps, it becomes possible to identify trends, and then theories.

Numerical results are points per day (- = not yet tested)

SHAPE \ COLOUR   Red   Green   Blue   Yellow
Triangle         500   -       399    -
Circle           409   300     -      553
Square           -     204     200    -


Having carried out four further tests, it now becomes possible to draw the following conclusions:

1.  Triangle is the best shape for Red and Blue, and based on the results it appears that Triangle is better than Circle is better than Square.
2.  For the colours, it looks as if Red and Yellow are the best.
3.  The results show that for Circle, Yellow did better than Red and Green, and further testing with Yellow triangles is recommended.

I know this is extremely over-simplified, but it demonstrates how results and theories can be obtained from qualitative testing. Put another way, it is possible to compare apples and oranges, providing you test them in a logical and ordered way. The trickier bit comes from developing theories as to why the results are the way they are. For example, do Triangles do better because visitors like the pointed shape? Does it match with the website's general branding? Why does the square perform lower than the other shapes? Does its shape fit in to the page too comfortably and not stand out? You'll have to translate this into the language of your website, and again, this translation into real life will be trickier too. You'll really need to take care to make sure that your tests are aiming to fill gaps in results tables, instead of just being random guesses. Better still, look at the results and look at the likely areas which will give improvements.
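The gap-filling idea can be sketched in a few lines of Python. The scores are the seven results discussed above, but the exact cell each one belongs to is my reading of the results table (in particular, placing 204 under Green Square is an assumption), and the function names are hypothetical.

```python
from statistics import mean

SHAPES = ["Triangle", "Circle", "Square"]
COLOURS = ["Red", "Green", "Blue", "Yellow"]

# (shape, colour) -> points per day; cell positions are assumed as noted above.
results = {
    ("Triangle", "Red"): 500,
    ("Triangle", "Blue"): 399,
    ("Circle", "Red"): 409,
    ("Circle", "Green"): 300,
    ("Circle", "Yellow"): 553,
    ("Square", "Green"): 204,
    ("Square", "Blue"): 200,
}


def untested(results):
    """List the shape/colour combinations that have no result yet."""
    return [(s, c) for s in SHAPES for c in COLOURS if (s, c) not in results]


def average_by(results, position):
    """Average score per shape (position=0) or per colour (position=1)."""
    groups = {}
    for key, points in results.items():
        groups.setdefault(key[position], []).append(points)
    return {name: mean(values) for name, values in groups.items()}


print("Gaps:", untested(results))
print("By shape:", average_by(results, 0))
print("By colour:", average_by(results, 1))
```

Averaging across an incomplete grid is crude, because each shape has been tested with a different set of colours. It is still enough to rank the gaps: Yellow Triangle combines the best-scoring colour with the best-scoring shape, so it is the obvious next test.
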

It's hard, for sure:  with quantitative data, if the results show that making the pendulum longer increases the time it takes for one swing, then yes, making the pendulum even longer will make the time for one swing even longer too.  However, changing from Green to Red might increase the results by 100 points per day, but that doesn't lead to any immediate recommendation, unless you include, "Make it more red."  

If you started with a hypothesis, "Colours that contrast with our general background colours will do better" and your results support this, then yes, an even more clashing colour might do even better, and that's an avenue for further testing.  This is where testing becomes optimising - not just 'what were the results?', but 'what do the results tell us about what was best, and how can we improve even further?'.

Sunday, 15 May 2011

Physics: The Sound Barrier and Sonic Booms

After sobering up from his drunken walk home, Isaac Newton went to see his friend, Mr Science.  However, as Isaac went along, he noticed that the roadway to Mr Science's house was very busy; Mr Science lived in the middle of town, and it was market day, and Isaac found that there were large crowds of people milling around in front of him.  Still, walking was definitely the quickest way to see his friend, as although gravity had been invented, cars were still some way off in the future.


Isaac Newton was quite keen to get to Mr Science's house to discuss his adventures with apples, including his failed attempt to launch one into space, and started jogging and jostling through the crowd, shouting at people to move out of the way, instead of just meandering through it. He bumped into people more frequently as he did so, but kept on jogging undeterred, and found that the faster he jogged, the more people he bumped into; in the end, he put his arms out in front of him like a wedge and started pushing his way through with more effort. This continued until he found that crowds of people were gathering together in front of him; despite his shouting at them, they were barely able to get out of his way before he started ploughing into them. Finally, he realised he was going so fast that the people, all bunched up in front of him and desperate to get out of his way, were unable to stand aside, and he sent the crowds tumbling left and right in front of him.


When he arrived at Mr Science's house, he recounted the strange behaviour of the crowd and the various stages he'd encountered. "That's interesting," commented Mr Science, "That reminds me of an experiment I've just been running."


Isaac's journey through the crowd is very similar to that of an aircraft (or a car) as it travels at speeds close to the speed of sound. The atmosphere is made up of gas particles which travel around, silently bouncing off each other and generally behaving randomly (in the real sense of the word), at speeds which are comparable to - and on average somewhat higher than - the speed of sound. Particles in a gas are extremely small, and by comparison the gaps between them are relatively large, so if you move a large, solid object between them (or, for example, start walking through them) then you're able to push them aside and move through the gas.


Walking at low speeds, you're not likely to notice this effect, but at higher speeds, for example running, you'll feel the air as it rushes past your face. Cycling through the air, you'll feel this more strongly, and as you increase your speed, you'll begin to feel the effort of pushing through the air - it'll feel as if there's a wind blowing into you, pushing you back. This is known as 'air resistance' and it increases as your speed increases. You're pushing more and more air particles aside, as you cut through the air, and this takes more effort. At these speeds, it becomes more and more important to get into an aerodynamic position - as low down as possible, elbows tucked in, and so on, to cut through the air as economically and as easily as possible. In Isaac Newton's case, he put his arms out in front of him like a wedge, so that he could push through the crowds of people as easily as possible.
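To put rough numbers on how air resistance grows with speed, here is a small sketch using the standard drag equation, F = ½ × air density × drag area × speed². The equation is textbook physics rather than something stated in this post, and the air density and drag area values are illustrative assumptions for an upright cyclist.

```python
AIR_DENSITY = 1.2  # kg/m^3, roughly sea-level air (assumed)


def drag_force(speed_ms, drag_area_m2=0.5, air_density=AIR_DENSITY):
    """Air resistance in newtons; drag_area_m2 is drag coefficient x frontal area."""
    return 0.5 * air_density * drag_area_m2 * speed_ms ** 2


for speed in (1.5, 5, 10, 20):  # walking, running, cycling, fast cycling (m/s)
    print(f"{speed:4.1f} m/s -> {drag_force(speed):6.1f} N")
```

Doubling the speed quadruples the force, which is why tucking in (reducing the drag area) matters so much more on a fast bike than on a walk.
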


[Figure: Shock waves form when an object is at the speed of sound]
Now, the gas particles in the atmosphere are bouncing around, flying around at speeds comparable to (on average a little above) the speed of sound, which is 330 metres per second (or thereabouts). In an aircraft, it's possible to approach and exceed the speed of sound, but in order to do so, the aircraft has to push through the air particles as if they were a crowd. At walking and cycling speeds, the air particles can easily move aside as you push through them, but at speeds close to the speed of sound, the particles are unable to get out of the way of an aircraft in time. The aircraft has to shove the particles aside - this becomes very difficult at speeds close to the speed of sound - and break through the sound barrier.

The air particles start to bunch up in front of the nose of the aircraft until eventually (if it continues to accelerate) they are pushed aside in a huge compression wave.  All these particles pushed together in one go produce a loud noise - a sonic boom - as the aircraft exceeds the speed of sound and goes supersonic.  
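The ratio of an object's speed to the speed of sound is known as its Mach number. A minimal sketch, taking the speed of sound as the 330 metres per second quoted above (function names are hypothetical):

```python
SPEED_OF_SOUND = 330.0  # metres per second, the approximate figure used above


def mach_number(speed_ms, speed_of_sound=SPEED_OF_SOUND):
    """Speed expressed as a multiple of the speed of sound."""
    return speed_ms / speed_of_sound


def regime(speed_ms, speed_of_sound=SPEED_OF_SOUND):
    """Classify a speed as subsonic, sonic or supersonic."""
    mach = mach_number(speed_ms, speed_of_sound)
    if mach < 1:
        return "subsonic"
    return "sonic" if mach == 1 else "supersonic"


for speed in (10, 250, 330, 660):
    print(f"{speed:4d} m/s -> Mach {mach_number(speed):.2f} ({regime(speed)})")
```
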

[Figure: Object creates a sonic boom]

This sonic boom continues to travel along the ground and will be heard along the line of the aircraft's flight path - it isn't produced just once and then stops. Mr Science tried to explain all this to Isaac, but Isaac was extremely pleased with having discovered gravity, and wasn't in the mood to discuss ways of beating it in huge flying machines, let alone ones that could travel faster than sound. "Maybe some other time," he explained to his friend, "When I've finished with the apples."

Friday, 6 May 2011

Web Analytics: Reporting, Analysing, Testing and Forecasting


Reporting, Forecasting, Testing and Analysing


As a website analyst, my working role means that I will take a large amount of online data, sift it, understand it, sift it again and search for the underlying patterns, changes, trends and reasons for those changes. It's a bit like panning for gold. In the same way as there's a small amount of gold in plenty of river, there's plenty of data to look at; it's just a case of finding what's relevant, useful and important, and then telling people about it. Waving it under people's noses; e-mailing it round; printing it out and so on – that’s reporting.

However, if all I do with the data I have is report it, then all I'm probably doing is something similar to reporting the weather for yesterday.  I can make many, many different measurements about the weather for yesterday, using various instruments, and then report the results of my measurements.  I can report the maximum temperature; the minimum temperature; the amount of cloud coverage in the sky; the rainfall; the wind speed and direction and the sunrise and sunset times.  But, will any of this help me to answer the question, "What will the weather be like tomorrow?" or is it just data?  Perhaps I'll look at the weather the day before, and the day before that.  Are there trends in any of the data?  Is the temperature rising?  Is the cloud cover decreasing?  In this way, I might be able to spot trends or patterns in the data that would lead me to conclude that yes, tomorrow is likely to be warmer and clearer than yesterday.  Already my reporting is moving towards something more useful, namely forecasting.
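As a sketch of the step from reporting to forecasting, the code below fits a straight line through a few days of measurements and extends it by one day. The temperatures are invented, the function name is hypothetical, and a straight-line fit is only the simplest possible forecasting method.

```python
def linear_forecast(values, steps_ahead=1):
    """Fit a least-squares line through equally spaced values and extrapolate."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # The last observation sits at x = n - 1, so tomorrow is x = n.
    return intercept + slope * (n - 1 + steps_ahead)


# Hypothetical maximum temperatures (degrees Celsius) for the last five days.
max_temps = [14.0, 15.5, 15.0, 17.0, 17.5]
print(f"Forecast for tomorrow: {linear_forecast(max_temps):.1f} degrees")
```

The same function works on daily visits or sales, subject to the caveat in this post: it only holds while the inputs stay broadly the same.
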

The key difference between the weather and online data, hopefully, is that when we come to analyse business data (marketing data or web data), I'm in a position to change the inputs of today’s data.  I can't do much today with my measurements to influence tomorrow's weather, but online I can change my website’s content, text, layout or whatever, and hopefully make some changes to my online performance.  No amount of measuring or reporting is going to change anything – not the weather, not sales performance.  Only changes to the site will lead to changes to the results.  Then, I can not only forecast tomorrow's online performance, but also make changes to try to improve it.

No change means that there's no way to determine what works and what doesn't.  I've been asked to try and determine, "What does good look like?" but unless I make some guesses at what good might be, and test them out on the website, I'll never know.  What I should be able to do, though, is forecast what future performance will look like - this is the advantage of having a website that doesn't change much.  Providing most of the external factors (for example traffic sources, marketing spend, product pricing) stay the same, I should be able to forecast what performance will be like next week.  Unfortunately, the external factors rarely stay the same, which will make forecasting tricky - but it'll be easier than forecasting performance for a changing website!

Consider the following situation:  here's my online promotion, and I've simplified it (I've removed the text, and really simplified it) and I've reduced it to a colour and a shape.  So I launch my campaign with Red Triangle, and measurements show that it is worth 500 points per day (I'm not discussing whether that's clicks, sales, quotes, telephone calls or what - it's a success metric and I've scored it 500 points per day).



[Figure: Red Triangle - 500 points per day]

If I make no changes to the promotion, then I'll keep using Red Triangle, and theoretically it'll keep scoring 500 points each day. However, I might change it to something else; for example, I might test Green Circle.

[Figure: Green Circle - 300 points per day]

Now, Green Circle scores 300 points per day, over a week. Is that good? Well, Red Triangle scored 500 points per day, so you might think it'd be worth changing it back. There's a barrier here, in that if I do change it back to Red Triangle, I have to admit that I made a mistake, and that my ideas weren't as good as I thought they were. Perhaps I'll decide that I can't face going back to Red Triangle, and I'll try Blue Square instead.

[Figure: Blue Square - 200 points per day]

But what if Blue Square scores only 200 points each day?  Do I keep running it until I'm sure it's not as good, or do I carry out a test of statistical significance?  Perhaps it'll recover?  One thing is for sure; I know what good looks like (it's a Red Triangle at the moment) but I'll have to admit that my two subsequent versions weren't as good; this is a real mental shift - after all, doesn't optimising something mean making it better?  No, it's not scientific and I should probably start testing Red Circles and Green Triangles, but based on the results I've actually obtained, Red Triangle is the best. 

Maybe I shouldn't have done any testing at all. After all, Green Circle cost me 200 points per day, and Blue Square cost me 300 points per day. And I've had to spend time developing the creative and the text - twice.

Now, I firmly believe that testing is valuable in and of itself. I’m a scientist, with a strong scientific background, and I know how important testing has been, and will continue to be, to the development of science. However, one of the major benefits of online marketing and sales is that it's comparatively easy to swap and change - to carry out tests and to learn quickly. It’s not like changing hundreds of advertising posters at bus stops up and down the country – it’s simply a case of publishing new content on the site. Even sequential tests (instead of A/B tests) like my example above with the coloured shapes, will provide learning. What's imperative, though, is that the learning is carried forwards. Having discovered that Red Triangle is the best of the three shapes tried so far, I would not start the next campaign with a variation of Blue Square. Learning must be remembered, not forgotten.

Having carried out tests like this, it becomes possible to analyse the results.  I’ve done the measuring and reporting, and it looks like this:  Red Triangle = 500 pts/day, Green Circle = 300 pts/day, Blue Square = 200 pts/day.

Analysing the data is the next step.  In this case, there genuinely isn’t much data to analyse, so I would recommend more testing.  I would certainly recommend against Green Circle and Blue Square, and would propose testing Yellow Triangle instead, to see if it’s possible to improve on Red Triangle’s performance.  It all sounds so easy, and I know it isn’t, especially when there’s a development cycle to build Yellow Triangle, when Green Circle is already on the shelf, and Blue Square is already up and running.  However, that’s my role – to review and analyse the data and recommend action.  There are occasions when there are other practical reasons for not following the data, and flexibility is key here.

In fact, for me, I’m always looking at what to do on the website next – the nugget of gold which is often a single sentence that says, “This worked better than that, therefore I recommend this…” or “I recommend doing this, because it provided an uplift of 100 points per day.”  That’s always the aim, and the challenge, when I’m analysing data.  Otherwise, why analyse?  My role isn’t to report yesterday’s weather.  At the very least, I’m looking to provide a forecast for tomorrow’s weather, and ideally, I’d be able to recommend if an umbrella will be needed tomorrow afternoon, or sun tan lotion.  Beyond that, I’d also like to be able to suggest where to seed the clouds to make it rain, too!





Tuesday, 3 May 2011

Web Analytics: Pages with Zero Traffic

HOW TO TRACK PAGES THAT GET NO TRAFFIC


In this post, I'm wandering from my usual leisure-time subjects to one that's come up at work, and on some web analytics forums:  how can you tell which pages on a website aren't getting any traffic?


It's an interesting question - how can you tell if a page has zero page views - i.e. no traffic.  We're always interested in the pages that generate the most traffic on our sites; the ones that are our superstars, getting the most visitors and attracting the most attention.  However, the flip side of this is that there may be some pages on our sites that have no traffic at all, and are just taking up space, maintenance time and so on, for no benefit at all.


The issue is that all our analytics tools work when our pages are viewed - when visitors load up our pages, point to our links and visit our site, so identifying the zero-traffic pages is not an easy task, and can't be done directly. Instead, it must be done by a logical process, and my suggestion would be this. Firstly, identify any suspect pages, which you can tell by process of elimination - run a report that shows all the pages that have had visits, and then deduce which ones haven't. Or, alternatively, hit all (and that really means ALL - the better your spidering now, the better your results later) the pages on the site during a visit - go through your site and make sure that you visit every page at least once. How practical this is depends on the size of your site - and although I haven't checked, it might be possible to obtain an automated site-spidering tool that will go through your site, firing off the javascript tags on each page.


Once on each page is sufficient, to fire the tag.  If you're doing this manually, make sure you're not using a PC that has its IP address screened out by any filters you may have set up.


Having done this, go on to run a page report for all pages, for the date that covers your spidering session. Then use the calendar to compare it against any other time frame - in particular, the time frame that you are actually interested in looking for zero-traffic pages. Sort the pages by the number of page views they got during the time frame of interest, in ascending order. By doing this, you should see that all the pages that received visits on the test date, but haven't had any during the time frame of interest, come to the top of the list. Note from the quick mock-up below how the pages which had the most traffic in the first time frame come to the top of the list. By reversing the two time frames, it'd be possible to bring the one-day traffic to the top, and compare with the one-month time frame.






It's not pretty, but it should work, and has the advantage that you only have to visit all the pages once (and make a note of the date that you did your visit, for reference, so that you can run further tests in future, as necessary).
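If the two page reports can be exported, the comparison described above comes down to a simple process of elimination. Here is a minimal sketch, assuming the spidering session gives a list of page URLs and the time frame of interest gives a mapping of URL to page views. The data and function name are hypothetical and not tied to any particular analytics tool.

```python
def zero_traffic_pages(all_pages, page_views):
    """Return pages that exist on the site but had no views in the period.

    all_pages:  URLs known to exist (e.g. those hit during the spidering session).
    page_views: dict of URL -> page views in the time frame of interest.
    """
    return sorted(page for page in set(all_pages) if page_views.get(page, 0) == 0)


# Hypothetical data for illustration only.
spidered = ["/home", "/products", "/old-promo", "/contact", "/archive/2009"]
month_views = {"/home": 5120, "/products": 2310, "/contact": 87}

print(zero_traffic_pages(spidered, month_views))
```

As with the manual method, the result is only as good as the spidering: a page that was never visited during the test session won't appear in either list.
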


Please let me know if this works for you; I haven't tried it (!) but based on my experience it should work successfully.