Web Optimisation, Maths and Puzzles: analysis



Sunday, 24 November 2024

Testing versus Implementing - why not just switch it on?

"Why can't we just make a change and see what happens? Why do we have to build an A/B test - it takes too long!  We have a roadmap, a pipeline and a backlog, and we haven't got time."

It's not always easy to articulate why testing is important - especially if your company is making small, iterative, data-backed changes to the site and your tests consistently win (or, worse still, go flat).  The IT team is testing carefully and cautiously, but the time taken to build the test and run it is slowing down everybody's pipelines.  You work with the IT team to build the test (which takes time), it runs (which takes even more time), you analyse the test (why?) and you show that their good idea was indeed a good idea.  Who knew?


Ask an AI what a global IT roadmap looks like...

However, if your IT team is building and deploying something to your website - a new way of identifying a user's delivery address; or a new way of helping users decide which sparkplugs or ink cartridges or running shoes they need - something new, innovative and very different, then I would strongly recommend that you test it with them, even if there is strong evidence for its effectiveness.  Yes, they have carried out user-testing and it's done well.  Yes, their panel loved it.  Even the Head of Global Synergies liked it, and she's a tough one to impress.  Their top designers have spent months in collaboration with the project manager, and their developers have gone through the agile process so many times that they're as flexible as ballet dancers.  They've barely reached the deadline for pre-Christmas implementation, and now is the time to implement it.  It is ready.  However, the Global Integration Leader has said that they must test before they launch, but that's okay as they have allocated just enough time for a pre-launch A/B test, then they'll go live as soon as the test is complete.


Sarah Harries, Head of Global Synergies

Everything hinges on the test launching on time, which it does.  Everybody in the IT team is very excited to see how users engage with the new sparkplug selection tool and - more importantly for everybody else - how much it adds to overall revenue.  (For more on this, remember that clicks aren't really KPIs). 

But the test results come back: you have to report that the test recipe is underperforming, with a 6.3% drop in conversion.  Engagement looks healthy at 11.7%, but those engaged users are dragging down overall performance.  The page exit rate is lower, but fewer users are going through checkout and completing a purchase.  Even after two full weeks, the data is looking negative.

Can you really recommend implementing the new feature?  No; but that's not the end of the story.  It's your job to now unpick the data, and turn analysis into insights:  why didn't it win?!

The IT team, understandably, want to implement.  After all, they've spent months building this new selector and the pre-launch data was all positive.  The Head of Global Synergies is asking them why it isn't on the site yet.  Their timeline allowed three weeks for testing and you've spent three weeks testing.  Their unspoken assumption was that testing was a validation of the new design, not a step that might turn out to be a roadblock, and they had not anticipated any need for post-test changes.  It was challenging enough to fit in the test, and besides, the request was to test it.

It's time to interrogate the data.

Moreover, they have identified some positive data points:

*  Engagement is an impressive 11.7%.  Therefore, users love it.
*  The page exit rate is lower, so more people are moving forwards.  That's all that matters for this page:  get users to move forwards towards checkout.
*  The drop in conversion is coming from the pages in the checkout process.  That can't be related to the test, which is in the selector pages.  It must be a checkout problem.

They question the accuracy of the test data, which contradicts all their other data.

* The sample size is too small.
* The test was switched off before it had a chance to recover its 6.3% drop in conversion.

They suggest that the whole A/B testing methodology is inaccurate.

* A/B testing is outdated and unreliable.  
* The split between the two groups wasn't 50-50.  There are 2.2% more visitors in A than B.  (A quick way to sanity-check this objection is sketched below.)
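As an aside, the sample-ratio objection is the easiest of these to answer with numbers.  Here's a minimal sketch, using invented visitor counts, of how you might check whether an imbalance like that is anything more than random chance:

```python
# Quick sample-ratio check: is a ~2.2% imbalance consistent with a random 50-50 split?
# The visitor counts below are invented for illustration.
from scipy.stats import chisquare

visitors_a = 5_055    # hypothetical count for recipe A
visitors_b = 4_945    # hypothetical count for recipe B (about 2.2% fewer)

total = visitors_a + visitors_b
chi2, p_value = chisquare([visitors_a, visitors_b], f_exp=[total / 2, total / 2])

print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
# A very small p-value (say < 0.01) points to a genuine sample-ratio mismatch worth
# investigating; a larger one means the imbalance is consistent with random chance.
```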

Maybe they'll comment that the data wasn't analysed or segmented correctly, and they make some points about this:

* The test data includes users buying other items with their sparkplugs.  These should be filtered out.
* The test data must have included users who didn't see the test experience.
* The data shows that users who browsed on mobile phones only performed at -5.8% on conversion, so they're doing better than desktop users.

Remember:  none of this is personal.  You are, despite your best efforts, criticising a project that they've spent weeks or even months polishing and producing.  Nobody until this point has criticised their work; in fact, everybody has said how good it is.  It's not your fault: your job is to present the data and to provide insights based on it.  As a testing professional, your job is to run and analyse tests, not to be swayed into showing the data in a particular way.

They ran the test at the request of the Global Integration Leader, and burnt three weeks waiting for the test to complete.  The deadline for implementing the new sparkplug selector is Tuesday, and they can't stop the whole IT roadmap (which is dependent on this first deployment) just because one test showed some negative data.  They would have preferred not to test it at all, but it remains your responsibility to share the test data with other stakeholders in the business, marketing and merchandising teams, who have a vested interest in the site's financial performance.  It's not easy, but it's still part of your role to present the unbiased, impartial data that makes up your test analysis, along with the data-driven recommendations for improvements.

It's not your responsibility to make the go/no-go decision, but it is up to you to ensure that the relevant stakeholders and decision-makers have the full data set in front of them when they make the decision.  They may choose to implement the new feature anyway, taking into account that it will need to be fixed with follow-up changes and tweaks once it's gone live.  It's a healthy compromise, providing that they can pull two developers and a designer away from the next item on their roadmap to do retrospective fixes on the new selector.  
Alternatively, they may postpone the deployment and use your test data to address the conversion drops that you've shared.  How are the conversion drop and the engagement data connected?  Is the selector providing valid and accurate recommendations to users?  Does the data show that they enter their car colour and their driving style, but then go to the search function when they reach a question about their engine size?  Is the sequence of questions optimal?  Make sure that you can present these kinds of recommendations - it shows the value of testing, as your stakeholders would not be able to identify these insights from an immediate implementation.

So - why not just switch it on?  Here are four good reasons to share with your stakeholders:

* Test data will give you a comparison of whole-site behaviour - not just 'how many people engaged with the new feature?' but also 'what happens to those people who clicked?' and 'how do they compare with users who don't have the feature?'
* Testing will also tell you about  the financial impact of the new feature (good for return-on-investment calculations, which are tricky with seasonality and other factors to consider)
*  Testing has the key benefit that you can switch it off - at short notice, and at any time.  If the data shows that the test recipe is badly losing money then you identify this, and after a discussion with any key stakeholders, you can pull the plug within minutes.  And you can end the test at any time - you don't have to wait until the next IT deployment window to undeploy the new feature. 
* Testing will give you useful data quickly - within days you'll see how it's performing; within weeks you'll have a clear picture.




Wednesday, 21 September 2022

A Quick Checklist for Good Data Visualisation

One thing I've observed during the recent pandemic is that people are now much more interested in data visualisation.  Line graphs (or equivalent bar charts) have become commonplace and are being scrutinised by people who haven't looked at them since they were at school.  We're seeing heatmaps more frequently, and tables of data are being shared more often than usual.  This interest was at its height during the pandemic, and people have generally retained it since (although they wouldn't call it 'data presentation').

This made me consider:  as data analysts and website optimisers, are we doing our best to convey our data as accurately and clearly as possible in order to make our insights actionable?  We want to share information in a way that is easy to understand and easy to base decisions on, and there are some simple ways to do this (even with 'simple' data), without needing glamorous new visualisation techniques.

Here's the shortlist of data visualisation rules:

- Tables of data should be presented consistently, either vertically or horizontally; don't mix them up
- Graphs should be either vertical bars or horizontal bars; be consistent
- If you're transferring from vertical to horizontal, then make sure that top-to-bottom matches left-to-right
- If you use colour, use it consistently and intuitively.

For example, let's consider the basic table of data:  here's one from a sporting context:  the English Premiership's Teams in Form:  results from a series of six games.

Pos | Team      | P | Pts | F  | A | GD | Sequence
1   | Liverpool | 6 | 16  | 13 | 2 | 11 | W W W W W D
2   | Tottenham | 6 | 15  | 10 | 4 | 6  | W L W W W W
3   | West Ham  | 6 | 14  | 17 | 7 | 10 | D W W W W D

The actual data itself isn't important (unless you're a Liverpool fan), but the layout is what I'm looking at here.  Let's look at the raw data layout:

Pos | Category  | Metric 1 | Metric 2 | Metric 3 | Metric 4 | Derived metric | Sequence
1   | Liverpool | 6        | 16       | 13       | 2        | 11             | W W W W W D
2   | Tottenham | 6        | 15       | 10       | 4        | 6              | W L W W W W
3   | West Ham  | 6        | 14       | 17       | 7        | 10             | D W W W W D


The derived metric "GD" is Goal Difference, the total For minus the total Against (e.g. 13-2=11).

Here, the categories are in a column, sorted by rank, and different metrics are arranged in subsequent columns - it's standard for a league table to be shown like this, and we grasp it intuitively.  Here's an example from the US, for comparison:

Player          | Pass Yds | Yds/Att | Att | Cmp | Cmp % | TD | INT | Rate  | 1st | 1st%  | 20+
Deshaun Watson  | 4823     | 8.9     | 544 | 382 | 0.702 | 33 | 7   | 112.4 | 221 | 0.406 | 69
Patrick Mahomes | 4740     | 8.1     | 588 | 390 | 0.663 | 38 | 6   | 108.2 | 238 | 0.405 | 67
Tom Brady       | 4633     | 7.6     | 610 | 401 | 0.657 | 40 | 12  | 102.2 | 233 | 0.382 | 63


You have to understand American Football to grasp all the nuances of the data, but the principle is the same.   For example, Yds/Att is yards per attempt, which is Pass Yds divided by Att.  Columns of metrics, ranked vertically - in this case, by player.

A real life example of good data visualisation

Here's another example; this is taken from Next Green Car comparison tools:


The first thing you notice is that the categories are arranged in the top row, and the metrics are listed in the first column, because here we're comparing items instead of ranking them.  The actual website is worth a look; it compares dozens of car performance metrics on a page that scrolls on and on.  It's vertical.

When comparing data, it helps to arrange the categories like this, with the metrics in a vertical list - for a start, we're able to 'scroll' in our minds better vertically than horizontally (most books are in a portrait layout, rather than landscape).

The challenge (or rather, the cognitive challenge) comes when we ask our readers to compare data in long rows instead of columns... and it gets harder still if we start mixing the two layouts within the same document or presentation.  In fact, challenging isn't the word.  The word is confusing.

The same applies for bar charts - we generally learn to draw and interpret vertical bars in graphs, and then to do the same for horizontal bars.

Either is fine. A mixture is confusing, especially if the sequence of categories is reversed as well. We read left-to-right and top-to-bottom, and a mixture here is going to be misunderstood almost immediately, and irreversibly.

For example, this table of data (from above)

Pos | Category  | Metric 1 | Metric 2 | Metric 3 | Metric 4 | Derived metric | Sequence
1   | Liverpool | 6        | 16       | 13       | 2        | 11             | W W W W W D
2   | Tottenham | 6        | 15       | 10       | 4        | 6              | W L W W W W
3   | West Ham  | 6        | 14       | 17       | 7        | 10             | D W W W W D


Should not be graphed like this, where the horizontal data has been converted to a vertical layout:
And it should certainly not be graphed like this:  yes, the data is arranged in rows and that's remained consistent, but the sequence has been reversed!  For some strange reason, this is the default layout in Excel, and it's difficult to fix.


The best way to present the tabular data in a graphical form - i.e. turning the table into a graph - is to match the layout and the sequence.
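If you build the chart in code rather than in Excel, the fix is usually a single line.  Here's a minimal matplotlib sketch (using the points column from the 'Teams in Form' table above) that keeps the horizontal bars in the same top-to-bottom order as the table:

```python
import matplotlib.pyplot as plt

# Data from the 'Teams in Form' table, in table order (1st place at the top)
teams = ["Liverpool", "Tottenham", "West Ham"]
points = [16, 15, 14]

fig, ax = plt.subplots()
ax.barh(teams, points)    # horizontal bars
ax.invert_yaxis()         # barh puts the first item at the bottom by default;
                          # inverting the axis makes top-to-bottom match the table
ax.set_xlabel("Points (last six games)")
ax.set_title("Teams in Form")
plt.tight_layout()
plt.show()
```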

And keep this consistent across all the data points on all the slides in your presentation.  You don't want your audience performing mental gymnastics to make sense of your data.  It would be like reading a book, then having to turn the page by 90 degrees after a few pages, then going back again on the next page, then turning it the other way after a few more pages.  

You want your audience to spend their mental power analysing and considering how to take action on your insights, and not to spend it trying to read your data.

Other articles with a data theme

Friday, 13 May 2022

Website Compromization

Test data, just like any other data, is open to interpretation.  The more KPIs you have, the more the analysis can be pointed towards one winning test recipe or another.  I've discussed this before, and used my long-suffering imaginary car salespeople to show examples of this.

Instead of a clear-cut winner, which is the best in all cases, we often find that we have to select the recipe which is the best for most of the KPIs, or the best for the main KPI, and accept that maybe it's not the best design overall.  Maybe the test recipe could be improved if additional design changes were made - but there isn't time to test these extra changes before the marketing team need to get their new campaign live (or the IT team need to deploy the winner in their next launch).

Do we have enough time to actually identify the optimum design for the site?  Or the page?  Or the element we're testing?  

Anyway - is this science, or is it marketing?  Do we need to make everything on the site perfectly optimized?  Is 'better than control' good enough, or are we aiming for 'even better'?

What do we have?  Is this site optimization, a compromise, or compromization?

Or maybe you have a test result that shows that your users liked a new feature - they clicked on it, they purchased your product.  Does this sound like a success story?  It does, but only until you realise that the new feature you promoted has diverted users' attention away from your most profitable path.  To put it another way, you coded a distraction. 

For example - your new banner promotes new sports laces for your new range of running shoes... so users purchase them but spend less on the actual running shoes.  And the less expensive shoes have a lower margin, so you actually make less profit. Are you trying to sell new laces, or running shoes?

Or you have a new feature that improves the way you sort your search results, with "Featured" or "Recommended" or "Most Relevant" now serving up results that are genuinely what customers want to see.  The problem is, they're the best quality but lowest-priced products in your inventory, so your conversion rate is up by 10% but your average order value is down by 15%.  What do you do?
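To see the size of the problem, just multiply the two effects together - revenue per visitor is conversion rate times average order value.  A back-of-the-envelope sketch with the illustrative figures above:

```python
# Back-of-the-envelope: conversion up 10%, average order value down 15%.
# Revenue per visitor = conversion rate x average order value.
conversion_change = 1.10   # +10%
aov_change = 0.85          # -15%

rpv_change = conversion_change * aov_change
print(f"Revenue per visitor changes by {rpv_change - 1:+.1%}")   # about -6.5%
```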

Are you following customer experience optimization, or compromization?

Sometimes, you'll need to compromise. You may need to sell the new range of shiny accessories with a potential loss of overall profit in order to break into a new market.  You may decide that a new feature should not be launched because although it clearly improves overall customer experience and sales volumes, it would bring down revenue by 5%.  But testing has shown what the cost of the new feature would be (and perhaps a follow-up test with some adjustments would lead to a drop in revenue of only 2%... would you take that?).    In the end, it's going to be a matter of compromization.

Monday, 6 September 2021

It's Not Zero!

 I started this blog many years ago.  It pre-dates at least two of my children, and possibly all three - back in the days when I had time to spare, time to write and time to think of interesting topics to write about.  Nowadays, it's a very different story, and I discovered that my last blog post was back in June.  I used to aim for one blog article per month, so that's two full months with no digital output here (I have another blog and a YouTube channel, and they keep me busy too).

I remember those first few months, though, trying to generate some traffic for the blog (and for another one I've started more recently, and which has seen a traffic jump in the last few days).  

Was my tracking code working?  Was I going to be able to see which pages were getting any traffic, and where they were coming from?  What was the search term (yes, this goes back to those wonderful days when Google would actually tell you your visitors' search keywords)?

I had weeks and weeks of zero traffic, except for me checking my pages.  Then I discovered my first genuine user - who wasn't me - actually visiting my website.  Yes, it was a hard-coded HTML website and I had dutifully copied and pasted my tag code into each page...  did it work?  Yes, and I could prove it:  traffic wasn't zero.

So, if you're at the point (and some people are) of building out a blog, website or other online presence - or if you can remember the days when you did - remember the day that traffic wasn't zero.  We all implemented the tag code at some point, or sent the first marketing email, and it's always a moment of relief when that traffic starts to appear.

Small beginnings:  this is the session graph for the first ten months of 2010, for this blog.  It's not filtered, and it suggests that I was visiting it occasionally to check that posts had uploaded correctly!  Sometimes, it's okay to celebrate that something isn't zero any more.

And, although you didn't ask, here's the same period January-October 2020, which quietly proves that my traffic increases (through September) when I don't write new articles.  Who knew?








Thursday, 24 June 2021

How long should I run my test for?

 A question I've been facing more frequently recently is "How long can you run this test for?", and its close neighbour "Could you have run it for longer?"

Different testing programs have different requirements:  in fact, different tests have different requirements.  The test flight of the helicopter Ingenuity on Mars lasted 39.1 seconds, straight up and down.  The Wright Brothers' first flight lasted 12 seconds, and covered 120 feet.  Which was the more informative test?  Which should have run longer?

There are various ideas around testing, but the main principle is this:  test for long enough to get enough data to prove or disprove your hypothesis.  If your hypothesis is weak, you may never get enough data.  If you're looking for a straightforward winner/loser, then make sure you understand the concept of confidence and significance.

What is enough data?  It could be 100 orders.  It could be clicks on a banner: the first test recipe to reach 100 clicks - or 1,000, or 10,000 - is the winner (assuming it has a large enough lead over the other recipes).
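If you want to put a number on 'enough data' before you launch, a standard two-proportion sample-size calculation is a reasonable starting point.  Here's a minimal sketch - the baseline conversion rate and the lift you want to detect are illustrative, so swap in your own figures:

```python
# Minimal sketch: how many visitors per recipe to detect a given conversion lift.
# Standard two-proportion sample-size formula; baseline and lift are illustrative.
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Visitors needed in each recipe to detect a change from p1 to p2."""
    z_alpha = norm.ppf(1 - alpha / 2)    # two-sided significance level
    z_beta = norm.ppf(power)             # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

baseline = 0.030                # 3.0% conversion today
target = baseline * 1.05        # hoping to detect a 5% relative lift
n = sample_size_per_group(baseline, target)
print(f"Roughly {n:,.0f} visitors per recipe")   # on the order of 200,000
```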

An important limitation to consider is this:  what happens if your test recipe is losing?  Losing money; losing leads; losing quotes; losing video views.  Can you keep running a test just to get enough data to show why it's losing?  Testing suddenly becomes an expensive business, when each extra day is costing you revenue.   One of the key advantages of testing over 'launch it and see' is the ability to switch the test off if it loses; how much of that advantage do you want to give up just to get more data on your test recipe?

Maybe your test recipe started badly.  After all, many do:  the change of experience from the normal site design to your new, all-improved, management-funded, executive-endorsed design is going to come as a shock to your loyal customers, and it's no surprise when your test recipe takes a nose-dive in performance for a few days.  Or weeks.  But how long can you give your design before you have to admit that it's not just the shock of the new design (sometimes called 'confidence sickness'), but that there are aspects of the new design that need to be changed before it will reach parity with your current site?  A week?  Two weeks?  A month?  Looking at the data over time will help here.  How was performance in week 1?  Week 2?  Week 3?  It's possible for a test to recover, and although a severe initial drop may mean the overall picture never recovers, if you can see that the fourth week was actually flat (for new and returning visitors) then you've found the point where users have adjusted to your new design.
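One practical way to do this is to cut the daily test data into weekly blocks and watch the gap between test and control.  A small pandas sketch, assuming you can export a daily table of visits and orders per recipe (the file and column names here are illustrative):

```python
import pandas as pd

# Illustrative daily export: columns date, recipe ('control' or 'test'), visits, orders
df = pd.read_csv("daily_test_data.csv", parse_dates=["date"])

weekly = (
    df.groupby([pd.Grouper(key="date", freq="W"), "recipe"])[["visits", "orders"]]
      .sum()
      .assign(conversion=lambda x: x["orders"] / x["visits"])["conversion"]
      .unstack("recipe")
)

# Week-by-week gap between test and control conversion
weekly["lift"] = weekly["test"] / weekly["control"] - 1
print(weekly)
# A gap that shrinks towards zero suggests 'confidence sickness' wearing off;
# one that stays the same or widens suggests the design itself needs changing.
```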

If, however, the weekly gaps are widening, or staying the same, then it's time to pack up and call it a day.

Let's not forget that you probably have other tests in your pipeline which are waiting for the traffic that you're using on your test.  How long can they wait until launch?

So, how long should you run your test for?  As long as possible to get the data you need, and maybe longer if you can, unless it's
- suffering from confidence sickness (keep it running)
- losing badly, and consistently (unless you're prepared to pay for your test data)
- losing and holding up your testing pipeline

Similar posts I've written about online testing

Getting an online testing program off the ground
Building Momentum in Online testing
How many of your tests win?

Wright Brothers Picture:

"Released to Public: Wilber and Orville Wright with Flyer II at Huffman Prairie, 1904 (NASA GPN-2002-000126)" by pingnews.com is marked with CC PDM 1.0

Tuesday, 23 February 2021

Knowing Your KPI is Key

I've written in the past about KPIs, and today I find myself sitting at my computer about to re-tell a story about KPIs - with another twist.

Two years ago, almost to the day, I introduced you all to Albert, Britney and Charles, my three fictitious car salespeople.  Back in 2019, they were selling hybrid cars, and we had enough KPIs to make sure that each of them was a winner in some way (except Albert.  He was our 'control', and he was only there to make the others look good.  Sorry, Albert).

Well, two years on, selling cars has gone online.  Covid-19 and all that means that sales of cars are now handled remotely - with video views, emails, and Zoom calls - and targets have been realigned as a result.  The management team have realised that KPIs need to change in line with the new targets (which makes sense), and there are now a number of performance indicators being tracked.

Here are the results from January 2021 for our three long-standing (or long-suffering) salespeople.







Metric                       | Albert  | Britney | Charles
Zoom sessions                | 411     | 225     | 510
Calls answered               | 320     | 243     | 366
Leads generated              | 127     | 77      | 198
Cars sold                    | 40      | 59      | 60
Revenue (£)                  | 201,000 | 285,000 | 203,500
Average car value (£)        | 5,025   | 4,830   | 3,391
Conversion (contact to lead) | 17.4%   | 16.5%   | 22.6%
Conversion (lead to sale)    | 31.5%   | 76.6%   | 30.3%

And again we ask ourselves:  who was the best salesperson?  And, more important, which of the KPIs is actually the KEY performance indicator?

Albert:  had the highest average car value

Britney:  had the highest revenue (40% more than Albert or Charles) and by far the highest conversion from lead to sale.

Charles:  had the most Zoom sessions; calls answered; leads generated; cars sold and conversion from contact  to lead.

Surely Charles won?  Except that wages, overheads and shareholder dividends aren't paid with Zoom sessions; bonuses aren't paid in phone calls and pensions aren't paid with actual cars.

The KPI of most businesses (and certainly this one) is revenue - or, more specifically, profit margin.  It's very nice to be able to talk about other metrics and to use these to improve the business, but if you're a business and your KPI isn't something related to money, then you're probably not aiming for the right target.  

Yes, you can certainly use other metrics to improve the business:  for example, Charles desperately needs to learn how to sell higher-value cars.  He's extremely productive - even prolific - with his customer contacts, but he's £1,400 down per car compared to Britney, and £1,600 down per car compared to Albert.  Additionally, if Britney learned to make her sales conversations and Zoom technique faster and more efficient, her sales volumes would increase.  This use of data to drive action is what makes your analysis actionable.

So:  metrics and KPIs aren't the same thing.  Select the KPI that actually matches the business aim (typically margin and revenue) and don't get distracted by lesser KPIs that are actually just calculated ratios.  Use all the metrics to improve business performance, but pick your winner based on what really matters to your company.
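As a footnote to the table above: most of those 'lesser KPIs' really are just calculated ratios of the raw counts.  A small pandas sketch that rebuilds them from the January 2021 figures:

```python
import pandas as pd

# Raw (non-derived) figures from the January 2021 table
data = pd.DataFrame({
    "Albert":  {"zoom": 411, "calls": 320, "leads": 127, "cars": 40, "revenue": 201_000},
    "Britney": {"zoom": 225, "calls": 243, "leads": 77,  "cars": 59, "revenue": 285_000},
    "Charles": {"zoom": 510, "calls": 366, "leads": 198, "cars": 60, "revenue": 203_500},
}).T

# The 'derived' rows are simply ratios of the raw metrics
data["avg_car_value"] = data["revenue"] / data["cars"]
data["contact_to_lead"] = data["leads"] / (data["zoom"] + data["calls"])
data["lead_to_sale"] = data["cars"] / data["leads"]

print(data.round(3))
# The calculated columns match the table - they're ratios of other metrics,
# which is why revenue (or margin) remains the KPI that actually pays the bills.
```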

I have looked at KPIs in some of my other articles:

The Importance of Being Earnest with your KPIs
Why Test Recipe KPIs are Vital
Web Analytics and Testing - A summary so far



Tuesday, 8 December 2020

A/B testing without a 50-50 split

Whenever people ask me what I do for a living, I [try not to] launch off into a little speech about how I improve website design and experience by running tests, where we split traffic 50-50 between test and control, and mathematically determine which is better.  Over the years, it's been refined and dare I say optimized, but that's the general theme, because that's the easiest way of describing what I do.  Simple.

There is nothing in the rules, however, that says you have to split traffic 50-50.  We typically say 50-50 because it gives each visitor an equal chance of landing in either of the two groups - like tossing a coin - but that's just tradition (he says, tearing up the imaginary rule book).

Why might you want to test on a different split setting?

1.  Maybe your test recipe is so completely 'out-there' and different from control that you're worried it'll affect your site's KPIs, and you want to test more cautiously.  So, why not do a 90-10?  You only risk 10% of your total traffic - and providing that 10% is large enough to produce a decent sample size, why risk a further 40%?  And if it starts winning, then maybe you increase to an 80-20 split, and move towards 50-50 eventually.

2.  Maybe your test recipe is based on a previous winner, and you want to get more of your traffic into a recipe that should be a winner as quickly as possible (while also checking that it is still a winner).  So you have the opportunity to test on a 10-90 split, with most of your traffic on the test experience and 10% held back as a control group to confirm your previous winner.

3.  Maybe you need test data quickly - you are confident you can use historic data for the control group, but you need to get data on the test page/site/experience, and for that, you'll need to funnel more traffic into the test group.  You can use a combination of historic data and control group data to measure the current state performance, and then get data on how customers interact with the new page (especially if you're measuring clicks on a new widget on the page, and how customers like or dislike it).

4.  Maybe you're running a Multi-Armed Bandit test.

Things to watch out for

If you decide to run an A/B test on uneven splits, then beware:

- You need to emphasise conversion rates, and calculate your KPIs as "per visitor" or "per impression".  I'm sure you do this already with your KPIs, but absolute numbers of orders, clicks or revenue will not be suitable here.  If you have twice as much traffic in B compared to A (a 67-33 split), then you should expect twice as many success events from an identical success rate; you'll need to divide by visit, visitor or page view (depending on your metric, and your choice).  There's a quick sketch of this after the list below.

- You can't do multivariate analysis on uneven splits - as I mentioned in my articles on MVT analysis, you need equal-ish numbers of visits in order to combine the data from the different recipes.
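Here's the quick sketch promised above: a minimal example (all figures invented) of comparing a 90-10 split on per-visitor rates rather than raw counts, with a two-proportion z-test that is perfectly comfortable with unequal group sizes:

```python
# Uneven (90-10) split: compare rates per visitor, never raw counts.
# All figures below are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

visitors = {"control": 90_000, "test": 10_000}
orders = {"control": 2_700, "test": 330}

for name in visitors:
    rate = orders[name] / visitors[name]
    print(f"{name}: {orders[name]:,} orders / {visitors[name]:,} visitors = {rate:.2%}")

# A two-proportion z-test handles unequal group sizes
stat, p_value = proportions_ztest(
    count=[orders["test"], orders["control"]],
    nobs=[visitors["test"], visitors["control"]],
)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
```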


Friday, 6 March 2020

Analysis versus Interpretation

We have had a disappointingly mild winter.

It snowed on two days...


You will easily notice the bias in that sentence. Friends and long-time readers will know that I love snow, for many reasons. The data from the Meteorological Office puts the winter (1 December - 29 February) into context, using a technique that I've mentioned before - ranking the specific period against the rest of the data set.


So, by any measure, it was a wet and mild winter. Far more rain than usual (across the country), and temperatures were above average.

This was posted on Facebook, a website renowned for its lack of intelligent and considered discussion, and known for the sharp-shooting debates.  Was it really wetter than usual? Is global warming to blame? Is this an upward trend (there is insufficient data here) or a fluke?

And then there's the series of distraction questions - how long have records been held? Have the temperature and rainfall data been recorded since the same original date? Is any of that relevant? No.

In my experience, analysis is hard, but anybody, it seems, can carry out the interpretation.  However, interpretation is wide open to personal bias, and the real skill is in treating the data impartially and interpreting it from that viewpoint.  It requires additional data research - for example, is February's data an anomaly or is it a trend?  Time to go and look in the archive and support your interpretation with more data.


Friday, 24 January 2020

Project Management: A Trip To The Moon

Scene: meeting room, some people dialling in remotely. The plan is to launch a manned rocket to the moon, and the project manager (PM) is kicking off the project.
PM "Right, team, let's plan this space journey to the moon. What kind of fuel will we use in our rocket?"
Designer 'Before I answer that, we want to discuss the colour of the nose cone. The design is to paint it blue.'
PM "Okay, blue is fine. Have you had any thoughts about the engine?"
Designer 'No, but we actually think a red nosecone might be better.'
PM "Noted. Let's move on from that, and come back to it nearer the launch time."
Marketing: We thought blue. Now, how we will we choose the pilots? PM "I was thinking that we would have a rigorous selection process."
Marketing: "We can do that. But we'd like to address the name of the spaceship. Our subsidiary want to call it the USSS Pegasus. We want to refer to it as the US Pegasus - the 'SS' was a suggestion from our previous owner. As this is a combined program, we're going to go with the US Pegasus."
PM "Noted. The US Pegasus. Now, about the pilots..."
Designer "And the name of the ship must be in blue text."
PM [making notes] "...blue text..."
Designer "To match the nose cone." PM "Now, circling back to the question of the pilots."
Stakeholder: "Oh, you can't say that. Circling back suggests that the ship isn't going to land on the moon." PM "Sure. So let's go on to the pilots?"
Stakeholder; "Yes, we can sort that out." PM "Thanks. Now - timelines. Do you have a target date for landing on the moon?" Stakeholder; "Yes, we want to land on 28 July, 2020. When do you need to launch?"
PM "How long will the flight take?" Stakeholder "That depends on the fuel." PM "Doesn't it depend on the engine?" Marketing "Possibly. But it's important that we land on 28 July." Stakeholder "Yes. 28 July. We've set that date with our president. It's his birthday"
PM "So who can give me the launch date?"
Stakeholder "Well, we expected you to provide that." PM "Okay, let's assume it takes four days to reach the moon. Can you have everything built and fuelled by then?" Stakeholder "And we'll want to check everything works." PM "Like a test launch?" Marketing "Oh no, we can't have a test launch. We can't have our competitors knowing what we're doing."
PM "No test launch?" Marketing "No." PM "And the pilots?" Stakeholder "I'm working on it." PM "And the fuel?" Stakeholder "I'll find somebody. Somebody somewhere must know something about it."
Marketing "And we'll need hourly readouts on speed. Preferably minute by minute. And oxygen levels; distance from the earth; internal and external temperatures. All those things." PM "Are you interested in the size of the engine?"
Stakeholder "We've been planning this for six months already. We know it'll need an engine." Engineer; "Sorry I'm late, I've just joined." PM "Thanks for joining. We're just discussing the rocket engine. Do you know what size it will be?" Engineer: "Big." PM "Big enough?" Engineer: "Yes. 1000 cubic units. Big enough." PM: "Great. Thanks. Let's move on." Stakeholder: "Wait, let's just check on that detail. Are you sure?"

Engineer; "Yes. I've done the calculations. It's big enough." Stakeholder: "To get to the moon?" Engineer: "Yes." Stakeholder: "And back?" Engineer: "Yes." Designer: "Even if we have blue text instead of red?"

Engineer: "Yes."
Marketing; "What about if we have red text."
Engineer; "The colour of the text isn't going to affect the engine performance." Stakeholder "Are you sure?"
Engineer: "We're not burning the paint as fuel. We're not painting the engine. We're good." PM: "Thank you. Now; how much fuel do you need?"
Engineer: "That depends. How quickly do you want to get there?" PM: "We need to land on the moon on 28 July 2020. I've estimated a four-day flight time." Engineer; "I'd make it five days, to be on the safe side, and I would calculate 6000 units of class-one fuel, approximately." PM: "Okay, that sounds reasonable. Will the number of pilots affect the fuel calculation?" Engineer: "Yes, but it won't significantly change the 6000 units estimate. When you know the number and mass of the pilots, we can calculate the fuel tank size we'll need."
Stakeholder; "But we won't know that until launch." PM: "Until launch?" Stakeholder: "Yes. We don't know how many people we want to send to the moon until the day of the launch." PM: "And the colour of the text? And the nose cone? And the actual text."
Stakeholder: "Will all depend on people we send."
PM: "No test launch?" Marketing; "No. We need this to be secret so that our competitors don't know what we're doing." PM: "So we're launching an undetermined number of people, in an untested rocket of unknown name and size, to the moon, with an approximate flight time and fuel load, at some point in the future."
Marketing: "But it must land on 28 July." PM: "2020, yes. Ok, We've run out of time for today, but let's catch up tomorrow with progress. Between now and then, let's work to decide some of the smaller details like the fuel and the engine, and tomorrow we can cover the main areas, such as the size of the rocket and where it's going. Thank you, everybody. Goodbye for now."

Tuesday, 21 May 2019

Three-Factor Multi-Variate Testing

TESTING ALL POSSIBILITIES WITHOUT TESTING EVERYTHING

My favourite part of my job is determining what to test, and planning how to run a test.  I enjoy the analysis afterwards, but the most enjoyable part of the testing process is deciding what the test recipes will actually be.  I've covered - at length - test design and planning, and also multi-variate testing.  I particularly enjoy multi-variate testing, since it allows you to test all possibilities without having to test everything.


In my previous posts, where I introduced MVT, I've only covered two-factor MVT: should this variable be black or red?  Should it be a picture of a man or a woman?  Should it say 'Special offer' or 'Limited time'?  Is it x or is it y?  How do you analyse MVT results?  In this post, I'm going to take the discussion of testing one step further, and look at three-factor multi-variate testing:  should it be x, y or z?


Just as there are limited opportunities for MVT, the range of opportunities for three-factor MVT is potentially even more limited.  However, I'd like to explain that this doesn't have to be the case, and that it just takes careful planning to determine when and how to set up a test where there are three possible 'best answers'.


SCENARIO


You run a domestic travel agency, which specialises in arranging domestic travel for customers across the country (this works better if you imagine it in the US, but it works for smaller countries too).  You provide a full door-to-door service, handling everything from fuel, insurance, tickets and transfers - whatever it takes, you can do it.  Consequently, you are in high demand around Christmas and Thanksgiving (see, I told you this worked better in the US), and potentially other holiday periods.  Yes, you're a travel agency firm based on Planes, Trains and Automobiles.


It's the run-up to the largest sales time of the year, as you prepare to reunite distant family members across the country for those big family celebrations and parties and whatever else.  What do you lead with on your website's homepage?


Planes?

Trains?
Or automobiles?

If you want to include buses, look out for a not-yet-planned post on four-factor MVT.  I'll have it ready by Christmas.


So far, this would be a straightforward A/B/C test, with a plane, a car and a train.  Your company colours are yellow, so let's go with that:



Your marketing team are also unsure how to lead with their messaging - should they emphasise price, reliability, or an emotional connection?


They can't choose between

"Cross the country without costing the world" (price)
"Guaranteed door-to-door on time, every time" (reliability)
"Bring your smile to their doorstep this holiday" (emotional)

So now we have nine recipes, A-I.


A: Plane plus price
B: Plane plus reliability
C: Plane plus emotions

D: Car plus price
E: Car plus reliability
F: Car plus emotions

G: Train plus price
H: Train plus reliability
I: Train plus emotions


Now, somebody in the exec suite has decided that now might be the time to try out a new set of corporate colours.  Yellow is bright and cheery, but according to the exec, it can be seen as immature, and not very sophisticated.  The alternatives are red and blue (plus the original yellow).


Here goes:  there are now 3x3x3 possible variations - that's 27 altogether.  And you can't run a test with 27 recipes - for a start, there aren't enough letters in the alphabet.  There's also traffic and timing to consider - it will take months to run a test like that to get any level of significance.  Nevertheless, this is an executive request, so we'll have to make it happen.
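If you want to sanity-check that count, or generate the full recipe list for a test plan, a couple of lines of Python will do it.  A minimal sketch:

```python
from itertools import product

colours = ["Red", "Blue", "Yellow"]
vehicles = ["Plane", "Train", "Car"]
messages = ["Price", "Reliability", "Emotions"]

# Every combination of the three three-level factors
recipes = list(product(colours, vehicles, messages))
print(len(recipes))    # 27
for colour, vehicle, message in recipes[:3]:
    print(colour, vehicle, message)
```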


Firstly, the visuals:  if this was just a two-variable test, then we'd have nine recipes, as you can see below.

[Image: the nine colour and vehicle combinations]
However, each of these vehicle/colour combinations has three more options (based on the marketing message that we select) - here is a small sample of the 27 total combinations, to give you an idea.

[Images: a sample of the 27 colour, vehicle and message combinations]
This is not a suitable testing set, but it gives you an idea of the total variations that we're looking at.  The next step, as we did with the more straightforward two-factor MVT, is to identify our orthogonal set - the minimum recipes that we could test that would give us sufficient information to infer the performance of the recipes that we don't test.  It's time to charge up your spreadsheet.

THE RECIPES - AN ORTHOGONAL SET

There are 3*3*3 = 27 different combinations of colour, text and vehicle... here's the list, since you're wondering ;-)



Recipe Colour Vehicle Message
A Red Plane Price
B Red Plane Reliability
C Red Plane Emotions
D Red Train Price
E Red Train Reliability
F Red Train Emotions
G Red Car Price
H Red Car Reliability
I Red Car Emotions
J Blue Plane Price
K Blue Plane Reliability
L Blue Plane Emotions
M Blue Train Price
N Blue Train Reliability
O Blue Train Emotions
P Blue Car Price
Q Blue Car Reliability
R Blue Car Emotions
S Yellow Plane Price
T Yellow Plane Reliability
U Yellow Plane Emotions
V Yellow Train Price
W Yellow Train Reliability
X Yellow Train Emotions
Y Yellow Car Price
Z Yellow Car Reliability
AA Yellow Car Emotions


The recipes with the faint green shading would form a simple orthogonal set; here they are for clarity:

Recipe | Colour | Vehicle | Message
A      | Red    | Plane   | Price
E      | Red    | Train   | Reliability
I      | Red    | Car     | Emotions
K      | Blue   | Plane   | Reliability
O      | Blue   | Train   | Emotions
P      | Blue   | Car     | Price
U      | Yellow | Plane   | Emotions
V      | Yellow | Train   | Price
Z      | Yellow | Car     | Reliability


Note that each colour, vehicle and message appear three times each; there are therefore nine recipes that we need.  This is still a considerable number, but it's a significant saving from 27 in total.
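If you'd like to generate a set like this rather than pick it out by hand, one simple construction for three three-level factors is a Latin-square rule: pair every colour with every vehicle, then choose the message index as (colour index + vehicle index) mod 3.  Here's a sketch that reproduces the nine recipes above (it's one valid orthogonal set; others exist):

```python
# Sketch: build a nine-recipe orthogonal set for three 3-level factors
# using a Latin-square rule (message index = colour index + vehicle index, mod 3).
colours = ["Red", "Blue", "Yellow"]
vehicles = ["Plane", "Train", "Car"]
messages = ["Price", "Reliability", "Emotions"]

orthogonal_set = [
    (colour, vehicle, messages[(c + v) % 3])
    for c, colour in enumerate(colours)
    for v, vehicle in enumerate(vehicles)
]

for recipe in orthogonal_set:
    print(recipe)
# Each colour, vehicle and message appears exactly three times,
# and every colour/vehicle pair appears exactly once.
```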

THE ANALYSIS

Which colour?  How to find the best variation for each element


Select the recipes which will give us a reading on the best colour by choosing recipes where the other variants cancel to noise:


This is simple (and simpler than the two-factor version):  we simply add the results for all the "red" recipes, and compare them with the combined results for all the "blue" recipes and all the "yellow" recipes.


Let's take a look at some hypothetical data, based on the orthogonal recipe set shown above:


Recipe            | a      | e      | i      | k      | o      | p      | u      | v      | z
Visits            | 1919   | 1922   | 1932   | 1939   | 1931   | 1934   | 1915   | 1955   | 1944
Bookings          | 193    | 194    | 189    | 194    | 205    | 192    | 200    | 209    | 206
Revenue (k)       | £14.2  | £14.6  | £14.4  | £14.3  | £15.6  | £13.94 | £14.8  | £15.7  | £15.4
Conversion        | 10.1%  | 10.1%  | 9.8%   | 10.0%  | 10.6%  | 9.9%   | 10.4%  | 10.7%  | 10.6%
Lift              | -      | 0.4%   | -2.7%  | -0.5%  | 5.6%   | -1.3%  | 3.8%   | 6.3%   | 5.4%
Avg Booking Value | £73.58 | £75.26 | £76.19 | £73.71 | £76.10 | £72.60 | £74.00 | £75.12 | £74.76
Lift              | -      | 2.3%   | 3.6%   | 0.2%   | 3.4%   | -1.3%  | 0.6%   | 2.1%   | 1.6%
RPV               | £7.40  | £7.60  | £7.45  | £7.37  | £8.08  | £7.21  | £7.73  | £8.03  | £7.92
Lift              | -      | 2.7%   | 0.7%   | -0.3%  | 9.2%   | -2.6%  | 4.4%   | 8.5%   | 7.1%


I've shown the raw metrics and the calculated metrics for the recipes, but it's important to remember at this point:  the recipes shown here probably won't include the best recipe.  After all, we're testing nine recipes out of a total of 27, so we have only a one in three chance of selecting the optimum combination.
What we need to do next, as I mentioned above, is to combine the data for all the yellow recipes, and compare with the red and the blue.
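If your per-recipe results sit in a table, this 'combine and compare' step is a simple group-by.  A minimal pandas sketch using the visits, bookings and revenue figures from the orthogonal set above (the same approach is then repeated for vehicle and message):

```python
import pandas as pd

# Per-recipe results from the orthogonal test (revenue in £)
results = pd.DataFrame({
    "recipe":   list("aeikopuvz"),
    "colour":   ["Red", "Red", "Red", "Blue", "Blue", "Blue", "Yellow", "Yellow", "Yellow"],
    "visits":   [1919, 1922, 1932, 1939, 1931, 1934, 1915, 1955, 1944],
    "bookings": [193, 194, 189, 194, 205, 192, 200, 209, 206],
    "revenue":  [14200, 14600, 14400, 14300, 15600, 13940, 14800, 15700, 15400],
})

# Combine all recipes that share a colour, then recalculate the rate metrics
by_colour = results.groupby("colour")[["visits", "bookings", "revenue"]].sum()
by_colour["conversion"] = by_colour["bookings"] / by_colour["visits"]
by_colour["rpv"] = by_colour["revenue"] / by_colour["visits"]
print(by_colour.round(3))
# Repeat with groupby("vehicle") and groupby("message") for the other two factors.
```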



            | Red (aei) | Blue (kop) | Lift vs Red | Yellow (uvz) | Lift vs Red
Visits      | 5773      | 5804       |             | 5814         |
Bookings    | 576       | 591        |             | 615          |
Revenue (k) | £43.2     | £43.84     |             | £45.9        |
Conversion  | 9.98%     | 10.18%     | 2.1%        | 10.58%       | 6.0%
ABV         | 75.00     | 74.18      | -1.1%       | 74.63        | -0.5%
RPV         | 7.48      | 7.55       | 0.9%        | 7.89         | 5.5%


So we can see from our simple colour analysis (adding all the results for the recipes which contain Red, vs Blue, vs Yellow) that Yellow is the best.  The Conversion has a 6% lift, and while Average Booking Value is slightly lower, the Revenue Per Visit is still 5.5% higher for the yellow recipes than it is for the Red.

Now we do the same for the vehicles: plane, train or car?
            | Plane (aku) | Train (eov) | Lift vs Plane | Car (ipz) | Lift vs Plane
Visits      | 5773        | 5808        |               | 5810      |
Bookings    | 587         | 608         |               | 587       |
Revenue (k) | £43.3       | £45.9       |               | £43.74    |
Conversion  | 10.17%      | 10.47%      | 3.0%          | 10.10%    | -0.6%
ABV         | 73.76       | 75.49       | 2.3%          | 74.51     | 1.0%
RPV         | 7.50        | 7.90        | 5.4%          | 7.53      | 0.4%

Clear winner in this case:  it's Train, which is the best for conversion, average booking value and revenue per visit.

And finally, the messaging:  emotional, price or reliability?

            | Price (apv) | Emotion (iou) | Lift vs Price | Reliability (ekz) | Lift vs Price
Visits      | 5808        | 5778          |               | 5805              |
Bookings    | 594         | 594           |               | 594               |
Revenue (k) | £43.84      | £44.8         |               | £44.3             |
Conversion  | 10.23%      | 10.28%        | 0.5%          | 10.23%            | 0.1%
ABV         | 73.80       | 75.42         | 2.2%          | 74.58             | 1.0%
RPV         | 7.55        | 7.75          | 2.7%          | 7.63              | 1.1%

And in this case, it's Emotion which is the best, with clearly better average booking value and revenue.  It would appear that price is not the best way to lead your messaging.

CONCLUSION AND THOUGHTS

The best combination is:
Yellow Train, with Emotion messaging.


Notice that the performance of the recipes that we actually tested is in agreement with the winning combination (based on the calculations).


Recipes that contain none of the winning elements performed the worst:

A  - Red Plane, Price :  RPV £7.40
K - Blue Plane, Reliability:  RPV £7.37
P - Blue Car, Price  :  RPV £7.21

Recipes that contain just one of the winning elements produced slightly  better results:

E - Red Train, Reliability:  £7.60
I - Red Car, Emotions:  £7.45

Z - Yellow Car, Reliability: £7.92*

Recipes that contained two of the three winning elements were the best performers:

O - Blue Train, Emotions:  £8.08
U - Yellow Plane, Emotions:  £7.73
V - Yellow Train, Price: £8.03


I would strongly recommend running a follow-up test, with the two winners from the first selection (O and V) along with the proposed winner based on the analysis, Yellow Train with Emotions.  It's possible that this proposed winner will be the best; there's also the possibility that it may be close to but not as good as O or V. 

*There's also an argument for including Z (Yellow Car, Reliability) as an outlier, given its performance.  


There are some clear losers that do not need to be pursued:  notice how two of the bottom three performing recipes contain Blue and Price.  All of the Price recipes that we tested - A, P and V, had lower than typical Average Booking Value, and this includes recipe V, which was one of the best recipes.  With a different message (Emotions, most likely), Recipe V would be a runaway success.

It's not surprising that a follow-up is needed; remember that we've only tested nine out of 27 combinations, and it's unlikely that we'll have hit the optimum design first time around.  However, by careful selection of our original recipes, we need only test four more (at the most) to identify the best from all 27.  Finding the best combination from 27, by only testing 13 is a definite winner.  This is the power of multi-variate testing: the ability to test all possibilities without having to test everything.

Here's my series on Multi Variate Testing

Preview of Multi Variate testing
Web Analytics: Multi Variate testing 
Explaining complex interactions between variables in multi-variate testing
Is Multi Variate Testing an Online Panacea - or is it just very good?
Is Multi Variate Testing Really That Good 
Hands on:  How to set up a multi-variate test
And then: Three Factor Multi Variate Testing - three areas of content, three options for each!