Web Optimisation, Maths and Puzzles: analysis



Sunday, 24 November 2024

Testing versus Implementing - why not just switch it on?

"Why can't we just make a change and see what happens? Why do we have to build an A/B test - it takes too long!  We have a roadmap, a pipeline and a backlog, and we haven't got time."

It's not always easy to articulate why testing is important - especially if your company is making small, iterative, data-backed changes to the site and your tests consistently win (or, worse still, go flat).  The IT team is testing carefully and cautiously, but the time taken to build the test and run it is slowing down everybody's pipelines.  You work with the IT team to build the test (which takes time), it runs (which takes even more time), you analyse the test (why?) and you show that their good idea was indeed a good idea.  Who knew?


Ask an AI what a global IT roadmap looks like...

However, if your IT team is building and deploying something to your website - a new way of identifying a user's delivery address; or a new way of helping users decide which sparkplugs or ink cartridges or running shoes they need - something new, innovative and very different, then I would strongly recommend that you test it with them, even if there is strong evidence for its effectiveness.  Yes, they have carried out user-testing and it's done well.  Yes, their panel loved it.  Even the Head of Global Synergies liked it, and she's a tough one to impress.  Their top designers have spent months in collaboration with the project manager, and their developers have gone through the agile process so many times that they're as flexible as ballet dancers.  They've barely reached the deadline for pre-Christmas implementation, and now is the time to implement it.  It is ready.  However, the Global Integration Leader has said that they must test before they launch, but that's okay as they have allocated just enough time for a pre-launch A/B test, then they'll go live as soon as the test is complete.


Sarah Harries, Head of Global Synergies

Everything hinges on the test launching on time, which it does.  Everybody in the IT team is very excited to see how users engage with the new sparkplug selection tool and - more importantly for everybody else - how much it adds to overall revenue.  (For more on this, remember that clicks aren't really KPIs). 

But the test results come back: you have to report that the test recipe is underperforming, with a 6.3% drop in conversion.  Engagement looks healthy at 11.7%, but those engaged users are dragging down overall performance.  The page exit rate is lower, but fewer users are going through checkout and completing a purchase.  Even after two full weeks, the data is looking negative.

Can you really recommend implementing the new feature?  No; but that's not the end of the story.  It's your job to now unpick the data, and turn analysis into insights:  why didn't it win?!

The IT team, understandably, want to implement.  After all, they've spent months building this new selector and the pre-launch data was all positive.  The Head of Global Synergies is asking them why it isn't on the site yet.  Their timeline allowed three weeks for testing and you've spent three weeks testing.  Their unspoken assumption was that testing was a validation of the new design, not a step that might turn out to be a roadblock, and they had not anticipated any need for post-test changes.  It was challenging enough to fit in the test, and besides, the request was to test it.

It's time to interrogate the data.

Moreover, they have identified some positive data points:

*  Engagement is an impressive 11.7%.  Therefore, users love it.
*  The page exit rate is lower, so more people are moving forwards.  That's all that matters for this page:  get users to move forwards towards checkout.
*  The drop in conversion is coming from the pages in the checkout process.  That can't be related to the test, which is in the selector pages.  It must be a checkout problem.

They question the accuracy of the test data, which contradicts all their other data.

* The sample size is too small.
* The test was switched off before it had a chance to recover its 6.3% drop in conversion.

They suggest that the whole A/B testing methodology is inaccurate.

* A/B testing is outdated and unreliable.  
* The split between the two groups wasn't 50-50.  There are 2.2% more visitors in A than B.  (A quick way to sanity-check this objection is sketched below.)
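As an aside, the sample-ratio objection is the easiest of these to answer with numbers.  Here's a minimal sketch, using invented visitor counts, of how you might check whether an imbalance like that is anything more than random chance:

```python
# Quick sample-ratio check: is a ~2.2% imbalance consistent with a random 50-50 split?
# The visitor counts below are invented for illustration.
from scipy.stats import chisquare

visitors_a = 5_055    # hypothetical count for recipe A
visitors_b = 4_945    # hypothetical count for recipe B (about 2.2% fewer)

total = visitors_a + visitors_b
chi2, p_value = chisquare([visitors_a, visitors_b], f_exp=[total / 2, total / 2])

print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
# A very small p-value (say < 0.01) points to a genuine sample-ratio mismatch worth
# investigating; a larger one means the imbalance is consistent with random chance.
```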

Maybe they'll comment that the data wasn't analysed or segmented correctly, and they make some points about this:

* The test data includes users buying other items with their sparkplugs.  These should be filtered out.
* The test data must have included users who didn't see the test experience.
* The data shows that users who browsed on mobile phones only performed at -5.8% on conversion, so they're doing better than desktop users.

Remember:  none of this is personal.  You are, despite your best efforts, criticising a project that they've spent weeks or even months polishing and producing.  Nobody until this point has criticised their work; in fact, everybody has said how good it is.  It's not your fault: your job is to present the data and to provide insights based on it.  As a testing professional, your job is to run and analyse tests, not to be swayed into showing the data in a particular way.

They ran the test at the request of the Global Integration Leader, and burnt three weeks waiting for the test to complete.  The deadline for implementing the new sparkplug selector is Tuesday, and they can't stop the whole IT roadmap (which is dependent on this first deployment) just because one test showed some negative data.  They would have preferred not to test it at all, but it remains your responsibility to share the test data with other stakeholders in the business, marketing and merchandising teams, who have a vested interest in the site's financial performance.  It's not easy, but it's still part of your role to present the unbiased, impartial data that makes up your test analysis, along with the data-driven recommendations for improvements.

It's not your responsibility to make the go/no-go decision, but it is up to you to ensure that the relevant stakeholders and decision-makers have the full data set in front of them when they make the decision.  They may choose to implement the new feature anyway, taking into account that it will need to be fixed with follow-up changes and tweaks once it's gone live.  It's a healthy compromise, providing that they can pull two developers and a designer away from the next item on their roadmap to do retrospective fixes on the new selector.  
Alternatively, they may postpone the deployment and use your test data to address the conversion drops that you've shared.  How are the conversion drop and the engagement data connected?  Is the selector providing valid and accurate recommendations to users?  Does the data show that they enter their car colour and their driving style, but then go to the search function when they reach a question about their engine size?  Is the sequence of questions optimal?  Make sure that you can present these kinds of recommendations - it shows the value of testing, as your stakeholders would not be able to identify these insights from an immediate implementation.

So - why not just switch it on?  Here are four good reasons to share with your stakeholders:

* Test data will give you a comparison of whole-site behaviour - not just 'how many people engaged with the new feature?' but also 'what happens to those people who clicked?' and 'how do they compare with users who don't have the feature?'
* Testing will also tell you about  the financial impact of the new feature (good for return-on-investment calculations, which are tricky with seasonality and other factors to consider)
*  Testing has the key benefit that you can switch it off - at short notice, and at any time.  If the data shows that the test recipe is badly losing money then you identify this, and after a discussion with any key stakeholders, you can pull the plug within minutes.  And you can end the test at any time - you don't have to wait until the next IT deployment window to undeploy the new feature. 
* Testing will give you useful data quickly - within days you'll see how it's performing; within weeks you'll have a clear picture.




Wednesday, 21 September 2022

A Quick Checklist for Good Data Visualisation

One thing I've observed during the recent pandemic is that people are now much more interested in data visualisation.  Line graphs (or equivalent bar charts) have become commonplace and are being scrutinised by people who haven't looked at them since they were at school.  We're seeing heatmaps more frequently, and tables of data are being shared more often than usual.  This interest was at its height during the pandemic, and people have generally retained it since (although they wouldn't call it 'data presentation').

This made me consider:  as data analysts and website optimisers, are we doing our best to convey our data as accurately and clearly as possible in order to make our insights actionable?  We want to share information in a way that is easy to understand and easy to base decisions on, and there are some simple ways to do this (even with 'simple' data), without needing glamorous new visualisation techniques.

Here's the shortlist of data visualisation rules:

- Tables of data should be presented consistently, either vertically or horizontally; don't mix them up
- Graphs should be either vertical bars or horizontal bars; be consistent
- If you're transferring from vertical to horizontal, then make sure that top-to-bottom matches left-to-right
- If you use colour, use it consistently and intuitively.

For example, let's consider the basic table of data:  here's one from a sporting context:  the English Premiership's Teams in Form:  results from a series of six games.

Pos | Team      | P | Pts | F  | A | GD | Sequence
1   | Liverpool | 6 | 16  | 13 | 2 | 11 | W W W W W D
2   | Tottenham | 6 | 15  | 10 | 4 | 6  | W L W W W W
3   | West Ham  | 6 | 14  | 17 | 7 | 10 | D W W W W D

The actual data itself isn't important (unless you're a Liverpool fan), but the layout is what I'm looking at here.  Let's look at the raw data layout:

Pos | Category  | Metric 1 | Metric 2 | Metric 3 | Metric 4 | Derived metric | Sequence
1   | Liverpool | 6        | 16       | 13       | 2        | 11             | W W W W W D
2   | Tottenham | 6        | 15       | 10       | 4        | 6              | W L W W W W
3   | West Ham  | 6        | 14       | 17       | 7        | 10             | D W W W W D


The derived metric "GD" is Goal Difference, the total For minus the total Against (e.g. 13-2=11).

Here, the categories are in a column, sorted by rank, and different metrics are arranged in subsequent columns - it's standard for a league table to be shown like this, and we grasp it intuitively.  Here's an example from the US, for comparison:

Player          | Pass Yds | Yds/Att | Att | Cmp | Cmp % | TD | INT | Rate  | 1st | 1st%  | 20+
Deshaun Watson  | 4823     | 8.9     | 544 | 382 | 0.702 | 33 | 7   | 112.4 | 221 | 0.406 | 69
Patrick Mahomes | 4740     | 8.1     | 588 | 390 | 0.663 | 38 | 6   | 108.2 | 238 | 0.405 | 67
Tom Brady       | 4633     | 7.6     | 610 | 401 | 0.657 | 40 | 12  | 102.2 | 233 | 0.382 | 63


You have to understand American Football to grasp all the nuances of the data, but the principle is the same.   For example, Yds/Att is yards per attempt, which is Pass Yds divided by Att.  Columns of metrics, ranked vertically - in this case, by player.

A real life example of good data visualisation

Here's another example; this is taken from Next Green Car comparison tools:


The first thing you notice is that the categories are arranged in the top row, and the metrics are listed in the first column, because here we're comparing items instead of ranking them.  The actual website is worth a look; it compares dozens of car performance metrics on a page that scrolls on and on.  It's vertical.

When comparing data, it helps to arrange the categories like this, with the metrics in a vertical list - for a start, we're able to 'scroll' in our minds better vertically than horizontally (most books are in a portrait layout, rather than landscape).

The challenge (or rather, the cognitive challenge) comes when we ask our readers to compare data in long rows instead of columns... and it gets harder still if we start mixing the two layouts within the same document or presentation.  In fact, challenging isn't the word.  The word is confusing.

The same applies for bar charts - we generally learn to draw and interpret vertical bars in graphs, and then to do the same for horizontal bars.

Either is fine. A mixture is confusing, especially if the sequence of categories is reversed as well. We read left-to-right and top-to-bottom, and a mixture here is going to be misunderstood almost immediately, and irreversibly.

For example, this table of data (from above)

Pos | Category  | Metric 1 | Metric 2 | Metric 3 | Metric 4 | Derived metric | Sequence
1   | Liverpool | 6        | 16       | 13       | 2        | 11             | W W W W W D
2   | Tottenham | 6        | 15       | 10       | 4        | 6              | W L W W W W
3   | West Ham  | 6        | 14       | 17       | 7        | 10             | D W W W W D


Should not be graphed like this, where the horizontal data has been converted to a vertical layout:
And it should certainly not be graphed like this:  yes, the data is arranged in rows and that's remained consistent, but the sequence has been reversed!  For some strange reason, this is the default layout in Excel, and it's difficult to fix.


The best way to present the tabular data in a graphical form - i.e. turning the table into a graph - is to match the layout and the sequence.
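If you build the chart in code rather than in Excel, the fix is usually a single line.  Here's a minimal matplotlib sketch (using the points column from the 'Teams in Form' table above) that keeps the horizontal bars in the same top-to-bottom order as the table:

```python
import matplotlib.pyplot as plt

# Data from the 'Teams in Form' table, in table order (1st place at the top)
teams = ["Liverpool", "Tottenham", "West Ham"]
points = [16, 15, 14]

fig, ax = plt.subplots()
ax.barh(teams, points)    # horizontal bars
ax.invert_yaxis()         # barh puts the first item at the bottom by default;
                          # inverting the axis makes top-to-bottom match the table
ax.set_xlabel("Points (last six games)")
ax.set_title("Teams in Form")
plt.tight_layout()
plt.show()
```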

And keep this consistent across all the data points on all the slides in your presentation.  You don't want your audience performing mental gymnastics to make sense of your data.  It would be like reading a book, then having to turn the page by 90 degrees after a few pages, then going back again on the next page, then turning it the other way after a few more pages.  

You want your audience to spend their mental power analysing and considering how to take action on your insights, and not to spend it trying to read your data.

Other articles with a data theme

Friday, 13 May 2022

Website Compromization

Test data, just like any other data, is open to interpretation.  The more KPIs you have, the more the analysis can be pointed towards one winning test recipe or another.  I've discussed this before, and used my long-suffering imaginary car salespeople to show examples of this.

Instead of a clear-cut winner, which is the best in all cases, we often find that we have to select the recipe which is the best for most of the KPIs, or the best for the main KPI, and accept that maybe it's not the best design overall.  Maybe the test recipe could be improved if additional design changes were made - but there isn't time to test these extra changes before the marketing team need to get their new campaign live (or the IT team need to deploy the winner in their next launch).

Do we have enough time to actually identify the optimum design for the site?  Or the page?  Or the element we're testing?  

Anyway - is this science, or is it marketing?  Do we need to make everything on the site perfectly optimized?  Is 'better than control' good enough, or are we aiming for 'even better'?

What do we have?  Is this site optimization, a compromise, or compromization?

Or maybe you have a test result that shows that your users liked a new feature - they clicked on it, they purchased your product.  Does this sound like a success story?  It does, but only until you realise that the new feature you promoted has diverted users' attention away from your most profitable path.  To put it another way, you coded a distraction. 

For example - your new banner promotes new sports laces for your new range of running shoes... so users purchase them but spend less on the actual running shoes.  And the less expensive shoes have a lower margin, so you actually make less profit. Are you trying to sell new laces, or running shoes?

Or you have a new feature that improves the way you sort your search results, with "Featured" or "Recommended" or "Most Relevant" now serving up results that are genuinely what customers want to see.  The problem is, they're the best quality but lowest-priced products in your inventory, so your conversion rate is up by 10% but your average order value is down by 15%.  What do you do?
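To see the size of the problem, just multiply the two effects together - revenue per visitor is conversion rate times average order value.  A back-of-the-envelope sketch with the illustrative figures above:

```python
# Back-of-the-envelope: conversion up 10%, average order value down 15%.
# Revenue per visitor = conversion rate x average order value.
conversion_change = 1.10   # +10%
aov_change = 0.85          # -15%

rpv_change = conversion_change * aov_change
print(f"Revenue per visitor changes by {rpv_change - 1:+.1%}")   # about -6.5%
```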

Are you following customer experience optimization, or compromization?

Sometimes, you'll need to compromise. You may need to sell the new range of shiny accessories with a potential loss of overall profit in order to break into a new market.  You may decide that a new feature should not be launched because although it clearly improves overall customer experience and sales volumes, it would bring down revenue by 5%.  But testing has shown what the cost of the new feature would be (and perhaps a follow-up test with some adjustments would lead to a drop in revenue of only 2%... would you take that?).    In the end, it's going to be a matter of compromization.

Monday, 6 September 2021

It's Not Zero!

 I started this blog many years ago.  It pre-dates at least two of my children, and possibly all three - back in the days when I had time to spare, time to write and time to think of interesting topics to write about.  Nowadays, it's a very different story, and I discovered that my last blog post was back in June.  I used to aim for one blog article per month, so that's two full months with no digital output here (I have another blog and a YouTube channel, and they keep me busy too).

I remember those first few months, though, trying to generate some traffic for the blog (and for another one I've started more recently, and which has seen a traffic jump in the last few days).  

Was my tracking code working?  Was I going to be able to see which pages were getting any traffic, and where they were coming from?  What was the search term (yes, this goes back to those wonderful days when Google would actually tell you your visitors' search keywords)?

I had weeks and weeks of zero traffic, except for me checking my pages.  Then I discovered my first genuine user - who wasn't me - actually visiting my website.  Yes, it was a hard-coded HTML website and I had dutifully copied and pasted my tag code into each page...  did it work?  Yes, and I could prove it:  traffic wasn't zero.

So, if you're at the point (and some people are) of building out a blog, website or other online presence - or if you can remember the days when you did - remember the day that traffic wasn't zero.  We all implemented the tag code at some point, or sent the first marketing email, and it's always a moment of relief when that traffic starts to appear.

Small beginnings:  this is the session graph for the first ten months of 2010, for this blog.  It's not filtered, and it suggests that I was visiting it occasionally to check that posts had uploaded correctly!  Sometimes, it's okay to celebrate that something isn't zero any more.

And, although you didn't ask, here's the same period January-October 2020, which quietly proves that my traffic increases (through September) when I don't write new articles.  Who knew?








Thursday, 24 June 2021

How long should I run my test for?

 A question I've been facing more frequently recently is "How long can you run this test for?", and its close neighbour "Could you have run it for longer?"

Different testing programs have different requirements:  in fact, different tests have different requirements.  The test flight of the helicopter Ingenuity on Mars lasted 39.1 seconds, straight up and down.  The Wright Brothers' first flight lasted 12 seconds, and covered 120 feet.  Which was the more informative test?  Which should have run longer?

There are various ideas around testing, but the main principle is this:  test for long enough to get enough data to prove or disprove your hypothesis.  If your hypothesis is weak, you may never get enough data.  If you're looking for a straightforward winner/loser, then make sure you understand the concept of confidence and significance.

What is enough data?  It could be 100 orders.  It could be clicks on a banner: the first test recipe to reach 100 clicks - or 1,000, or 10,000 - is the winner (assuming it has a large enough lead over the other recipes).
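If you want to put a number on 'enough data' before you launch, a standard two-proportion sample-size calculation is a reasonable starting point.  Here's a minimal sketch - the baseline conversion rate and the lift you want to detect are illustrative, so swap in your own figures:

```python
# Minimal sketch: how many visitors per recipe to detect a given conversion lift.
# Standard two-proportion sample-size formula; baseline and lift are illustrative.
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Visitors needed in each recipe to detect a change from p1 to p2."""
    z_alpha = norm.ppf(1 - alpha / 2)    # two-sided significance level
    z_beta = norm.ppf(power)             # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

baseline = 0.030                # 3.0% conversion today
target = baseline * 1.05        # hoping to detect a 5% relative lift
n = sample_size_per_group(baseline, target)
print(f"Roughly {n:,.0f} visitors per recipe")   # on the order of 200,000
```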

An important limitation to consider is this:  what happens if your test recipe is losing?  Losing money; losing leads; losing quotes; losing video views.  Can you keep running a test just to get enough data to show why it's losing?  Testing suddenly becomes an expensive business, when each extra day is costing you revenue.   One of the key advantages of testing over 'launch it and see' is the ability to switch the test off if it loses; how much of that advantage do you want to give up just to get more data on your test recipe?

Maybe your test recipe started badly.  After all, many do:  the change of experience from the normal site design to your new, all-improved, management-funded, executive-endorsed design is going to come as a shock to your loyal customers, and it's no surprise when your test recipe takes a nose-dive in performance for a few days.  Or weeks.  But how long can you give your design before you have to admit that it's not just the shock of the new design (sometimes called 'confidence sickness'), but that there are aspects of the new design that need to be changed before it will reach parity with your current site?  A week?  Two weeks?  A month?  Looking at the data over time will help here.  How was performance in week 1?  Week 2?  Week 3?  It's possible for a test to recover, and although a severe initial drop may mean the overall picture never recovers, if you can see that the fourth week was actually flat (for new and returning visitors) then you've found the point where users have adjusted to your new design.
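One practical way to do this is to cut the daily test data into weekly blocks and watch the gap between test and control.  A small pandas sketch, assuming you can export a daily table of visits and orders per recipe (the file and column names here are illustrative):

```python
import pandas as pd

# Illustrative daily export: columns date, recipe ('control' or 'test'), visits, orders
df = pd.read_csv("daily_test_data.csv", parse_dates=["date"])

weekly = (
    df.groupby([pd.Grouper(key="date", freq="W"), "recipe"])[["visits", "orders"]]
      .sum()
      .assign(conversion=lambda x: x["orders"] / x["visits"])["conversion"]
      .unstack("recipe")
)

# Week-by-week gap between test and control conversion
weekly["lift"] = weekly["test"] / weekly["control"] - 1
print(weekly)
# A gap that shrinks towards zero suggests 'confidence sickness' wearing off;
# one that stays the same or widens suggests the design itself needs changing.
```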

If, however, the weekly gaps are widening, or staying the same, then it's time to pack up and call it a day.

Let's not forget that you probably have other tests in your pipeline which are waiting for the traffic that you're using on your test.  How long can they wait until launch?

So, how long should you run your test for?  As long as possible to get the data you need, and maybe longer if you can, unless it's
- suffering from confidence sickness (keep it running)
- losing badly, and consistently (unless you're prepared to pay for your test data)
- losing and holding up your testing pipeline

Similar posts I've written about online testing

Getting an online testing program off the ground
Building Momentum in Online testing
How many of your tests win?

Wright Brothers Picture:

"Released to Public: Wilber and Orville Wright with Flyer II at Huffman Prairie, 1904 (NASA GPN-2002-000126)" by pingnews.com is marked with CC PDM 1.0

Tuesday, 23 February 2021

Knowing Your KPI is Key

I've written in the past about KPIs, and today I find myself sitting at my computer about to re-tell a story about KPIs - with another twist.

Two years ago, almost to the day, I introduced you all to Albert, Britney and Charles, my three fictitious car salespeople.  Back in 2019, they were selling hybrid cars, and we had enough KPIs to make sure that each of them was a winner in some way (except Albert.  He was our 'control', and he was only there to make the others look good.  Sorry, Albert).

Well, two years on, selling cars has gone online.  Covid-19 and all that means that sales of cars are now handled remotely - with video views, emails, and Zoom calls - and targets have been realigned as a result.  The management team have realised that KPIs need to change in line with the new targets (which makes sense), and there are now a number of performance indicators being tracked.

Here are the results from January 2021 for our three long-standing (or long-suffering) salespeople.







Metric                       | Albert  | Britney | Charles
Zoom sessions                | 411     | 225     | 510
Calls answered               | 320     | 243     | 366
Leads generated              | 127     | 77      | 198
Cars sold                    | 40      | 59      | 60
Revenue (£)                  | 201,000 | 285,000 | 203,500
Average car value (£)        | 5,025   | 4,830   | 3,391
Conversion (contact to lead) | 17.4%   | 16.5%   | 22.6%
Conversion (lead to sale)    | 31.5%   | 76.6%   | 30.3%

And again we ask ourselves:  who was the best salesperson?  And, more important, which of the KPIs is actually the KEY performance indicator?

Albert:  had the highest average car value

Britney:  had the highest revenue (40% more than Albert or Charles) and by far the highest conversion from lead to sale.

Charles:  had the most Zoom sessions; calls answered; leads generated; cars sold and conversion from contact  to lead.

Surely Charles won?  Except that wages, overheads and shareholder dividends aren't paid with Zoom sessions; bonuses aren't paid in phone calls and pensions aren't paid with actual cars.

The KPI of most businesses (and certainly this one) is revenue - or, more specifically, profit margin.  It's very nice to be able to talk about other metrics and to use these to improve the business, but if you're a business and your KPI isn't something related to money, then you're probably not aiming for the right target.  

Yes, you can certainly use other metrics to improve the business:  for example, Charles desperately needs to learn how to sell higher-value cars.  He's extremely productive - even prolific - with his customer contacts, but he's £1,400 down per car compared to Britney, and £1,600 down per car compared to Albert.  Additionally, if Britney learned to make her sales conversations and Zoom technique faster and more efficient, her sales volumes would increase.  This use of data to drive action is what makes your analysis actionable.

So:  metrics and KPIs aren't the same thing.  Select the KPI that actually matches the business aim (typically margin and revenue) and don't get distracted by lesser KPIs that are actually just calculated ratios.  Use all the metrics to improve business performance, but pick your winner based on what really matters to your company.
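As a footnote to the table above: most of those 'lesser KPIs' really are just calculated ratios of the raw counts.  A small pandas sketch that rebuilds them from the January 2021 figures:

```python
import pandas as pd

# Raw (non-derived) figures from the January 2021 table
data = pd.DataFrame({
    "Albert":  {"zoom": 411, "calls": 320, "leads": 127, "cars": 40, "revenue": 201_000},
    "Britney": {"zoom": 225, "calls": 243, "leads": 77,  "cars": 59, "revenue": 285_000},
    "Charles": {"zoom": 510, "calls": 366, "leads": 198, "cars": 60, "revenue": 203_500},
}).T

# The 'derived' rows are simply ratios of the raw metrics
data["avg_car_value"] = data["revenue"] / data["cars"]
data["contact_to_lead"] = data["leads"] / (data["zoom"] + data["calls"])
data["lead_to_sale"] = data["cars"] / data["leads"]

print(data.round(3))
# The calculated columns match the table - they're ratios of other metrics,
# which is why revenue (or margin) remains the KPI that actually pays the bills.
```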

I have looked at KPIs in some of my other articles:

The Importance of Being Earnest with your KPIs
Why Test Recipe KPIs are Vital
Web Analytics and Testing - A summary so far



Tuesday, 8 December 2020

A/B testing without a 50-50 split

Whenever people ask me what I do for a living, I [try not to] launch off into a little speech about how I improve website design and experience by running tests, where we split traffic 50-50 between test and control, and mathematically determine which is better.  Over the years, it's been refined and dare I say optimized, but that's the general theme, because that's the easiest way of describing what I do.  Simple.

There is nothing in the rules, however, that says you have to split traffic 50-50.  We typically say 50-50 because it gives each visitor an equal chance of landing in either of the two groups - like tossing a coin - but that's just tradition (he says, tearing up the imaginary rule book).

Why might you want to test on a different split setting?

1.  Maybe your test recipe is so completely 'out-there' and different from control that you're worried it'll affect your site's KPIs, and you want to test more cautiously.  So, why not do a 90-10?  You only risk 10% of your total traffic - and providing that 10% is large enough to produce a decent sample size, why risk a further 40%?  And if it starts winning, then maybe you increase to an 80-20 split, and move towards 50-50 eventually.

2.  Maybe your test recipe is based on a previous winner, and you want to get more of your traffic into a recipe that should be a winner as quickly as possible (while also checking that it is still a winner).  So you have the opportunity to test on a 10-90 split, with most of your traffic on the test experience and 10% held back as a control group to confirm your previous winner.

3.  Maybe you need test data quickly - you are confident you can use historic data for the control group, but you need to get data on the test page/site/experience, and for that, you'll need to funnel more traffic into the test group.  You can use a combination of historic data and control group data to measure the current state performance, and then get data on how customers interact with the new page (especially if you're measuring clicks on a new widget on the page, and how customers like or dislike it).

4.  Maybe you're running a Multi-Armed Bandit test.

Things to watch out for

If you decide to run an A/B test on uneven splits, then beware:

- You need to emphasise conversion rates, and calculate your KPIs as "per visitor" or "per impression".  I'm sure you do this already with your KPIs, but absolute numbers of orders, clicks or revenue will not be suitable here.  If you have twice as much traffic in B compared to A (a 67-33 split), then you should expect twice as many success events from an identical success rate; you'll need to divide by visit, visitor or page view (depending on your metric, and your choice).  There's a quick sketch of this after the list below.

- You can't do multivariate analysis on uneven splits - as I mentioned in my articles on MVT analysis, you need equal-ish numbers of visits in order to combine the data from the different recipes.
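Here's the quick sketch promised above: a minimal example (all figures invented) of comparing a 90-10 split on per-visitor rates rather than raw counts, with a two-proportion z-test that is perfectly comfortable with unequal group sizes:

```python
# Uneven (90-10) split: compare rates per visitor, never raw counts.
# All figures below are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

visitors = {"control": 90_000, "test": 10_000}
orders = {"control": 2_700, "test": 330}

for name in visitors:
    rate = orders[name] / visitors[name]
    print(f"{name}: {orders[name]:,} orders / {visitors[name]:,} visitors = {rate:.2%}")

# A two-proportion z-test handles unequal group sizes
stat, p_value = proportions_ztest(
    count=[orders["test"], orders["control"]],
    nobs=[visitors["test"], visitors["control"]],
)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
```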


Friday, 6 March 2020

Analysis versus Interpretation

We have had a disappointingly mild winter.

It snowed on two days...


You will easily notice the bias in that sentence. Friends and long-time readers will know that I love snow, for many reasons. The data from the Meteorological Office puts the winter (1 December - 29 February) into context, using a technique that I've mentioned before - ranking the specific period against the rest of the data set.


So, by any measure, it was a wet and mild winter. Far more rain than usual (across the country), and temperatures were above average.

This was posted on Facebook, a website renowned for its lack of intelligent and considered discussion, and known for the sharp-shooting debates.  Was it really wetter than usual? Is global warming to blame? Is this an upward trend (there is insufficient data here) or a fluke?

And then there's the series of distraction questions - how long have records been held? Have the temperature and rainfall data been recorded since the same original date? Is any of that relevant? No.

In my experience, analysis is hard, but anybody, it seems, can carry out the interpretation.  However, interpretation is wide open to personal bias, and the real skill is in treating the data impartially and interpreting it from that viewpoint.  It requires additional data research - for example, is February's data an anomaly or is it a trend?  Time to go and look in the archive and support your interpretation with more data.


Friday, 24 January 2020

Project Management: A Trip To The Moon

Scene: meeting room, some people dialling in remotely. The plan is to launch a manned rocket to the moon, and the project manager (PM) is kicking off the project.
PM "Right, team, let's plan this space journey to the moon. What kind of fuel will we use in our rocket?"
Designer 'Before I answer that, we want to discuss the colour of the nose cone. The design is to paint it blue.'
PM "Okay, blue is fine. Have you had any thoughts about the engine?"
Designer 'No, but we actually think a red nosecone might be better.'
PM "Noted. Let's move on from that, and come back to it nearer the launch time."
Marketing: We thought blue. Now, how we will we choose the pilots? PM "I was thinking that we would have a rigorous selection process."
Marketing: "We can do that. But we'd like to address the name of the spaceship. Our subsidiary want to call it the USSS Pegasus. We want to refer to it as the US Pegasus - the 'SS' was a suggestion from our previous owner. As this is a combined program, we're going to go with the US Pegasus."
PM "Noted. The US Pegasus. Now, about the pilots..."
Designer "And the name of the ship must be in blue text."
PM [making notes] "...blue text..."
Designer "To match the nose cone." PM "Now, circling back to the question of the pilots."
Stakeholder: "Oh, you can't say that. Circling back suggests that the ship isn't going to land on the moon." PM "Sure. So let's go on to the pilots?"
Stakeholder; "Yes, we can sort that out." PM "Thanks. Now - timelines. Do you have a target date for landing on the moon?" Stakeholder; "Yes, we want to land on 28 July, 2020. When do you need to launch?"
PM "How long will the flight take?" Stakeholder "That depends on the fuel." PM "Doesn't it depend on the engine?" Marketing "Possibly. But it's important that we land on 28 July." Stakeholder "Yes. 28 July. We've set that date with our president. It's his birthday"
PM "So who can give me the launch date?"
Stakeholder "Well, we expected you to provide that." PM "Okay, let's assume it takes four days to reach the moon. Can you have everything built and fuelled by then?" Stakeholder "And we'll want to check everything works." PM "Like a test launch?" Marketing "Oh no, we can't have a test launch. We can't have our competitors knowing what we're doing."
PM "No test launch?" Marketing "No." PM "And the pilots?" Stakeholder "I'm working on it." PM "And the fuel?" Stakeholder "I'll find somebody. Somebody somewhere must know something about it."
Marketing "And we'll need hourly readouts on speed. Preferably minute by minute. And oxygen levels; distance from the earth; internal and external temperatures. All those things." PM "Are you interested in the size of the engine?"
Stakeholder "We've been planning this for six months already. We know it'll need an engine." Engineer; "Sorry I'm late, I've just joined." PM "Thanks for joining. We're just discussing the rocket engine. Do you know what size it will be?" Engineer: "Big." PM "Big enough?" Engineer: "Yes. 1000 cubic units. Big enough." PM: "Great. Thanks. Let's move on." Stakeholder: "Wait, let's just check on that detail. Are you sure?"

Engineer; "Yes. I've done the calculations. It's big enough." Stakeholder: "To get to the moon?" Engineer: "Yes." Stakeholder: "And back?" Engineer: "Yes." Designer: "Even if we have blue text instead of red?"

Engineer: "Yes."
Marketing; "What about if we have red text."
Engineer; "The colour of the text isn't going to affect the engine performance." Stakeholder "Are you sure?"
Engineer: "We're not burning the paint as fuel. We're not painting the engine. We're good." PM: "Thank you. Now; how much fuel do you need?"
Engineer: "That depends. How quickly do you want to get there?" PM: "We need to land on the moon on 28 July 2020. I've estimated a four-day flight time." Engineer; "I'd make it five days, to be on the safe side, and I would calculate 6000 units of class-one fuel, approximately." PM: "Okay, that sounds reasonable. Will the number of pilots affect the fuel calculation?" Engineer: "Yes, but it won't significantly change the 6000 units estimate. When you know the number and mass of the pilots, we can calculate the fuel tank size we'll need."
Stakeholder; "But we won't know that until launch." PM: "Until launch?" Stakeholder: "Yes. We don't know how many people we want to send to the moon until the day of the launch." PM: "And the colour of the text? And the nose cone? And the actual text."
Stakeholder: "Will all depend on people we send."
PM: "No test launch?" Marketing; "No. We need this to be secret so that our competitors don't know what we're doing." PM: "So we're launching an undetermined number of people, in an untested rocket of unknown name and size, to the moon, with an approximate flight time and fuel load, at some point in the future."
Marketing: "But it must land on 28 July." PM: "2020, yes. Ok, We've run out of time for today, but let's catch up tomorrow with progress. Between now and then, let's work to decide some of the smaller details like the fuel and the engine, and tomorrow we can cover the main areas, such as the size of the rocket and where it's going. Thank you, everybody. Goodbye for now."

Tuesday, 21 May 2019

Three-Factor Multi-Variate Testing

TESTING ALL POSSIBILITIES WITHOUT TESTING EVERYTHING

My favourite part of my job is determining what to test, and planning how to run a test.  I enjoy the analysis afterwards, but the most enjoyable part of the testing process is deciding what the test recipes will actually be.  I've covered - at length - test design and planning, and also multi-variate testing.  I particularly enjoy multi-variate testing, since it allows you to test all possibilities without having to test everything.


In my previous posts, where I introduced MVT, I've only covered two-factor MVT: should this variable be black or red?  Should it be a picture of a man or a woman?  Should it say 'Special offer' or 'Limited time'?  Is it x or is it y?  How do you analyse MVT results?  In this post, I'm going to take the discussion of testing one step further, and look at three-factor multi-variate testing:  should it be x, y or z?


Just as there are limited opportunities for MVT, the range of opportunities for three-factor MVT is potentially even more limited.  However, I'd like to explain that this doesn't have to be the case, and that it just takes careful planning to determine when and how to set up a test where there are three possible 'best answers'.


SCENARIO


You run a domestic travel agency, which specialises in arranging domestic travel for customers across the country (this works better if you imagine it in the US, but it works for smaller countries too).  You provide a full door-to-door service, handling everything from fuel, insurance, tickets and transfers - whatever it takes, you can do it.  Consequently, you are in high demand around Christmas and Thanksgiving (see, I told you this worked better in the US), and potentially other holiday periods.  Yes, you're a travel agency firm based on Planes, Trains and Automobiles.


It's the run-up to the largest sales time of the year, as you prepare to reunite distant family members across the country for those big family celebrations and parties and whatever else.  What do you lead with on your website's homepage?


Planes?

Trains?
Or automobiles?

If you want to include buses, look out for a not-yet-planned post on four-factor MVT.  I'll have it ready by Christmas.


So far, this would be a straightforward A/B/C test, with a plane, a car and a train.  Your company colours are yellow, so let's go with that:



Your marketing team are also unsure how to lead with their messaging - should they emphasise price, reliability, or an emotional connection?


They can't choose between

"Cross the country without costing the world" (price)
"Guaranteed door-to-door on time, every time" (reliability)
"Bring your smile to their doorstep this holiday" (emotional)

So now we have nine recipes, A-I.


A: Plane plus price
B: Plane plus reliability
C: Plane plus emotions

D: Car plus price
E: Car plus reliability
F: Car plus emotions

G: Train plus price
H: Train plus reliability
I: Train plus emotions


Now, somebody in the exec suite has decided that now might be the time to try out a new set of corporate colours.  Yellow is bright and cheery, but according to the exec, it can be seen as immature, and not very sophisticated.  The alternatives are red and blue (plus the original yellow).


Here goes:  there are now 3x3x3 possible variations - that's 27 altogether.  And you can't run a test with 27 recipes - for a start, there aren't enough letters in the alphabet.  There's also traffic and timing to consider - it will take months to run a test like that to get any level of significance.  Nevertheless, this is an executive request, so we'll have to make it happen.
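If you want to sanity-check that count, or generate the full recipe list for a test plan, a couple of lines of Python will do it.  A minimal sketch:

```python
from itertools import product

colours = ["Red", "Blue", "Yellow"]
vehicles = ["Plane", "Train", "Car"]
messages = ["Price", "Reliability", "Emotions"]

# Every combination of the three three-level factors
recipes = list(product(colours, vehicles, messages))
print(len(recipes))    # 27
for colour, vehicle, message in recipes[:3]:
    print(colour, vehicle, message)
```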


Firstly, the visuals:  if this was just a two-variable test, then we'd have nine recipes, as you can see below.

[Image: the nine colour and vehicle combinations]
However, each of these vehicle/colour combinations has three more options (based on the marketing message that we select) - here is a small sample of the 27 total combinations, to give you an idea.

[Images: a sample of the 27 colour, vehicle and message combinations]
This is not a suitable testing set, but it gives you an idea of the total variations that we're looking at.  The next step, as we did with the more straightforward two-factor MVT, is to identify our orthogonal set - the minimum recipes that we could test that would give us sufficient information to infer the performance of the recipes that we don't test.  It's time to charge up your spreadsheet.

THE RECIPES - AN ORTHOGONAL SET

There are 3*3*3 = 27 different combinations of colour, text and vehicle... here's the list, since you're wondering ;-)



Recipe Colour Vehicle Message
A Red Plane Price
B Red Plane Reliability
C Red Plane Emotions
D Red Train Price
E Red Train Reliability
F Red Train Emotions
G Red Car Price
H Red Car Reliability
I Red Car Emotions
J Blue Plane Price
K Blue Plane Reliability
L Blue Plane Emotions
M Blue Train Price
N Blue Train Reliability
O Blue Train Emotions
P Blue Car Price
Q Blue Car Reliability
R Blue Car Emotions
S Yellow Plane Price
T Yellow Plane Reliability
U Yellow Plane Emotions
V Yellow Train Price
W Yellow Train Reliability
X Yellow Train Emotions
Y Yellow Car Price
Z Yellow Car Reliability
AA Yellow Car Emotions


The recipes with the faint green shading would form a simple orthogonal set; here they are for clarity:

Recipe | Colour | Vehicle | Message
A      | Red    | Plane   | Price
E      | Red    | Train   | Reliability
I      | Red    | Car     | Emotions
K      | Blue   | Plane   | Reliability
O      | Blue   | Train   | Emotions
P      | Blue   | Car     | Price
U      | Yellow | Plane   | Emotions
V      | Yellow | Train   | Price
Z      | Yellow | Car     | Reliability


Note that each colour, vehicle and message appear three times each; there are therefore nine recipes that we need.  This is still a considerable number, but it's a significant saving from 27 in total.
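If you'd like to generate a set like this rather than pick it out by hand, one simple construction for three three-level factors is a Latin-square rule: pair every colour with every vehicle, then choose the message index as (colour index + vehicle index) mod 3.  Here's a sketch that reproduces the nine recipes above (it's one valid orthogonal set; others exist):

```python
# Sketch: build a nine-recipe orthogonal set for three 3-level factors
# using a Latin-square rule (message index = colour index + vehicle index, mod 3).
colours = ["Red", "Blue", "Yellow"]
vehicles = ["Plane", "Train", "Car"]
messages = ["Price", "Reliability", "Emotions"]

orthogonal_set = [
    (colour, vehicle, messages[(c + v) % 3])
    for c, colour in enumerate(colours)
    for v, vehicle in enumerate(vehicles)
]

for recipe in orthogonal_set:
    print(recipe)
# Each colour, vehicle and message appears exactly three times,
# and every colour/vehicle pair appears exactly once.
```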

THE ANALYSIS

Which colour?  How to find the best variation for each element


Select the recipes which will give us a reading on the best colour by choosing recipes where the other variants cancel to noise:


This is simple (and simpler than the two-factor version):  we simply add the results for all the "red" recipes, and compare them with the combined results for all the "blue" recipes and all the "yellow" recipes.


Let's take a look at some hypothetical data, based on the orthogonal recipe set shown above:


Recipe            | a      | e      | i      | k      | o      | p      | u      | v      | z
Visits            | 1919   | 1922   | 1932   | 1939   | 1931   | 1934   | 1915   | 1955   | 1944
Bookings          | 193    | 194    | 189    | 194    | 205    | 192    | 200    | 209    | 206
Revenue (k)       | £14.2  | £14.6  | £14.4  | £14.3  | £15.6  | £13.94 | £14.8  | £15.7  | £15.4
Conversion        | 10.1%  | 10.1%  | 9.8%   | 10.0%  | 10.6%  | 9.9%   | 10.4%  | 10.7%  | 10.6%
Lift              | -      | 0.4%   | -2.7%  | -0.5%  | 5.6%   | -1.3%  | 3.8%   | 6.3%   | 5.4%
Avg Booking Value | £73.58 | £75.26 | £76.19 | £73.71 | £76.10 | £72.60 | £74.00 | £75.12 | £74.76
Lift              | -      | 2.3%   | 3.6%   | 0.2%   | 3.4%   | -1.3%  | 0.6%   | 2.1%   | 1.6%
RPV               | £7.40  | £7.60  | £7.45  | £7.37  | £8.08  | £7.21  | £7.73  | £8.03  | £7.92
Lift              | -      | 2.7%   | 0.7%   | -0.3%  | 9.2%   | -2.6%  | 4.4%   | 8.5%   | 7.1%


I've shown the raw metrics and the calculated metrics for the recipes, but it's important to remember at this point:  the recipes shown here probably won't include the best recipe.  After all, we're testing nine recipes out of a total of 27, so we have only a one in three chance of selecting the optimum combination.
What we need to do next, as I mentioned above, is to combine the data for all the yellow recipes, and compare with the red and the blue.
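If your per-recipe results sit in a table, this 'combine and compare' step is a simple group-by.  A minimal pandas sketch using the visits, bookings and revenue figures from the orthogonal set above (the same approach is then repeated for vehicle and message):

```python
import pandas as pd

# Per-recipe results from the orthogonal test (revenue in £)
results = pd.DataFrame({
    "recipe":   list("aeikopuvz"),
    "colour":   ["Red", "Red", "Red", "Blue", "Blue", "Blue", "Yellow", "Yellow", "Yellow"],
    "visits":   [1919, 1922, 1932, 1939, 1931, 1934, 1915, 1955, 1944],
    "bookings": [193, 194, 189, 194, 205, 192, 200, 209, 206],
    "revenue":  [14200, 14600, 14400, 14300, 15600, 13940, 14800, 15700, 15400],
})

# Combine all recipes that share a colour, then recalculate the rate metrics
by_colour = results.groupby("colour")[["visits", "bookings", "revenue"]].sum()
by_colour["conversion"] = by_colour["bookings"] / by_colour["visits"]
by_colour["rpv"] = by_colour["revenue"] / by_colour["visits"]
print(by_colour.round(3))
# Repeat with groupby("vehicle") and groupby("message") for the other two factors.
```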



            | Red (aei) | Blue (kop) | Lift vs Red | Yellow (uvz) | Lift vs Red
Visits      | 5773      | 5804       |             | 5814         |
Bookings    | 576       | 591        |             | 615          |
Revenue (k) | £43.2     | £43.84     |             | £45.9        |
Conversion  | 9.98%     | 10.18%     | 2.1%        | 10.58%       | 6.0%
ABV         | 75.00     | 74.18      | -1.1%       | 74.63        | -0.5%
RPV         | 7.48      | 7.55       | 0.9%        | 7.89         | 5.5%


So we can see from our simple colour analysis (adding all the results for the recipes which contain Red, vs Blue, vs Yellow) that Yellow is the best.  The Conversion has a 6% lift, and while Average Booking Value is slightly lower, the Revenue Per Visit is still 5.5% higher for the yellow recipes than it is for the Red.

Now we do the same for the vehicles: plane, train or car?
            | Plane (aku) | Train (eov) | Lift vs Plane | Car (ipz) | Lift vs Plane
Visits      | 5773        | 5808        |               | 5810      |
Bookings    | 587         | 608         |               | 587       |
Revenue (k) | £43.3       | £45.9       |               | £43.74    |
Conversion  | 10.17%      | 10.47%      | 3.0%          | 10.10%    | -0.6%
ABV         | 73.76       | 75.49       | 2.3%          | 74.51     | 1.0%
RPV         | 7.50        | 7.90        | 5.4%          | 7.53      | 0.4%

Clear winner in this case:  it's Train, which is the best for conversion, average booking value and revenue per visit.

And finally, the messaging:  emotional, price or reliability?

            | Price (apv) | Emotion (iou) | Lift vs Price | Reliability (ekz) | Lift vs Price
Visits      | 5808        | 5778          |               | 5805              |
Bookings    | 594         | 594           |               | 594               |
Revenue (k) | £43.84      | £44.8         |               | £44.3             |
Conversion  | 10.23%      | 10.28%        | 0.5%          | 10.23%            | 0.1%
ABV         | 73.80       | 75.42         | 2.2%          | 74.58             | 1.0%
RPV         | 7.55        | 7.75          | 2.7%          | 7.63              | 1.1%

And in this case, it's Emotion which is the best, with clearly better average booking value and revenue.  It would appear that price is not the best way to lead your messaging.

CONCLUSION AND THOUGHTS

The best combination is:
Yellow Train, with Emotion messaging.


Notice that the performance of the recipes that we actually tested is in agreement with the winning combination (based on the calculations).


Recipes that contain none of the winning elements performed the worst:

A  - Red Plane, Price :  RPV £7.40
K - Blue Plane, Reliability:  RPV £7.37
P - Blue Car, Price  :  RPV £7.21

Recipes that contain just one of the winning elements produced slightly  better results:

E - Red Train, Reliability:  £7.60
I - Red Car, Emotions:  £7.45

Z - Yellow Car, Reliability: £7.92*

Recipes that contained two of the three winning elements were the best performers:

O - Blue Train, Emotions:  £8.08
U - Yellow Plane, Emotions:  £7.73
V - Yellow Train, Price: £8.03


I would strongly recommend running a follow-up test, with the two winners from the first selection (O and V) along with the proposed winner based on the analysis, Yellow Train with Emotions.  It's possible that this proposed winner will be the best; there's also the possibility that it may be close to but not as good as O or V. 

*There's also an argument for including Z (Yellow Car, Reliability) as an outlier, given its performance.  


There are some clear losers that do not need to be pursued:  notice how two of the bottom three performing recipes contain Blue and Price.  All of the Price recipes that we tested - A, P and V, had lower than typical Average Booking Value, and this includes recipe V, which was one of the best recipes.  With a different message (Emotions, most likely), Recipe V would be a runaway success.

It's not surprising that a follow-up is needed; remember that we've only tested nine out of 27 combinations, and it's unlikely that we'll have hit the optimum design first time around.  However, by careful selection of our original recipes, we need only test four more (at the most) to identify the best from all 27.  Finding the best combination from 27, by only testing 13 is a definite winner.  This is the power of multi-variate testing: the ability to test all possibilities without having to test everything.

Here's my series on Multi Variate Testing

Preview of Multi Variate testing
Web Analytics: Multi Variate testing 
Explaining complex interactions between variables in multi-variate testing
Is Multi Variate Testing an Online Panacea - or is it just very good?
Is Multi Variate Testing Really That Good 
Hands on:  How to set up a multi-variate test
And then: Three Factor Multi Variate Testing - three areas of content, three options for each!