
Sunday, 24 November 2024

Testing versus Implementing - why not just switch it on?

"Why can't we just make a change and see what happens? Why do we have to build an A/B test - it takes too long!  We have a roadmap, a pipeline and a backlog, and we haven't got time."

It's not always easy to articulate why testing is important - especially if your company is making small, iterative, data-backed changes to the site and your tests consistently win (or, worse still, go flat).  The IT team is testing carefully and cautiously, but the time taken to build the test and run it is slowing down everybody's pipelines.  You work with the IT team to build the test (which takes time), it runs (which takes even more time), you analyze the test (why?) and you show that their good idea was indeed a good idea.  Who knew?


Ask an AI what a global IT roadmap looks like...

However, if your IT team is building and deploying something genuinely new to your website - a new way of identifying a user's delivery address, or a new way of helping users decide which sparkplugs or ink cartridges or running shoes they need - something innovative and very different, then I would strongly recommend that you test it with them, even if there is strong evidence for its effectiveness.  Yes, they have carried out user-testing and it performed well.  Yes, their panel loved it.  Even the Head of Global Synergies liked it, and she's a tough one to impress.  Their top designers have spent months in collaboration with the project manager, and their developers have gone through the agile process so many times that they're as flexible as ballet dancers.  They've only just made the deadline for pre-Christmas implementation, and now is the time to go live.  It is ready.  The Global Integration Leader, though, has said that they must test before they launch - but that's okay, as they have allocated just enough time for a pre-launch A/B test, and they'll go live as soon as the test is complete.


Sarah Harries, Head of Global Synergies

Everything hinges on the test launching on time, which it does.  Everybody in the IT team is very excited to see how users engage with the new sparkplug selection tool and - more importantly for everybody else - how much it adds to overall revenue.  (For more on this, remember that clicks aren't really KPIs). 

But the test results come back, and you have to report that the test recipe is underperforming: conversion is down 6.3%.  Engagement looks healthy at 11.7%, but those users are dragging down overall performance.  The page exit rate is lower, but fewer users are going through checkout and completing a purchase.  Even after two full weeks, the data is looking negative.

Can you really recommend implementing the new feature?  No; but that's not the end of the story.  It's now your job to unpick the data and turn analysis into insights:  why didn't it win?!

The IT team, understandably, want to implement.  After all, they've spent months building this new selector and the pre-launch data was all positive.  The Head of Global Synergies is asking them why it isn't on the site yet.  Their timeline allowed three weeks for testing and you've spent three weeks testing.  Their unspoken assumption was that testing was a validation of the new design, not a step that might turn out to be a roadblock, and they had not anticipated any need for post-test changes.  It was challenging enough to fit in the test, and besides, the request was to test it.

It's time to interrogate the data.

Moreover, they have identified some positive data points:

*  Engagement is an impressive 11.7%.  Therefore, users love it.
*  The page exit rate is lower, so more people are moving forwards.  That's all that matters for this page:  get users to move forwards towards checkout.
*  The drop in conversion is coming from the pages in the checkout process.  That can't be related to the test, which is in the selector pages.  It must be a checkout problem.

They question the accuracy of the test data, which contradicts all their other data.

* The sample size is too small.
* The test was switched off before it had a chance to recover from its 6.3% drop in conversion.

They suggest that the whole A/B testing methodology is inaccurate.

* A/B testing is outdated and unreliable.  
* The split between the two groups wasn't 50-50.  There were 2.2% more visitors in A than in B.  (This one, at least, is easy to check objectively - see the sketch below.)

Maybe they'll comment that the data wasn't analyzed or segmented correctly, and raise points like these:

* The test data includes users buying other items with their sparkplugs.  These should be filtered out.
* The test data must have included users who didn't see the test experience.
* The data shows that users who browsed on mobile phones only performed at -5.8% on conversion, so they're doing better than desktop users.
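One of these objections - the uneven split - can be settled objectively.  Whether a 2.2% imbalance matters depends entirely on how many visitors sit behind it, and a quick sample-ratio check answers that.  Here's a minimal sketch in Python (the visitor counts are made-up illustrative numbers):

```python
# Sample-ratio check: is the observed A/B visitor split consistent with an
# intended 50-50 allocation?  A minimal sketch; the visitor counts passed in
# below are made-up illustrative numbers.
from scipy.stats import chisquare


def sample_ratio_check(visitors_a: int, visitors_b: int, alpha: float = 0.001):
    """Chi-square goodness-of-fit test against an expected 50-50 split."""
    total = visitors_a + visitors_b
    stat, p_value = chisquare([visitors_a, visitors_b],
                              f_exp=[total / 2, total / 2])
    # Sample-ratio checks usually use a strict threshold (e.g. 0.001), because
    # they're run routinely and false alarms are disruptive.
    return round(stat, 2), round(p_value, 4), p_value < alpha


# The same 2.2% gap can be random noise at small volumes and a genuine
# sample-ratio mismatch at large ones - the visitor counts decide.
print(sample_ratio_check(5_110, 5_000))      # small test: likely within random variation
print(sample_ratio_check(511_000, 500_000))  # large test: likely a real mismatch
```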

Remember:  none of this is personal.  You are, despite your best efforts, criticising a project that they've spent weeks or even months polishing and producing.  Nobody until this point has criticised their work - in fact, everybody has said how good it is.  It's not your fault: your job is to present the data and to provide insights based on it.  As a testing professional, your job is to run and analyse tests, not to be swayed into showing the data in a particular way.

They ran the test at the request of the Global Integration Leader, and burnt three weeks waiting for it to complete.  The deadline for implementing the new sparkplug selector is Tuesday, and they can't stop the whole IT roadmap (which is dependent on this first deployment) just because one test showed some negative data.  They would have preferred not to test it at all, but it remains your responsibility to share the test data with other stakeholders in the business - the marketing and merchandizing teams, who have a vested interest in the site's financial performance.  It's not easy, but it's still part of your role to present the unbiased, impartial data that makes up your test analysis, along with data-driven recommendations for improvements.

It's not your responsibility to make the go/no-go decision, but it is up to you to ensure that the relevant stakeholders and decision-makers have the full data set in front of them when they make the decision.  They may choose to implement the new feature anyway, taking into account that it will need to be fixed with follow-up changes and tweaks once it's gone live.  It's a healthy compromise, providing that they can pull two developers and a designer away from the next item on their roadmap to do retrospective fixes on the new selector.  
Alternatively, they may postpone the deployment and use your test data to address the conversion drops that you've shared.  How are the conversion drop and the engagement data connected?  Is the selector providing valid and accurate recommendations to users?  Does the data show that they enter their car colour and their driving style, but then go to the search function when they reach a question about their engine size?  Is the sequence of questions optimal?  Make sure that you can present these kinds of recommendations - it shows the value of testing, as your stakeholders would not be able to identify these insights from an immediate implementation.

So - why not just switch it on?  Here are four good reasons to share with your stakeholders:

* Test data will give you a comparison of whole-site behaviour - not just 'how many people engaged with the new feature?' but also 'what happens to those people who clicked?' and 'how do they compare with users who don't have the feature?'
* Testing will also tell you about the financial impact of the new feature (good for return-on-investment calculations, which are tricky with seasonality and other factors to consider)
* Testing has the key benefit that you can switch it off - at short notice, and at any time.  If the data shows that the test recipe is losing money badly, then after a discussion with the key stakeholders you can pull the plug within minutes - there's no need to wait until the next IT deployment window to undeploy the new feature.
* Testing will give you useful data quickly - within days you'll see how it's performing; within weeks you'll have a clear picture.




Monday, 18 November 2024

Designing Personas for Design Prototypes

Part of my job is validating (i.e. testing and confirming) new designs for the website I work on.  We A/B test the current page against a new page, and confirm (or otherwise) that the new version is indeed better than what we have now.  It's often the final checkpoint before the new design is implemented globally, although it's not always a go/no-go decision.

The new design has gone through various other testing and validation first - a team of qualified user experience (UX) designers and user interface (UI) designers will have decided how they want to improve the current experience.  They will have undertaken various trials with their designs, and will have built prototypes that have been through rounds of user research; one of the key parts of the design process, somewhere near the beginning, is the development of user personas.

A persona in this context is a character that forms a 'typical user', who designers and product teams can keep in mind while they're discussing their new design.  They can point to Jane Doe and say, "Jane would like this," or, "Jane would probably click on this, because Jane is an expert user."

I sometimes play Chess in a similar way, when I play solo or when I'm trying to analyze one of my own games.  I make a move, and then decide what my opponent would play.  I did this a lot when I was a beginner, learning to play (about 40 years ago) - if I move this piece, then he'll move that piece, and I'll move this piece, and I'll checkmate him in two moves!  This was exactly the thought process I would go through - making the best moves for me, and then guessing my opponent's next move.


It rarely worked out that way, though, when I played a real game.  Instead, my actual opponent would see my plans, make a clever move of his own and capture my key piece before I got the chance to move it within range of his King.


Underestimating (or, to quote a phrase, misunderestimating) my opponent's thoughts and plans is a problem that's inherent in playing skill and strategy games like Chess.  In my head, my opponent can only play as well as I can.

However, when I play solo, I can make as many moves as I like - both sides do whatever I want - and I can win because I've constructed an opponent who follows the perfect sequence of moves to let me win.  I can even fool myself into believing that I won because I had the better ideas and the best strategy.

And this is a common pitfall among Persona Designers (I've written a whole series on the pitfalls of A/B testing).  They impose too much of their own character onto their persona, and suddenly they don't have a persona, they have a puppet.

"Jane Doe is clever enough to scroll through the product specifications to find the compelling content that will answer all her questions."

"Joe Bloggs is a novice in buying jewellery for his wife, so he'll like all these pretty pictures of diamonds."

"John Doe is a novice buyer who wants a new phone and needs to read all this wonderful content that we've spent months writing and crafting."

This is similar to the Texas Sharpshooter Fallacy (shooting bullets at the side of a barn, then painting the target around them to make it look like the bullet holes hit it).  That's all well and good, until you realize that the real customers who will spend real money purchasing items from our websites have a very real target that's not determined by where we shoot our bullets.  We might know the demographics of our customers, but even that doesn't mean we know what (or how) they think.  We certainly can't imbue our personas with characters and then hold on to them as firmly as we do in the face of actual customer buying data that shows a different picture.  So what do we do?



"When the facts change, I change my mind. What do you do, sir?"
Paul Samuelson, Economist,1915-2009


Friday, 17 May 2024

Multi-Armed Bandit Testing

 I have worked in A/B testing for over 12 years, and blogged about it extensively.  I've covered how to set up a hypothesis, how to test iteratively and even summarized the basics of A/B testing.  I ran my first A/B test on my own website (long since deleted and now only in pieces on a local hard-drive) about 14 years ago.  However, it has taken me this long to actually look into other ways of running online A/B tests apart from the equal 50-50 split that we all know and love.

My recent research led me to discover multi-armed bandit testing, which sounds amazing, confusing and possibly risky (don't bandits wear black eye-masks and operate outside the law??). 

What is multi-armed bandit testing?

The term multi-armed bandit comes from a mathematical problem, which can be phrased like this:

A gambler must choose between multiple slot machines, or "one-armed bandits", each of which has a different, unknown likelihood of winning.  The aim is to find the best or most profitable outcome through a series of choices.  At the beginning of the experiment, when odds and payouts are unknown, the gambler must try each one-armed bandit to measure its payout rate, and then find a strategy to maximize winnings.


Over time, this will mean putting more money into the machine(s) which provide the best return.

Hence, the multiple one-armed bandits make this the “multi-armed bandit problem,” from which we derive multi-armed bandit testing.

The solution - to put more money into the machine which returns the best prizes most often - translates to online testing: the testing platform dynamically changes the allocation of new test visitors towards the recipes which are showing the best performance so far.  Normally, traffic is allocated randomly between the recipes, but with multi-armed bandit testing traffic is skewed towards the winning recipe(s).  Instead of the normal 50-50 split (or 25-25-25-25, or whichever), the traffic split is adjusted on a daily (or even per-visit) basis.

We see two phases of traffic distribution while the test is running:  initially, we have the 'exploration' phase, where the platform tests and learns, measuring which recipe(s) are providing the best performance (insert your KPI here).  After a potential winner becomes apparent, the percentage of traffic to that recipe starts to increase, while the losers see less and less traffic.  Eventually, the winner will see the vast majority of traffic - although the platform will continue to send a very small proportion of traffic to the losers, to continue to validate its measurements, and this is the 'exploitation' phase.
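To make the mechanics concrete, here's a minimal epsilon-greedy sketch - one of the simplest bandit strategies, and only an illustration: commercial platforms typically use more sophisticated approaches such as Thompson sampling.  The conversion rates below are made-up numbers for the simulation.

```python
# A minimal epsilon-greedy bandit: mostly send traffic to the best-performing
# recipe so far (exploitation), but keep a small random slice for exploration.
# The 'true' conversion rates are made up for the simulation - in real life
# they're exactly what the platform is trying to discover.
import random

TRUE_CONVERSION = {"A": 0.040, "B": 0.052, "C": 0.045}
EPSILON = 0.10  # fraction of visitors used for exploration

visits = {recipe: 0 for recipe in TRUE_CONVERSION}
conversions = {recipe: 0 for recipe in TRUE_CONVERSION}


def choose_recipe() -> str:
    # Explore with probability epsilon; otherwise exploit the current leader.
    if random.random() < EPSILON:
        return random.choice(list(TRUE_CONVERSION))
    # Unvisited recipes get priority, which also avoids dividing by zero.
    return max(TRUE_CONVERSION,
               key=lambda r: conversions[r] / visits[r] if visits[r] else float("inf"))


for _ in range(50_000):  # simulated visitors
    recipe = choose_recipe()
    visits[recipe] += 1
    if random.random() < TRUE_CONVERSION[recipe]:  # simulated purchase
        conversions[recipe] += 1

for recipe in sorted(visits):
    print(recipe, visits[recipe], round(conversions[recipe] / visits[recipe], 4))
# Recipe B (the strongest performer) typically ends up with the bulk of the
# traffic, while A and C keep receiving a trickle for ongoing validation.
```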

The graph for the traffic distribution over time may look something like this:


...where Recipe B is the winner.

So, why do a multi-armed bandit test instead of a normal A/B test?

If you need to test, learn and implement in a short period of time, then multi-armed bandit testing may be the way forward.  For example, if marketing want to know which of two or three banners should accompany the current sales campaign (back to school; Labour Day; holiday weekend), you aren't going to have time to run a conventional test, analyze the results and push the winner - the campaign will have ended while you were tinkering with your spreadsheets.  With a multi-armed bandit, the platform identifies the best recipe while the test is running, and promotes it while the campaign is still active.  By the time the campaign ends, you will have maximized your sales performance by showing the winner for most of its duration.

Friday, 30 June 2023

Goals (and why I haven't posted recently)

I'll probably blog sometime soon about goals, objectives, strategies and measures.  They're important in business, and useful to have in life generally.  For now, though, I'll have to explain why I haven't blogged much recently:  I've found a new (old) hobby:  constructing Airfix models.  I started with Airfix models when I was about 10 or 11 years old - old enough to wait patiently for the glue to dry, and careful enough to plan how to construct each model.  I had a second wave of interest in my late teens and early 20s, and a third earlier this year (courtesy of my 11-year-old, now 12-year-old, son).

So this is what's been filling my time - building with my son.

Here's my first solo-ish project for 25 years:


  

The set is the Airfix 25 pdr Field Gun with Quad - one that I bought and built during my time at university.  I enjoyed it then, mostly because of the various figures that come with it.  I've not painted any of my sets in snow camouflage before - and you'll soon see that I'm not a stickler for historical accuracy: I paint what I like!

   

I identified this figure as a troop commander (his flat cap contrasts with the helmets that the rest of the troop are wearing).  The Quad truck has a gap in the roof, and I decided I was going to stand the commander in the vehicle, peering through the roof.  Yes, he's a very easy target standing there like that, but I figure - why not?

  

I drew out the overall scene on a piece of wooden board, sketching the position of the vehicle, trailer and gun, and the key figures.  We also have some injured casualties in our collection, and they featured too.  The trees were obtained cheaply from Amazon, and it shows: they're low-quality and quite small for 1/72 scale.  I used scenic roll (green) and white spray paint (generic matt paint) to create the snow.  The spray paint was cheap and didn't spray evenly, but that worked to my advantage, giving patchy but heavy coverage.




 

 

  
The final diorama included a Metcalfe model pillbox (the square version), some additional bushes (not shown here) and a good complement of trees.  I added some crater marks (but no depth to the scene) to explain the casualties, and then added some medics too (they were trickier).

Next?  A village scene, with a pair of Tigers ploughing through the remains of a continental village (somewhere).  As ever, it's all about the modelling, and has very little to do with historical accuracy!


Sunday, 30 April 2023

Personalization, Segmentation or Targeting

Following all my recent posts on targeting (or personalization), I was discussing website content changes with a colleague.  I was explaining how we could test some form of interactive, real-time changes on our site.  His comments were that this wasn't real 1-to-1 personalization, and that what I was actually doing was just segmentation and content retargeting.  This started me thinking, and so I'd like to share my thoughts on whether 1-to-1 targeting is possible, easy and worth the effort.  Or should we be satisfied with segmentation and retargeting?


1-to-1 targeting requires the ability to show any content to any user.  It probably needs a huge repository of content that can be accessed to show a particular user something that isn't shown to other users, but which is deemed optimal for them.  It also raises some questions:

1. How do you decide which type of user this particular user should be classed as?  
2.  How do you determine which content to show this particular user (or type of user)?
3.  When the targeting doesn't give great results, how can you tell if the problem is with 1. or 2.?

And, as a follow-up question, why is "targeted" content drawn from a library held in higher esteem than retargeting existing content? Is it better because it's so difficult to set up?

Content retargeting - moving existing content on the page - does not require new content, but "isn't real 1-to-1 targeting."  This is true, but I would argue that the difference - mathematically at least - is negligible.  The huge library of targeted content isn't going to be able to match the potential combinations of content that can be achieved just by flipping page content around to promote a particular group of products.

In previous examples on targeting, I've looked at having four product categories that can be targeted.

How many combinations are there for the four products A, B, C, D?

4 * 3 * 2 * 1 = 24

There are four options for the first placement, leaving three for the second placement, two options for the third and only one left for the final place.

This is a relatively simple example - most websites have more than just four products or product categories in their catalogue (even Apple, with its limited product range, has more than four).

Let's jump up to six products:
6 * 5 * 4 * 3 * 2 * 1 = 720.
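If you want to check these numbers (or extend them to more products), the counting takes a couple of lines of Python:

```python
# Counting the orderings available just by resequencing existing content.
import math
from itertools import permutations

print(math.factorial(4))   # 24 orderings of four product categories
print(math.factorial(6))   # 720 orderings of six product categories

# The first few of the 24 orderings of A, B, C, D:
for ordering in list(permutations("ABCD"))[:3]:
    print("".join(ordering))   # ABCD, ABDC, ACBD
```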

At this point, retargeting is going to start scaling far more easily than 1-to-1 personalization. 

Admittedly, it's highly unlikely that all 720 combinations are going to be used and shown with equal probability - we'll probably see 6-10 combinations that are shown most often, as users visit just one or two product categories and identify themselves as menswear, casual clothes, or womenswear customers.  The remaining three or four categories aren't relevant to these customers, and so we don't retarget that content.  I mean: if a user is visiting menswear and men's shoes, then they aren't going to be interested in womenswear and casual clothing, so the sequence of those categories is going to be irrelevant and unchanged.

So, we can group users into one of 720 "segments", not based on how we segment them, but how they segment themselves.  This leads to a pseudo-bespoke browsing experience (it isn't 1-to-1, but the numbers are high enough for it to be indistinguishable) that doesn't require the overhead of a huge library of product content waiting to be accessed.

When does the difference between true personalization and segmented retargeting become indistinguishable?  Are we chasing true 1-to-1 personalization when it isn't even beneficial to the customers' experience?

I would say that it's when the number of combinations of retargeted content becomes so large that users are seeing a targeted experience each time they come to the page.  Or, when the number of combinations is greater than the number of users who visit the page.  Personalization is usually perceived - and presented - as the holy grail of Web experience, but in my view it's unnecessary, unattainable and frequently unlikely to actually get off the drawing board. Why not try something that could give actual results, provide improved customer experience and could be set up this side of Christmas?




Tuesday, 28 March 2023

Non-zero Traffic

Barely a month ago, I wrote about a challenge I was having tracking the traffic on this blog.  I had identified that I wasn't seeing any visitors on mobile phones.  Not one.

I'd taken some steps to fix this, and I had been able to track my own visits from my own phone, but only from organic search.  I was still not tracking actual visitor traffic.

Then I found an article that explains how to connect Blogger to Google Analytics 4 (and this has become justification for me to move to GA4).  

All I had to do was to enter my GA4 ID number into the Blogger setting for tracking... and that was it.  It took me weeks to track down the solution, but since then, I've been tracking mobile traffic:

At the time of writing, I've had the fix live for three weeks, and this is how it's looking compared to the three weeks prior to the fix.

I've carried out a number of checks to make sure the data is valid: 

* are the mobile and desktop numbers different (or am I double-counting users)?
* are the desktop numbers all even (another indication of double-counting)?
* are the mobile numbers cannibalising the desktop numbers, or are they additional?

In all cases, the data look good, showing clear and distinct differences between the mobile and desktop traffic, but I'm still glad I was able to validate the data accuracy. 

Other articles I've written on Website Analytics that you may find relevant:

Web Analytics - Gathering Requirements from Stakeholders

Tuesday, 21 March 2023

Why Personalisation Programs Struggle

So why aren’t we living in a world of perfect personalisation? We've been hearing for a while that it'll be the next big thing, so why isn't it happening?

Because it’s hard.  There's just too much to consider, especially if you're after the ultimate goal of 1-to-1 personalisation.


In my experience, there are three areas where personalisation strategies come completely unstuck.  The first is in the data capture, the second is the classification and design of ‘personas’, and the third is in the visual design.

1. Data capture:  what data can you access?

* Search keywords?
* PPC campaign information?
* Marketing campaign engagement?
* Browsing history?
* Purchase history?
* Can you get geographic or demographic information?
* Surely you can't form a 1-to-1 relationship between each individual user and their experience?
* Previous purchaser?  And are you going to try and sell them another one of what they just bought?
* Traffic source:  search/display/social?
* What products are they looking at?
* What have they added to basket?

2. Classification:  how are you going to decide how to aggregate and categorise all this data?  

Is it a new user?  Return user?

And the biggest crunch:  how are you going to transfer these classifications to your Content Management System, or to your targeting engine, so that it knows which category to place User #12345 into?  And that's just where the fun begins.
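To make that crunch concrete, here's a minimal, rule-based sketch of the kind of classification step involved.  The field names and segment labels are made-up examples, not taken from any particular CMS or targeting platform.

```python
# A rule-based sketch of turning captured visitor data into a segment that a
# CMS or targeting engine could consume.  Field names and segment labels are
# illustrative only.
def classify_visitor(visitor: dict) -> str:
    if not visitor.get("previous_visits", 0):
        return "new_visitor"
    categories = visitor.get("browsed_categories", [])
    purchases = visitor.get("purchase_history", [])
    if "printers" in purchases:
        return "printer_owner"   # a candidate for ink, not for another printer
    if categories:
        # Fall back to the most-browsed category as the segment.
        return f"browser_{max(set(categories), key=categories.count)}"
    return "return_visitor_unclassified"


example = {"previous_visits": 3,
           "browsed_categories": ["menswear", "menswear", "shoes"],
           "purchase_history": []}
print(classify_visitor(example))  # browser_menswear
```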

And how do you choose the right data?  I'm personally becoming bored of seeing recommendations based on items I've bought:  "You bought this printer... how about this printer?" and "You recently purchased a new pair of shoes... would you like to buy a pair of shoes?" As an industry we seem to lack the sophistication that says, "You bought this printer - would you like to buy some ink for it?" or "You bought these shoes, would you like to buy this polish, or these laces?"

3. Visual Design

For each category or persona that you identify, you will need to have a corresponding version of your site.  For example, you’ll need to have a banner that promotes a particular product category (a holiday in France, the Caribbean, the Mediterranean, the USA); or you may need to have links to content about men’s shoes; women’s shoes; slippers or sports shoes. 

And your site merchandising team now needs to multiply its efforts for its campaigns. 

Previously, they needed one banner for the pre-Christmas campaign; now, they need to produce four, five or more instead.  This comes as they approach their busiest period (because that's when you'll get the most traffic and want to maximise its performance) and haven't got time to generate multiple variants of content for a single banner slot.

Fortunately, there are ways of minimizing the headaches that you can encounter when you’re trying to get personalization up and running (or keeping it going).

Why not take the existing content, and show it to users in a different order?  Years ago, there was a mantra (with a meme, probably) going around that told us to 'Remember: There is no fold' but I've never subscribed to that view.  Analytics regularly shows us that most users don't scroll down to see our wonderful content lying just below the edge of their monitor (or their phone screen).  So, if you can identify a customer as someone looking for men's shoes, or women's sports shoes, or a 4x4, or a hatchback, or a plasma TV, then why not show that particular product category first (i.e. above the fold, or at least the first thing below it)?



4. Solutions

The flavour du jour in our house is Airfix modelling - building 1/72 or 1/48 scale vehicles and aircraft, so let's use that as an example, and visit  one of the largest online modelling stores in the UK, Wonderland Models.

Their homepage has a very large leading banner, which rotates like a carousel around five different images: a branding image; radio controlled cars; toy animals and figures; toys and playsets; and plastic model kits.  The opportunity here is to target users (either return visitors, which is easier, or new users, which is trickier) and show them the banner which is most relevant to them. 

The Wonderland Models homepage.  The black line is the fold on my desktop.

How do you select which banner?  By using the data that users are sharing with you - their previous visits, items they've browsed (or added to cart), or what they're looking for in your site search... and so on.  Here, the question of targeted content is simpler - show them the existing banner which most closely matches their needs - but the data is trickier.  However, the banners and categories will help you determine the data categorization that you need; you'll probably find it reflected in your site architecture.

Here's the bonus, though:  once you've classified (or segmented) your user, you can use this content again... lower down on the same page.  Most sites duplicate their links, or have multiple links around similar themes, and Wonderland Models is no exception.  Here, the secondary categories are Radio Control; Models and Kits; Toys and Collectables; Paints, Tools and Materials; Model Railways and Sale.  These overlap with the banner categories, and with a bit of tweaking, the same data source could be used to drive targeting in both sections of the page.

As I covered in a previous blog about targeting the sequence of online banners, the win here is that with six categories (and a large part of the web page being targeted), there are thirty different combinations for just the first two slots: six options for the first position, and five for the second.  This is useful because the page is long and requires considerable scrolling.

The second and third folds on Wonderland Models.  The black lines show the folds.

Most analytics packages have an integration with CMSs or targeting platforms.  Adobe Analytics, for example, pairs with Adobe Target, Adobe's testing and targeting tool.  It's possible to connect the data from Analytics into Target (and I suspect your Adobe support team would be happy to help) and then use this to make an educated guess about which content to show to your visitors.  At the very least, you could run an A/B test.

5. The Challenge

The main reason personalization programs struggle to get going is (and I hate to use this expression, but here goes) that they aren't agile enough.  At a time when ecommerce is starting to adopt the product model and form agile teams, personalization is often stuck in a waterfall approach.  There's no plan to build a minimum viable product and take small steps - instead, it's a wholesale, all-in, build-the-monolith effort, which takes months, then suffers a "funding reprioritization" because the program has nothing to show for its money so far... and this makes it even harder to gain traction (and funding) next time around.

6. The Start

So, don't be afraid to start small.  If you're resequencing the existing content on your home page, and you have three pieces of content, then there are six different ways that the content can be shown.  Without getting into the maths, there's ABC, ACB, BAC, BCA, CAB and CBA.  And you've already created six segments for six personas.  Or at least you've started, and that's what matters.  I've mentioned in a previous article about personalization and sequencing that if you can add more content to your 'content bank', then the number of variations you can show increases dramatically.  So if you can show the value of resequencing what you already have, then you are in a stronger position to ask for additional content.  Engaging with an already-overloaded merchandising team is going to slow you down and frustrate them, so only work with them when you have something up-and-running to demonstrate.
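As a sketch of how small that start can be: map each persona to one of the six orderings, and fall back to the current sequence for anyone you can't classify.  The persona names and content blocks below are made-up examples.

```python
# Mapping each persona to one of the six orderings of three content blocks.
# Persona names and block labels are illustrative, not from a real system.
from itertools import permutations

CONTENT_BLOCKS = ["A", "B", "C"]                 # the three existing pieces of content
ORDERINGS = list(permutations(CONTENT_BLOCKS))   # all six: ABC, ACB, BAC, BCA, CAB, CBA
PERSONA_TO_ORDERING = {f"persona_{i + 1}": order for i, order in enumerate(ORDERINGS)}


def content_for(persona: str) -> list[str]:
    # Unknown or unclassified visitors get the current (default) sequence.
    return list(PERSONA_TO_ORDERING.get(persona, tuple(CONTENT_BLOCKS)))


print(content_for("persona_4"))    # ['B', 'C', 'A']
print(content_for("new_visitor"))  # ['A', 'B', 'C'] - the default
```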

Remember - start small, build up your MVP and only bring in stakeholders when you need to.  If you want to travel far, travel together, but if you want to travel quickly, travel light!





Saturday, 25 February 2023

Zero Traffic

I've mentioned before that it's always a concern when any one of your success metrics is showing as zero.  It suggests that some part of your tracking, calculation or reporting is flawed, and there's no diagnostic information to tell you why.

I have had an ongoing tracking problem with this blog, but hadn't realised until several weeks ago.  I use Google Analytics to track traffic, and the tag is included in one of the right-column elements, after the article list and some of the smaller images.  All good, lots of traffic coming in on a weekly and monthly basis.

Except, as I realised last November, none of my posts from 2021 or 2022 were showing any traffic.  Zero.

I was getting plenty of traffic for my older posts (some of them even rank on the first page of Google for the right search terms) and this was disguising the issue.  Overall traffic was flat year-on-year, despite me keeping up a steady flow of new articles each month.  And then I discovered two gaps in my data:

1. Zero traffic on mobile phones
2.  Zero traffic from social media

The second gap is obvious in retrospect, since I share most of my posts on Facebook, and my friends comment on my shares.

In order to tackle these two issues, I've taken a number of steps (not all of which have helped).

a. I've moved the tag from an element in the right column to the middle column, under the content of the post.  The right column doesn't load on mobile devices, due to the responsive nature of Blogger, so the tag never loaded - and hence I never saw any of my mobile traffic.

This has worked: I've tested the tag on my own mobile phone and I can see my own visit.  Yay! An increase from zero to one is an infinite increase, and it means the tag is working.

b. It turns out that Facebook's own in-app browser doesn't fire the tracking tag.  At all.  I am in the process of adding the Facebook user agent to my code, and in order to do this, have upgraded to Google Tag Manager.  I'm still not seeing Facebook-referred traffic, but it's an improvement.
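For reference, Facebook's in-app browser announces itself with 'FBAN' or 'FBAV' tokens in the user-agent string.  The actual change for this blog lives in the Google Tag Manager setup, but here's a minimal sketch (in Python, purely for illustration) of the kind of check involved:

```python
# Detecting Facebook's in-app browser from a user-agent string.  'FBAN' and
# 'FBAV' are tokens that Facebook adds to its in-app browser's user agent.
def is_facebook_in_app_browser(user_agent: str) -> bool:
    return any(token in user_agent for token in ("FBAN", "FBAV"))


# An illustrative (shortened) user agent from the Facebook iOS app:
example_ua = ("Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
              "AppleWebKit/605.1.15 [FBAN/FBIOS;FBAV/400.0.0.0]")
print(is_facebook_in_app_browser(example_ua))  # True
```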

And I'm looking at moving to a different platform for my blog.  I've had this one for over 10 years, and it pre-dates my Facebook account.  Maybe it's time for a change?

Other articles I've written on Website Analytics that you may find relevant:

Web Analytics - Gathering Requirements from Stakeholders

Monday, 14 November 2022

How many of your tests win?

As November heads towards December, and the end of the calendar year approaches, we start the season of Annual Reviews.  It's time to identify, classify and quantify our successes and failures - sorry, opportunities - from 2022, and to look forward to 2023.  For a testing program, this usually involves the number of tests we've run and how many recipes were involved; how much money we made and how many of our tests were winners.

If I asked you, I don't imagine you'd tell me, but consider it for a moment:  how many of your tests typically win?  How many won this year?  Was it 50%?  Was it 75%?  Was it 90%?  And how does this reflect on your team's performance?

50% or less

It's probably best to frame this as 'avoiding revenue loss'.  Your company tested a new idea, and you prevented them from implementing it, thereby saving your company from losing a (potentially quantifiable) sum of money.  You were, I guess, trying some new ideas, and hopefully pushed the envelope - in the wrong direction, but it was probably worth a try.  Or maybe this shows that your business instincts are usually correct - you're only testing the edge cases.

Around 75%

If 75% of your tests are winning, then you're in a good position and probably able to start picking and choosing the tests that are implemented by your company.  You'll have happy stakeholders who can see the clear incremental revenue that you're providing, and who can see that they're having good ideas.

90% or more

If you're in this apparently enviable position, you are quite probably running tests that you shouldn't be.  You're probably providing an insurance policy for some very solid changes to your website; you're running tests that have such strong analytical support, clear user research or customer feedback behind them that they're just straightforward changes that should be made.  Either that, or your stakeholders are very lucky, or have very good intuition about the website.  No, seriously ;-)

Your win rate will be determined by the level of risk or innovation that your company are prepared to put into their tests.  Are you testing small changes, well-backed by clear analytics?  Should you be?  Or are you testing off-the-wall, game-changing, future-state, cutting edge designs that could revolutionise the online experience? 

I've said before that your test recipes should be significantly different from the current state - different enough to be easy to distinguish from control, and to give you a meaningful delta.  That's not to say that small changes are 'bad', but if you get a winner, it will probably take longer to see it.

Another thought:  the win rate is determined by the quality of the test ideas, and how adventurous they are, and so it's really a measure of the teams who are driving those ideas.  If your testing team generates its own test ideas and has strengths in web analytics and customer experience metrics, then it will probably have a high win rate.  Conversely, if your team is responsible for executing test ideas produced by other teams, then its quality measures should be execution, test timing, and the quantity of tests you run.  You can't attribute the test win rate (high or low) to a team who only develop the tests; for them, the quality of the code is a much better KPI.

What is the optimal test win rate?  I'm not sure that there is one, but it will certainly reflect the character of your test program more than its performance. 

Is there a better metric to look at?   I would suggest "learning rate":  how many of your tests taught you something?  How many of them had a strong, clearly-stated hypothesis that was able to drive your analysis of the test (winner or loser) and lead you to learn something about your website, your visitors, or both?  Did you learn something that you couldn't have identified through web analytics and path analysis?  Or did you just say, "It won", or "It lost", and leave it there?  Was the test recipe so complicated, or did it contain so many changes, that isolating variables and learning something was almost impossible?

Whatever you choose, make sure (as we do with our test analysis) that the metric matches the purpose, because 'what gets measured gets done'.

Similar posts I've written about online testing

Getting an online testing program off the ground
Building Momentum in Online testing
Testing vs Implementing Directly


Friday, 13 May 2022

Website Compromization

Test data, just like any other data, is open to interpretation.  The more KPIs you have, the more the analysis can be pointed towards one winning test recipe or another.  I've discussed this before, and used my long-suffering imaginary car salespeople to show examples of this.

Instead of a clear-cut winner, which is the best in all cases, we often find that we have to select the recipe which is the best for most of the KPIs, or the best for the main KPI, and accept that maybe it's not the best design overall.  Maybe the test recipe could be improved if additional design changes were made - but there isn't time to test these extra changes before the marketing team need to get their new campaign live (or the IT team need to deploy the winner in their next launch).

Do we have enough time to actually identify the optimum design for the site?  Or the page?  Or the element we're testing?  

Anyways - is this science, or is it marketing?  Do we need to make everything on the site perfectly optimized?  Is 'better than control' good enough, or are we aiming for 'even better'?

What do we have?  Is this site optimization, a compromise, or compromization?

Or maybe you have a test result that shows that your users liked a new feature - they clicked on it, they purchased your product.  Does this sound like a success story?  It does, but only until you realise that the new feature you promoted has diverted users' attention away from your most profitable path.  To put it another way, you coded a distraction. 

For example - your new banner promotes new sports laces for your new range of running shoes... so users purchase them but spend less on the actual running shoes.  And the less expensive shoes have a lower margin, so you actually make less profit. Are you trying to sell new laces, or running shoes?

Or you have a new feature that improves the way you sort your search results, with "Featured" or "Recommend" or "Most Relevant" now serving up results that are genuinely what customers want to see.  The problem is, they're the best quality but lowest-priced products in your inventory, so your conversion rate is up by 10% but your average order value is down by 15%.  What do you do?

Are you following customer experience optimization, or compromization?

Sometimes, you'll need to compromise. You may need to sell the new range of shiny accessories with a potential loss of overall profit in order to break into a new market.  You may decide that a new feature should not be launched because although it clearly improves overall customer experience and sales volumes, it would bring down revenue by 5%.  But testing has shown what the cost of the new feature would be (and perhaps a follow-up test with some adjustments would lead to a drop in revenue of only 2%... would you take that?).    In the end, it's going to be a matter of compromization.

Monday, 6 September 2021

It's Not Zero!

 I started this blog many years ago.  It pre-dates at least two of my children, and possibly all three - back in the days when I had time to spare, time to write and time to think of interesting topics to write about.  Nowadays, it's a very different story, and I discovered that my last blog post was back in June.  I used to aim for one blog article per month, so that's two full months with no digital output here (I have another blog and a YouTube channel, and they keep me busy too).

I remember those first few months, though, trying to generate some traffic for the blog (and for another one I've started more recently, and which has seen a traffic jump in the last few days).  

Was my tracking code working?  Was I going to be able to see which pages were getting any traffic, and where they were coming from?  What was the search term (yes, this goes back to those wonderful days when Google would actually tell you your visitors' search keywords)?

I had weeks and weeks of zero traffic, except for me checking my pages.  Then I discovered my first genuine user - who wasn't me - actually visiting my website.  Yes, it was a hard-coded HTML website and I had dutifully copied and pasted my tag code into each page...  did it work?  Yes, and I could prove it:  traffic wasn't zero.

So, if you're at the point (and some people are) of building out a blog, website or other online presence - or if you can remember the days when you did - remember the day that traffic wasn't zero.  We all implemented the tag code, or sent the first marketing email, at some point, and it's always a moment of relief when that traffic starts to appear.

Small beginnings:  this is the session graph for the first ten months of 2010, for this blog.  It's not filtered, and it suggests that I was visiting it occasionally to check that posts had uploaded correctly!  Sometimes, it's okay to celebrate that something isn't zero any more.

And, although you didn't ask, here's the same period January-October 2020, which quietly proves that my traffic increases (through September) when I don't write new articles.  Who knew?








Thursday, 24 June 2021

How long should I run my test for?

A question I've been facing more and more frequently is "How long can you run this test for?", and its close neighbour, "Could you have run it for longer?"

Different testing programs have different requirements:  in fact, different tests have different requirements.  The test flight of the helicopter Ingenuity on Mars lasted 39.1 seconds, straight up and down.  The Wright Brothers' first flight lasted 12 seconds, and covered 120 feet.  Which was the more informative test?  Which should have run longer?

There are various ideas around testing, but the main principle is this:  test for long enough to get enough data to prove or disprove your hypothesis.  If your hypothesis is weak, you may never get enough data.  If you're looking for a straightforward winner/loser, then make sure you understand the concept of confidence and significance.

What is enough data?  It could be 100 orders.  It could be clicks on a banner: the first test recipe to reach 100 clicks - or 1,000, or 10,000 - is the winner (assuming it has a large enough lead over the other recipes).
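To put a number on 'enough', a standard two-proportion approximation gives a rough sample size per recipe.  Here's a minimal sketch, with made-up baseline, uplift and traffic figures:

```python
# Roughly how many visitors per recipe are needed to detect a given uplift in
# conversion rate?  A standard two-proportion approximation; the baseline,
# uplift and traffic figures are all made-up illustrations.
from scipy.stats import norm


def visitors_per_recipe(baseline: float, relative_uplift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    p1 = baseline
    p2 = baseline * (1 + relative_uplift)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # desired statistical power
    n = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
    return int(round(n))


# e.g. a 3% baseline conversion rate and a hoped-for 5% relative uplift:
n = visitors_per_recipe(0.03, 0.05)
print(n)                                                   # visitors needed per recipe
print(f"~{2 * n / 20_000:.0f} days at 20,000 visitors/day on a 50-50 split")
```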

An important limitation to consider is this:  what happens if your test recipe is losing?  Losing money; losing leads; losing quotes; losing video views.  Can you keep running a test just to get enough data to show why it's losing?  Testing suddenly becomes an expensive business, when each extra day is costing you revenue.   One of the key advantages of testing over 'launch it and see' is the ability to switch the test off if it loses; how much of that advantage do you want to give up just to get more data on your test recipe?

Maybe your test recipe started badly.  After all, many do:  the change of experience from the normal site design to your new, all-improved, management-funded, executive-endorsed design is going to come as a shock to your loyal customers, and it's no surprise when your test recipe takes a nose-dive in performance for a few days.  Or weeks.  But how long can you give your design before you have to admit that it's not just the shock of the new (sometimes called 'confidence sickness'), but that there are aspects of the new design that need to be changed before it will reach parity with your current site?  A week?  Two weeks?  A month?  Looking at the data over time will help here.  How was performance in week 1?  Week 2?  Week 3?  It's possible for a test to recover, but if the initial drop was severe, then the overall picture may never recover; however, if you find that the fourth week was actually flat (for new and return visitors), then you've found the point where users have adjusted to your new design.

If, however, the weekly gaps are widening, or staying the same, then it's time to pack up and call it a day.

Let's not forget that you probably have other tests in your pipeline which are waiting for the traffic that you're using on your test.  How long can they wait until launch?

So, how long should you run your test for?  Long enough to get the data you need, and maybe longer if you can - unless it's:
- suffering from confidence sickness (in which case, keep it running)
- losing badly and consistently (stop, unless you're prepared to pay for your test data)
- losing and holding up your testing pipeline (stop, and free up the traffic)

Similar posts I've written about online testing

Getting an online testing program off the ground
Building Momentum in Online testing
How many of your tests win?

Wright Brothers Picture:

"Released to Public: Wilber and Orville Wright with Flyer II at Huffman Prairie, 1904 (NASA GPN-2002-000126)" by pingnews.com is marked with CC PDM 1.0