
Wednesday 10 July 2024

How not to Segment Test Data

 Segmenting Test Data Intelligently

Sometimes, a simple 'did it win?' will provide your testing stakeholders with the answer they need. Yes, conversion was up by 5% and we sold more products than usual, so the test recipe was clearly the winner.  However, I have noticed that this simple summary is rarely enough to draw a test analysis to a close.  There are questions about 'did more people click on the new feature?' and 'did we see better performance from people who saw the new banner?'.  There are questions about pathing ('why did more people go to the search bar instead of going to checkout?') and there are questions about these users.  Then we can also provide all the in-built data segments from the testing tool itself.  Whichever tool you use, I am confident it will have new vs return users; users by geographic region; users by traffic source; by landing page; by search term... any way of segmenting your normal website traffic data can be unleashed onto your test data and fill up those slides with pie charts and tables.

After all, segmentation is key, right?  All those out-of-the-box segments are there in the tool because they're useful and can provide insight.

Well, I would argue that while they can provide more analysis, I'm not sure about more insights (as I wrote several years ago).  And I strongly suspect that the out-of-the-box segments are there because they were easy to define and apply back when website analytics was new.  Nowadays, they're there because they've always been there,  and because managers who were there at the dawn of the World Wide Web have come to know and love them (even if they're useless.  The metrics, not the managers).

Does it really help to know that users who came to your site from Bing performed better in Recipe B versus Recipe A?  Well, it might - if the traffic profile during the test run was typical for your site.  If it was, then go ahead and target Recipe B for users who came from Bing.  And please ask your data why the traffic from Bing so clearly preferred Recipe B (don't just leave it at that).

Visitors from Bing performed better in Recipe B?  So what?

Is it useful to know that return users performed better in Recipe C compared to Recipe A?

Not if most of your users make a purchase on their first visit:  they browse the comparison sites, the expert review sites and they even look on eBay, and then they come to your site and buy on their first visit.  So what if Recipe C was better for return users?  Most of your users purchase on their first visit, and what you're seeing is a long-tail effect with a law of diminishing returns.  And don't let the argument that 'All new users become return users eventually' sway you.  Some new users just don't come back - they give up and don't try again.  In a competitive marketplace where speed, efficiency and ease-of-use are now basic requirements instead of luxuries, if your site doesn't work on the first visit, then very few users will come back - they'll find somewhere easier instead.  

And, and, and:  if return users perform better, then why?  Is it because they've had to adjust to your new and unwieldy design?  Did they give up on their first visit, but decide to persevere with it and come back for more punishment because the offer was better and worth the extra effort?  This is hardly a compelling argument for implementing Recipe C.  (Alternatively, if you operate a subscription model, and your whole website is designed and built for regular return visitors, you might be on to something).  It depends on the size of the segments.  If a tiny fraction of your traffic performed better, then that's not really helpful.  If a large section of your traffic - a consistent, steady source of traffic - performed better, then that's worth looking at.
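To put a number on "it depends on the size of the segments", here is a rough sketch in Python.  The visitor and order counts are invented purely for illustration, and the function name is my own; the idea is simply to show each segment's share of traffic alongside its lift, so that a large percentage uplift in a small segment can be weighed against the number of extra orders it actually represents.

```python
# Hypothetical results: (visitors, orders) per segment and recipe.
# These numbers are made up for illustration only.
results = {
    "new":    {"A": (9000, 450), "C": (9000, 452)},
    "return": {"A": (1000, 40),  "C": (1000, 50)},
}

total_visitors = sum(v for seg in results.values() for v, _ in seg.values())

def summarise(segment):
    """Return (traffic share, relative lift of C over A, extra orders in C)."""
    (va, oa), (vc, oc) = results[segment]["A"], results[segment]["C"]
    share = (va + vc) / total_visitors
    rate_a, rate_c = oa / va, oc / vc
    relative_lift = (rate_c - rate_a) / rate_a
    extra_orders = (rate_c - rate_a) * vc
    return share, relative_lift, extra_orders

for name in results:
    share, lift, extra = summarise(name)
    print(f"{name}: {share:.0%} of traffic, lift {lift:+.1%}, extra orders {extra:.1f}")
```

In this made-up data, return users show a 25% relative lift, but they are only a tenth of the traffic and the lift amounts to about ten extra orders.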

So - how do we segment the data intelligently?

It comes back to those questions that our stakeholders ask us: "How many people clicked?" and "What happened to the people who clicked, and those who didn't?"  These are the questions that are rarely answered with out-of-the-box segments.  "Show me what happened to the people who clicked and those who didn't" leads to answers like, "We should make this feature more visible because people who clicked it converted at a 5% higher rate." You might get the answer that, "This feature gained a very high click rate, but made no impact [or had a negative effect] on conversion." This isn't a feature: it's a distraction, or worse, a roadblock.

The best result is, "People who clicked on this feature spent 10% more than those who didn't."
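As a minimal sketch of the "clicked versus didn't click" comparison, assuming you can export one row per visitor from your testing tool: the field layout and the sample rows below are hypothetical, not taken from any particular tool.

```python
from collections import defaultdict

# Hypothetical visitor-level export: (recipe, clicked_feature, converted, revenue)
visitors = [
    ("B", True, True, 55.0),
    ("B", True, False, 0.0),
    ("B", False, True, 50.0),
    ("B", False, False, 0.0),
    ("B", False, False, 0.0),
    ("B", True, True, 66.0),
]

def segment_summary(rows):
    """Conversion rate and revenue per visitor for clickers and non-clickers."""
    totals = defaultdict(lambda: {"visitors": 0, "orders": 0, "revenue": 0.0})
    for _recipe, clicked, converted, revenue in rows:
        seg = totals["clicked" if clicked else "did not click"]
        seg["visitors"] += 1
        seg["orders"] += int(converted)
        seg["revenue"] += revenue
    return {
        name: {
            "visitors": t["visitors"],
            "conversion_rate": t["orders"] / t["visitors"],
            "revenue_per_visitor": t["revenue"] / t["visitors"],
        }
        for name, t in totals.items()
    }

summary = segment_summary(visitors)
for name, stats in summary.items():
    print(name, stats)
```

The same grouping works for any behaviour you can flag per visitor, which is what makes it more flexible than the built-in segments.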

And - this is more challenging but also more insightful - what about people who SAW the new feature, but didn't click?  We get so hung up on measuring clicks (because clicks are the currency of online commerce) that we forget that people don't read with their mouse button.  Just because somebody didn't click on the message doesn't mean they didn't see it: they saw it and thought, "Not interesting," "not relevant" or "Okay, that's good to know but I don't need to learn more".  The message that says, "10% off with coupon code SAVETEN - Click here for more" doesn't NEED to be clicked.  And ask yourself "Why?" - why are they clicking, why aren't they?  Does your message convey sufficient information without further clicking, or is it just a headline that introduces further important content?  People will rarely click Terms and Conditions links, after all, but they will have seen the link.


So we're going to need to have a better understanding of impressions (views) - and not just at a page level, but at an element level.  Yes, we all love to have our messages, features and widgets at the top of the page, in what my high school Maths teacher called "Flashing Red Ink".  However, we also have to understand that it may have to be below the fold, and there, we will need to get a better measure of how many people actually scrolled far enough to see the message - and then determine performance for those people.  Fortunately, there's an abundance of tools that do this; unfortunately, we may have to do some extra work to get our numerators and denominators to align.  Clicks may be currency, but they don't pay the bills.
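Here is one way the numerators and denominators might be lined up, as a sketch only.  It assumes each visitor record carries a hypothetical "saw" flag (set by whatever element-visibility event your tool provides) as well as "clicked" and "converted"; the sample records are invented.

```python
# Hypothetical per-visitor flags. "saw" is assumed to come from an
# element-level visibility event, e.g. fired when the message scrolls into view.
visitors = [
    {"saw": True,  "clicked": True,  "converted": True},
    {"saw": True,  "clicked": False, "converted": True},
    {"saw": True,  "clicked": False, "converted": False},
    {"saw": False, "clicked": False, "converted": False},
    {"saw": False, "clicked": False, "converted": True},
    {"saw": True,  "clicked": False, "converted": True},
]

def exposure_segment(v):
    """Place a visitor in exactly one exposure segment."""
    if v["clicked"]:
        return "saw and clicked"
    if v["saw"]:
        return "saw, did not click"
    return "never saw"

def rates_by_exposure(rows):
    """Return {segment: (visitors, conversion rate)}."""
    counts = {}
    for v in rows:
        seg = exposure_segment(v)
        n, c = counts.get(seg, (0, 0))
        counts[seg] = (n + 1, c + int(v["converted"]))
    return {seg: (n, c / n) for seg, (n, c) in counts.items()}

rates = rates_by_exposure(visitors)

# Click rate measured against people who actually saw the element,
# not against everyone who loaded the page.
click_rate_of_viewers = (
    sum(v["clicked"] for v in visitors) / sum(v["saw"] for v in visitors)
)
```

The point of the last calculation is the denominator: clicks divided by viewers of the element, rather than clicks divided by page views.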

So:  segmentation - yes.  Lazy segmentation - no.


Friday 5 July 2024

What is the shortest distance from a point to a line? A spreadsheet solution

Once or twice a year, if I'm lucky, the Red Arrows will fly near my house.  They'll be flying en route to or from an airshow, or heading back to their home airbase.  I check their flightpaths on various websites (Military Airshows is my favourite, since it provides maps of the flightpaths) and then see if they'll be anywhere near me.  

Then comes the question - where's the best place to go and see them fly over?  Ignoring the lie of the land (I live near the top of a hill, with valleys and hills on almost all sides), where is the point on the flightpath that is the shortest distance from my house?



And, being a maths student, I generalised:  what's the shortest distance between a point and a line?

To start with, we need to understand that the shortest distance from a point to a line is the length of the perpendicular drawn from the point to the line.  In the diagrams below, A represents the point (my house), and the flight path goes from B to C.  D is the point at which the line is closest to point A.  The angle is 90 degrees, and a circle centred on A would form a tangent to the line BC at point D.






Fortunately, the waypoints for the Red Arrows flights are given as longitudes and latitudes, and I know the same for my own home.  But let's simplify to x and y co-ordinates.  We can transfer the flightpath to a straight line of the form y=mx+c, and start with some simple numbers.  For example, let's take point B above as the point (0,1) and the point C as (4,3).  Point A (my house) is (1,3).  Point D is not necessarily the midpoint of B and C.
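For anyone who wants to go from real longitudes and latitudes to x and y, one common shortcut over short distances is the equirectangular approximation.  This is my own assumption about how the conversion could be done, not something the rest of this post depends on, and the function name and the choice of kilometres are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def to_local_xy(lat, lon, origin_lat, origin_lon):
    """Approximate east (x) and north (y) offsets in km from an origin.

    Uses the equirectangular approximation, which is reasonable over
    tens of kilometres but is not an exact geodesic calculation.
    """
    x = (math.radians(lon - origin_lon)
         * math.cos(math.radians(origin_lat))
         * EARTH_RADIUS_KM)
    y = math.radians(lat - origin_lat) * EARTH_RADIUS_KM
    return x, y
```

Taking the house as the origin puts point A at (0, 0), and each waypoint then becomes an ordinary (x, y) pair.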

We know (and this is maths I'm going to use without proving) that if the line BC has the slope m (in the form y = mx+c), then the slope of the line AD is -1/m because the lines are perpendicular.

The strategy breaks down into four steps:

1. Determine the equation of the line BC in the form y= mx + c by first determining m and then c.
2. Determine the equation of the line AD, also in the form y = mx + c.  We will know m for this line, and can use this and the values of x,y for A to determine c.
3.  Equate the expressions for y in 1. and 2. as simultaneous equations, to get the x,y values for point D.
4.  Use Pythagoras to determine the distance AD.


1.  Determine the equation for the line BC.

y = mx + c where m = (y2 - y1)/(x2 - x1).

In our example, m = (3-1)/(4-0) = 2/4 = 0.5
Substituting the coordinates of point B into this equation will give us the value of c.  B = (0,1), so if y = 0.5x + c then c = y - 0.5x = 1 - 0 = 1.

So the line has the formula y = 0.5 x + 1.
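Step 1 can be checked with a few lines of Python.  This is just a sketch; the function name is my own, and it deliberately refuses vertical lines, where the slope is undefined and y = mx + c doesn't apply.

```python
def line_through(p, q):
    """Return (m, c) for the line y = m*x + c through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)
    c = y1 - m * x1  # rearranged from y = m*x + c
    return m, c

# B = (0, 1) and C = (4, 3) from the worked example
m_bc, c_bc = line_through((0, 1), (4, 3))
```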

2. Determine the equation for the line AD.

Since AD is perpendicular to BC, we know m = -1/0.5 = -2.

We have point A on this line, so we know we have (1,3)
y = mx + c
3 = (-2 * 1) + c
5 = c

And hence the formula for the 'radius' from A to D is y = -2x + 5
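The same step as a Python sketch, again with a function name of my own choosing.  It assumes the original line is not horizontal, since a slope of zero has no negative reciprocal.

```python
def perpendicular_through(m, point):
    """Return (m, c) for the line perpendicular to slope m through point."""
    if m == 0:
        raise ValueError("horizontal line: the perpendicular is vertical")
    x, y = point
    m_perp = -1 / m
    c_perp = y - m_perp * x  # rearranged from y = m*x + c
    return m_perp, c_perp

# Slope of BC is 0.5, and A = (1, 3)
m_ad, c_ad = perpendicular_through(0.5, (1, 3))
```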

3.  Equate the two lines, and solve the simultaneous equation to find the nearest point

The lines are:
y = 0.5x + 1 (the flightpath)
y = -2x + 5 (the path from my house to the flightpath's nearest point)

0.5x + 1 = -2x + 5
2.5x = 4
x = 1.6

And by substitution, y = 0.5x  + 1  so y = 1.8

So the nearest point on the line to A, point D, is (1.6, 1.8)
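Solving the simultaneous equations in general means setting m1*x + c1 = m2*x + c2 and rearranging for x.  A small Python sketch of that, with an illustrative function name:

```python
def intersection(m1, c1, m2, c2):
    """Return the point where y = m1*x + c1 meets y = m2*x + c2."""
    if m1 == m2:
        raise ValueError("parallel lines never intersect")
    x = (c2 - c1) / (m1 - m2)
    y = m1 * x + c1
    return x, y

# Flightpath y = 0.5x + 1 and perpendicular y = -2x + 5
d_x, d_y = intersection(0.5, 1, -2, 5)
```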

4.  Use Pythagoras to determine the straight-line distance from A to D

A = (1,3)
D = (1.6,1.8)

Distance = SQRT((1.6-1)^2 + (1.8-3)^2)
Distance = 1.34
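As a quick check of the arithmetic, here is the Pythagoras step in Python, using the standard library's math.hypot (the wrapper function name is mine):

```python
import math

def distance(p, q):
    """Straight-line distance between points p and q."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

# A = (1, 3) and D = (1.6, 1.8)
ad = distance((1, 3), (1.6, 1.8))
print(round(ad, 2))
```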

And next, to replicate this in a spreadsheet.  All that this requires is to translate our step-by-step thinking into spreadsheet formulae:


The key steps here are in finding the slope of BC, and then using the negative reciprocal to find the slope of AD.  The two constants, c, are found by substitution (i.e. rearrange y = mx + c with known y, m and x to determine c).
Then, for clarity, spell out the formulae of the two lines, and use the values of m and c to determine the co-ordinates of point D - first x, then y.

Then use Pythagoras to determine the distance from A to D.

No, it's not entirely efficient, or tidy, but a spreadsheet like this shows the entire process from end to end (and makes you think about how you actually do geometry and algebra, instead of just punching in numbers).
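For comparison, here is a more compact sketch in Python.  It is not the y = mx + c method used above: it swaps in a vector projection of A onto BC, which gives the same answer for the worked example and also copes with vertical and horizontal flightpaths.  The function name is my own, and limiting D to the stretch between B and C is an assumption on my part, on the basis that the flight only covers that leg.

```python
import math

def closest_point_on_segment(a, b, c):
    """Return (D, distance) where D is the point on segment BC nearest to A."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    dx, dy = cx - bx, cy - by
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        # B and C are the same point, so that point is the nearest one
        return b, math.hypot(ax - bx, ay - by)
    # Fraction of the way from B to C at which the perpendicular from A lands
    t = ((ax - bx) * dx + (ay - by) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # keep D between B and C
    d = (bx + t * dx, by + t * dy)
    return d, math.hypot(ax - d[0], ay - d[1])

# A = (1, 3), B = (0, 1), C = (4, 3) from the worked example
point_d, dist = closest_point_on_segment((1, 3), (0, 1), (4, 3))
```

Removing the line that clamps t gives the nearest point on the infinitely extended line instead of the segment.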