Tuesday, 19 June 2018

When Should You Switch A Test Off? (Tunisia 1 - England 2)

Another day yields another interesting and data-rich football game from the World Cup.  In this post, I'd like to look at answering the question, "When should I switch a test off?" and use the Tunisia vs England match as the basis for the discussion.

Now, I'll admit I didn't see the whole match (I caught a lot of it on the radio and by following online updates), but even without watching it, it's possible to get a picture of the game from the data, which is very intriguing.  Let's kick off with the usual stats:

The result after 90 minutes was 1-1, but it's clear from the data that this was a very one-sided draw, with England having most of the possession, shots and corners.  It also appears that England squandered their chances: the Tunisian goalkeeper made no saves, but England could only get 44% of their 18 shots on target (which raises the question of what happened to the others; the answer is that they were blocked by defenders).  There were three minutes of stoppage time, and that's when England got their second goal.

[This example also shows the unsuitability of the horizontal bar graph as a way of representing sports data - you can't compare shot accuracy (44% vs 20% doesn't add up to 100%) and when one team has zero (bookings or saves) the bar disappears completely.  I'll fix that next time.]

So, if the game had been stopped at 90 minutes as a 1-1 draw, it's fair to say that the data indicates that England were the better team on the night and unlucky not to win.  They had more possession and did more with it.

Comparison to A/B testing

If this were a test result and your overall KPI was flat (i.e. no winner, as in the football game), then you could look at a range of supporting metrics and determine if one of the test recipes was actually better, or if it was flat.  If you were able to do this while the test was still running, you could also take a decision on whether or not to continue with the test.

For example, if you're testing a landing page, and you determine that overall order conversion and revenue metrics are flat - no improvement for the test recipe - then you could start to look at other metrics to determine if the test recipe really has identical performance to the control recipe.  These could include bounce rate; exit rate; click-through rate; add-to-cart performance and so on.  These kinds of metrics give us an indication of what would happen if we kept the test running, by answering the question: "Given time, are there any data points that would eventually trickle through to actual improvements in financial metrics?"
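To make "look at a range of supporting metrics" a little more concrete, here is a minimal Python sketch of a two-proportion z-test, which is one common way (among several) of checking whether a rate-based metric such as order conversion or click-through differs between the control and test recipes. The visitor, order and click counts are hypothetical, chosen so that orders look flat while click-through does not, and the function name `two_proportion_z_test` is my own rather than part of any testing tool.

```python
from math import erf, sqrt


def two_proportion_z_test(successes_a, visitors_a, successes_b, visitors_b):
    """Compare two rates (e.g. conversion) and return (z, two-sided p-value)."""
    rate_a = successes_a / visitors_a
    rate_b = successes_b / visitors_b
    # Pooled rate under the null hypothesis that both recipes perform the same.
    pooled = (successes_a + successes_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Hypothetical counts: (control successes, control visitors, test successes, test visitors).
metrics = {
    "orders": (300, 10_000, 305, 10_000),
    "click-through": (2_000, 10_000, 2_300, 10_000),
}

for name, counts in metrics.items():
    z, p = two_proportion_z_test(*counts)
    print(f"{name}: z = {z:.2f}, p = {p:.4f}")
```

Bear in mind that the more supporting metrics you check, the more likely it is that one of them looks "significant" purely by chance, so a lone positive micro-metric is a hint rather than a verdict.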

Let's look again at the soccer match for some comparable and relevant data points:

*  Tunisia are winless in their last 12 World Cup matches (D4 L8).  Historic data indicates that they were unlikely to win this match.

*  England had six shots on target in the first half, their most in the opening 45 minutes of a World Cup match since the 1966 semi-final against Portugal.  In this "test", England were trending positively in micro-metrics (shots on target) from the start.

*  Tunisia scored with their only shot on target in this match, their 35th-minute penalty.  Tunisia were not going to score any more goals in this game.

*  England's Kieran Trippier created six goalscoring opportunities tonight, more than any other player has managed so far in the 2018 World Cup.  "Creating goalscoring opportunities" (usually reported as "chances created") is quoted far less often than goals or assists, but it shows a very positive result for England again.

As an interesting comparison, would the Germany versus Mexico game have been different if the referee had allowed more stoppage time?  Recall that Mexico won 1-0 in a very surprising result, but the data shows a much less one-sided game.  While Mexico were dwarfed by Germany, they put up a much better set of stats than Tunisia (compare Mexico's four shots on target with Tunisia's one, which was their penalty).  So Mexico's result, while surprising, does show that they played an attacking game and should have achieved at least a draw, while Tunisia were overwhelmed by England (who, like Germany, should have done even better with their number of shots).

It's true that Germany were dominating the game, but weren't able to get a decent proportion of shots on target (just 33%, compared to 40% for England) and weren't able to fully shut out Mexico and score.  Additionally, the Mexico goalkeeper was having a good game and according to the data was almost unbeatable - this wasn't going to change with a few extra minutes.

Upcoming games which could be very data-rich:  Russia vs Egypt; Portugal vs Morocco.

Monday, 18 June 2018

The Importance of Being Earnest with Your KPIs

It’s World Cup time once again, and a prime opportunity to revisit the importance of having the right KPIs to measure your performance (football team, website, marketing campaign, or whatever).  Take a look at these facts and apparent KPIs, taken from a recent World Cup soccer match, and notice how it’s possible to completely miss what your data is actually telling you.

*  One goalkeeper made nine saves during the match, which is three more than any other goalkeeper in the World Cup so far.

* One team had 26 shots in the game – without scoring – which is the most so far in this World Cup, and equals Portugal in their game against England in 2006.  The other team had just 13 shots in the game, and only four on target.

*  One team had just 33% possession:  they had the ball for only 30 minutes out of the 90-minute game.

* One team had eight corners; the other managed just one.

A graph may help convey some additional data, and give you a clue as to the game (and the result).

If you look closely, you’ll note that the team in green had four shots on target, while the other team’s goalkeeper only managed three saves.

Hence the most important result in the game – the number of goals scored – gets buried (if you’re not careful) and you have to carry out additional analysis to identify that Mexico won 1-0, scoring in the first half and then holding onto their lead with only 33% possession.
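A quick way to see the trap is to ask each metric in turn who "won".  This small Python sketch uses the figures from this post (Germany's 67% possession is inferred from Mexico's 33%), and the helper names are illustrative only.  It also shows the back-of-envelope check that recovers the score, under the simplifying assumption that every shot on target is either saved or scored.

```python
# Figures from the post (Germany 0 - 1 Mexico); Germany's 67% possession is
# inferred from Mexico's 33%.
match = {
    "Germany": {"shots": 26, "possession_pct": 67, "goals": 0},
    "Mexico": {"shots": 13, "possession_pct": 33, "goals": 1},
}


def leader(stats, metric):
    """Return the team with the highest value for the given metric."""
    return max(stats, key=lambda team: stats[team][metric])


def goals_from(shots_on_target, saves_by_opposing_keeper):
    """Back-of-envelope goals, assuming every on-target shot is saved or scored."""
    return shots_on_target - saves_by_opposing_keeper


for metric in ("shots", "possession_pct", "goals"):
    print(f"{metric}: {leader(match, metric)}")

# Mexico: four shots on target, three saves by Germany's goalkeeper.
print("Mexico goals (estimated):", goals_from(4, 3))
```

Only the goals metric, the one that decides the match, points to Mexico.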

Monday, 11 June 2018

Spoiler-Free Review of Jurassic World: The Fallen Kingdom

Jurassic World: The Fallen Kingdom is the latest addition to the Jurassic Park/Jurassic World franchise, and strikes an uneasy balance between retreading old themes and covering new material.  There are the dinosaurs; there are the heroes and the villains; there's even a child cowering and quaking while a dinosaur approaches.  It's all there - if you've seen and enjoyed the previous films, you'll enjoy this one too.

The story moves at a very good pace - yes, there are the slower, plot-development scenes where the villains outline their master plan, and the heroes trade jokes and contemplate the future of dinosaur-kind.  I won't share too much of the plot, but Owen and Claire are persuaded to return to Isla Nublar when it's discovered that the island's volcano is active and all the dinosaurs are going to be killed.  The return to the island is filmed particularly well, as we see a Jurassic World that has fallen into disrepair, death and decay, in stark contrast to the lavish bright colours we saw in the previous film.  The aftermath of the Indominus's rampage is visible everywhere (including in some very neat detail shots).

The visual effects of dinosaurs plus volcano are extremely well executed, and there is the usual quota of running, shouting, chasing, and hiding, all delivered at breakneck speed. In fact, it's so fast that you may miss one or two of the plot developments, but fear not, there's plenty of chance to catch up.  The entire second half of the film takes place off the island - so this is unlike most of the previous films.  Yes, there are comparisons with The Lost World, but this film has a lot more about it than that.

Is the film scary?  Yes.  There are plenty of suspenseful moments... teeth and claws appearing slowly out of the murky darkness; rustling trees getting closer - all that stuff.  This is scarier than the high-speed dinosaur vs human or dinosaur vs dinosaur stuff - and there's plenty of that too.  There are two extended scenes in the second half where one particularly nasty dinosaur starts stalking its human prey, but apart from that there's not much that we haven't seen before.

Is it gory?  No.  Despite a body count that puts it on a par with the other films, there isn't much visible blood - one character has his arm bitten off, and the amount of blood is almost too small to be plausible.  There's at least one death on camera, but it's out-of-focus and in the background.  I took two children - aged seven and nine - with me, and the nine-year-old was upset by some of the tragic scenes, but neither of them was particularly scared.

All-in-all, I liked this film: it is exactly what you would expect, with some interesting twists.  I know it's had mixed reviews, but it does a good job of staying true to its roots while expanding the wider storyline in a number of unexpected ways.  The speed at which the film moves through the plot, with some serious and irreversible actions, means that this is - in my view - more than just another sequel and is not as derivative as some make it seem.

Monday, 14 May 2018

Online Optimisation: Testing Sequences

As your online optimisation program grows and develops, it's likely that you'll progress from changing copy or images or colours, and start testing moving content around on the page - changing the order of the products that you show; moving content from the bottom of the page to the top; testing to see if you achieve greater engagement (more clicks; lower bounce rate; lower exit rate) and make more money (conversion; revenue per visitor).  A logical next step up from 'moving things around' is to test the sequence of elements in a list or on a page.  After all, there's no new content, no real design changes, but there's a lot of potential in changing the sequence of the existing content on the page.
Sequencing tests can look very simple, but there are a number of complexities to think about - and mathematically, the numbers get very large very quickly.  

As an example, here's Ford UK's cars category page, www.ford.co.uk/cars.

[The page scrolls down; I've split it into two halves and shown them side-by-side].

Testing sequences can quickly become a very mathematical process:  if you have just three items in a list, then the number of recipes is six; if you have four items, then there are 24 different sequences (see permutations without repetition).  Clearly, some of these will make no sense (either logically or financially) so you can cut out some of the options, but that's still going to leave you with a large number of potential sequences.  In Ford's example here, with 20 items in the list, there are 2,432,902,008,176,640,000 different options.
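The counts above are simply factorials: n distinct items can be ordered in n! ways.  A minimal Python sketch, using only the standard library (the helper name `count_sequences` is my own), lists the six orderings of three items and reproduces the 20-item figure.

```python
from itertools import permutations
from math import factorial


def count_sequences(n_items):
    """Number of orderings (permutations without repetition) of n distinct items."""
    return factorial(n_items)


for order in permutations(["A", "B", "C"]):
    print(" > ".join(order))  # prints the six orderings of three items

print(count_sequences(3), count_sequences(4), count_sequences(20))
# 6 24 2432902008176640000
```

Pinning one item (say, the best seller) to the top slot only divides the total by 20: 19! is still about 121 quadrillion, which is why choosing a few sensible sorting rules matters more than pruning individual sequences.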

Looking at Ford, there appears to be some form of default sorting, which is generally by price low-to-high (and loosely by size), with a few miscellaneous models tagged onto the end (the Ford GT, for example).  At first glance, there's very little difference between many of the cars - they look very, very similar (there's no sense of scale or of the specific features of each model).

Since there are more than two quintillion ways of sequencing this list, we need to look at some 'normal' approaches, and there are, of course, a number of typical ways of sorting products that customers are likely to gravitate towards - sorting by alphabetical order; sorting by price or perceived value (i.e. start with the lower-quality products and move to luxury quality), and you could also add to that sorting by most popular (drives the most clicks or sales).  Naturally, if your products have another obvious sorting option (such as size, width, length or whatever) then this could also be worth testing.
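Each of those 'normal' approaches is just a different sort key applied to the same list.  A minimal Python sketch, with a hypothetical three-product catalogue (names, prices and sales figures are invented for illustration and are not Ford's data), shows how each candidate recipe falls out of one `sorted` call.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: int  # hypothetical price in GBP
    sales: int  # hypothetical units sold, used as a popularity proxy


# Hypothetical catalogue for illustration only.
catalogue = [
    Product("Model C", 22_000, 900),
    Product("Model A", 15_000, 540),
    Product("Model B", 31_000, 120),
]

# Each candidate recipe is just a different sort key over the same products.
sort_options = {
    "alphabetical": lambda p: p.name,
    "price_low_to_high": lambda p: p.price,
    "most_popular": lambda p: -p.sales,
}


def sequence(products, option):
    """Return product names in the order given by the named sort option."""
    return [p.name for p in sorted(products, key=sort_options[option])]


for option in sort_options:
    print(option, sequence(catalogue, option))
```

Because every recipe is defined by a key function, adding another "obvious" option such as size is a one-line change rather than a hand-built list.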

What are the answers?  As always:  plan your test concept in advance.  Are you going to use 'standard' sorting options, such as size or price, or are you going to do something based on other metrics (such as click-through-rate, revenue or page popularity)?  What are the KPIs you're going to measure?  Are you going for clicks, or revenue?  This may lead to non-standard sequences, where there's no apparent logic to the list you produce.  However, once you've decided, the number of sequences falls from quintillions to a handful, and you can start to choose the main sequences you're going to test.

For Ford, sorting by price low to high (or size large to small) or by popularity (sales) would be natural candidates, and grouping by model size (hatchback, saloon, off-road/SUV, sports) may also work - and that leads on to sub-categorization and taxonomy, which I'll probably cover in an upcoming blog.
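Grouping by model size and then sorting within each group is a two-level sort key.  In this hypothetical Python sketch (car names, body types, prices and the group order are all invented for illustration), the main thing left to test is the order of the groups, which for four groups gives 4! = 24 candidate sequences rather than 20!.

```python
from itertools import groupby

# Hypothetical (name, body_type, price) records, not real Ford data.
cars = [
    ("Car 1", "SUV", 24_000),
    ("Car 2", "hatchback", 16_000),
    ("Car 3", "sports", 40_000),
    ("Car 4", "hatchback", 13_000),
    ("Car 5", "SUV", 20_000),
]

# One assumed group order; the order of the groups is itself the thing to test.
group_order = ["hatchback", "saloon", "SUV", "sports"]


def grouped_sequence(cars, group_order):
    """Sort by group position first, then by price low-to-high within each group."""
    rank = {group: i for i, group in enumerate(group_order)}
    return sorted(cars, key=lambda car: (rank[car[1]], car[2]))


ordered = grouped_sequence(cars, group_order)
for body_type, group in groupby(ordered, key=lambda car: car[1]):
    print(body_type, [name for name, _, _ in group])
```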


Wednesday, 11 April 2018

Chess and Machine Learning?

Machine learning is a new, exciting and growing area of computer science that looks at if and how computers can learn without being explicitly programmed.  Within the last few months, machine learning programs have learned games such as Go and Chess, and become very capable players: Google's AlphaZero beat the well-known Chess engine Stockfish after just 24 hours of learning how to play; and almost a year ago, AlphaGo beat the world's strongest human player, Ke Jie, at Go.

AlphaZero is different from all previous Chess engines, in that it learns by playing.  Having been programmed with only the rules of Chess (aims of the game; how the pieces move), it played millions of games against itself, learning as it went.  The AlphaZero team at Google's DeepMind have published a paper on their research, and it makes for interesting reading.

From a Chess perspective, the data is very interesting as it shows how AlphaZero discovered some key well-known openings (the English; the Sicilian; the Ruy Lopez) and how it used them in games, and then discarded them as it found 'better' alternatives.  Table 2 on page 6 shows how the frequency of each opening varied against training time.  There are some interesting highlights in the data:

The English Opening (1. c4 e5 2. g3 d5 3. cxd5 Nf6 4. Bg2 Nxd5 5. Nf3) was a clear favourite with AlphaZero from very early on, and grew in popularity.

The Queen's Gambit (1. d4 d5 2. c4 c6 3. Nc3 Nf6 4. Nf3 a6 5. g3 c4 6. a4) also became a preferred opening.

Interestingly, the Sicilian Defence (1. e4 c5) was not favoured; instead, AlphaZero's preferred reply to 1. e4 was 1... e5, leading to the Ruy Lopez (1. e4 e5 2. Nf3 Nc6 3. Bb5).

It's worth remembering that AlphaZero deduced these well-known and long-played openings and variations by itself in 24 hours, compared to the decades (and centuries) of human play that have gone into developing these openings.

Apart from the purely academic exercises of building machines that can learn to play games, there are the financially lucrative applications of machine learning: product recommendations.  Amazon and Netflix make extensive use of recommenders, where machines make forecasts about a user, based on users who showed similar behaviour ("people who liked what you like also like this...").  Splitting out and segmenting all users to find users with similar properties is a key part of the machine learning process for this application.
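The "people who liked what you like" idea is usually called user-based collaborative filtering.  Here is a minimal textbook sketch in Python, assuming a tiny hypothetical ratings table and cosine similarity between users (unrated items treated as zero); it is not how Amazon or Netflix actually build their recommenders, which are far more elaborate.  Alice's ratings are closest to Bob's, so Bob's film_d comes out on top for her.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating}) for illustration only.
ratings = {
    "alice": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob": {"film_a": 4, "film_b": 5, "film_d": 4},
    "carol": {"film_c": 5, "film_d": 1, "film_e": 4},
}


def cosine_similarity(u, v):
    """Cosine similarity of two users' rating vectors (unrated items count as 0)."""
    dot = sum(u[item] * v[item] for item in set(u) & set(v))
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=1):
    """Rank items the user hasn't rated by similarity-weighted ratings of others."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("alice", ratings, top_n=2))
```

The scores here are plain similarity-weighted sums; real systems typically normalise them, handle users who rate everything highly, and work with millions of sparse rows.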

In conclusion:  
"It's an exciting time for Machine Learning.  There is ample work to be done at all levels: from the theory end to the framework end, much can be improved.  It's almost as exciting as the creation of the internet."  Ryan Dahl, inventor of Node.js

Monday, 5 March 2018

Why are manhole lids circular?

I remember reading this question - and its answer - in a maths puzzle book in my mid-teens. It's a very simple solution, and very easy to start investigating further.  The short answer: manhole lids are circular so that they don't fall down the hole (risking losing the lid, and landing on a worker who is in the hole).  Technically, a circle has the same width (its diameter) whichever direction you measure it in, so there is no way to turn the lid so that it fits through the slightly smaller hole.

The same cannot be said of most other shapes - let's take some quick examples.
Squares: the sides are shorter than the diagonals, so a small rotation will enable the lid to fall down the hole.
Pentagons: the gap between the narrowest width and the longest diagonal is smaller than for a square, but it's still possible to drop the lid down the hole.
Equilateral triangles are an exception; and in fact you do sometimes see manhole lids that are equilateral triangles (sometimes hinged along one side).

The same principle applies to coins. In order to function correctly, a vending machine has to be able to identify and distinguish different coins based on their diameter, irrespective of how they fall through the slot.  The coins which are not circular are based on Reuleaux polygons, such as the Reuleaux triangle, where the shape has a constant width (the same diameter measured in any direction) - the key requirement for coins, and manhole covers!
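The "constant width" property can be checked numerically: project the shape's boundary onto a direction and measure the spread.  This minimal Python sketch (the helper names are my own) samples the boundary of a Reuleaux triangle and compares it with a unit square, whose width swings between its side and its diagonal, which is exactly why a square lid can be turned and dropped down the hole.

```python
from math import atan2, cos, pi, sin, sqrt


def width(points, theta):
    """Width of a shape in direction theta: the spread of its points' projections."""
    dx, dy = cos(theta), sin(theta)
    projections = [x * dx + y * dy for x, y in points]
    return max(projections) - min(projections)


def reuleaux_triangle(side=1.0, samples_per_arc=200):
    """Sample the boundary of a Reuleaux triangle built on an equilateral triangle."""
    vertices = [(0.0, 0.0), (side, 0.0), (side / 2, side * sqrt(3) / 2)]
    points = []
    for i, (cx, cy) in enumerate(vertices):
        # Each 60-degree arc is centred on one vertex and joins the other two.
        nx, ny = vertices[(i + 1) % 3]
        start = atan2(ny - cy, nx - cx)
        for k in range(samples_per_arc + 1):
            angle = start + (pi / 3) * k / samples_per_arc
            points.append((cx + side * cos(angle), cy + side * sin(angle)))
    return points


angles = [k * pi / 12 for k in range(12)]  # 0 to 165 degrees in 15-degree steps
reuleaux = reuleaux_triangle()
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

print("Reuleaux:", [round(width(reuleaux, a), 3) for a in angles])
print("Square:  ", [round(width(square, a), 3) for a in angles])
```

The Reuleaux triangle's width stays at its side length in every direction (up to sampling error), while the square's ranges from 1 to about 1.414.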

Wednesday, 14 February 2018

Film Review: Star Wars The Last Jedi

I loved it.

My first impression from the first few minutes was that this was a retread of Empire Strikes Back.  The First Order have tracked down the Resistance base on a remote planet, and the Resistance are trying to evacuate before the First Order land troops and... oh, wait a minute, there is no shield, no cannon and the base is going to be obliterated from space.  And things seem to go well for the Resistance, as they are able to stall long enough to get almost everybody safely aboard their cruiser and off to safety.  But not before Poe Dameron (X-Wing ace turned hot-headed, insubordinate, comedian pilot) decides to sacrifice the entire bomber fleet just to destroy a Dreadnought.  Let's hear it for Pyrrhic victories!

Worse still, the First Order have developed a way to track the Resistance through hyperspace: running away is not a way to escape, and hyperspace fuel is in limited supply.

At the end of the previous film, Rey had successfully tracked down Luke Skywalker, and much of this film covers her efforts to persuade him to join the Resistance.  So, we have space battles interspersed with the story of a Jedi master and a young Jedi-wannabe/trainee on a remote, green, damp planet.  Like I said, I kept recalling Empire Strikes Back throughout this film. I haven't looked online to see if anybody has listed all the parallels between The Last Jedi and The Empire Strikes Back, but I saw a few (and I'm only a casual movie-goer).  Luke Skywalker has traded his youthful naivety and enthusiasm for jaded cynicism.  The way he casually lobs his lightsabre over his shoulder is both funny and tragic at the same time.

My only niggle with the film is the amount of time spent on the story with Rey and Luke.  The other storylines were far more exciting and just downright interesting; Luke and Rey - less so.  Luke goes for a walk.  Luke catches a fish.  Luke wanders around his island.  Yawn.

The plot makes a lot of sense, and there's a direct causal link between the Admiral's tight-lipped, need-to-know, authoritarian attitude, its conflict with Poe Dameron's "we have a right to know what's going on", and the subsequent demise of the Resistance fleet.  If she'd told Poe what her plan was, he wouldn't have sent Finn off to find the code breaker, who wouldn't have subsequently told the First Order about the Resistance's plans and their cloaking frequency (or whatever it was).  If they'd all stayed home, sat tight and waited it out, they might all have survived.  I'm not blaming him or her, but it seems like the two characters managed to deliberately out-hard-head each other - each aiming to be the most stubborn character and the one who wins, until neither of them does.

One of my favourite aspects of the film is how the script addresses some of the criticisms that were levelled at the first of the new films (The Force Awakens).

"Finn should have had that fight with Captain Phasma, not with some random stormtrooper with a cool elbow mounted weapon."  Cue large-scale, violent, hand-to-hand fight between Finn and Phasma.

"Snoke is too much like the Emperor and there's no real explanation for him."  Kill him off - now who saw that coming?

"More Poe Dameron!" - definitely fixed in this episode.  He kicks off the action at the start; we see more of his character throughout this film (borderline arrogant, but still funny) and he commits mutiny.  This is not a replacement for Han Solo; this is a whole new character who has his own ideas, opinions and history.

"Do something different!"  - I saw most of the parallels between The Force Awakens and A New Hope.  In fact, it felt like a rehash of the story with new faces. As I mentioned earlier, The Last Jedi has elements of The Empire Strikes Back in it, but those elements have been rearranged to produce a fresh story (and no, I didn't for one second think "It's salt!", I knew full well it was meant to be snow).

All-in-all, I'm excited for the next instalment; I'm looking forward to the Han Solo movie and I feel even more optimistic for the future of the Star Wars saga.