
Wednesday 10 January 2024

Statistics: Type 1 and Type 2 Errors

In statistics (and by extension, in testing), a Type 1 error is a false positive conclusion (we think a test recipe won when it didn't), while a Type 2 error is a false negative conclusion (we think the test recipe made no difference, when in fact it did).

Making a statistical decision always involves uncertainties, because we're sampling instead of looking at the whole population.  This means the risks of making these errors are unavoidable in hypothesis testing - we don't know everything because we can't measure everything.  However, that doesn't mean we don't know anything - it just means we need to understand what we do and don't know.


The probability of making a Type 1 error is the significance level, or alpha (α), while the probability of making a Type 2 error is beta (β).  Incidentally, the statistical power of a test is 1 - β.  I'll be looking at the statistical power of a test in a future blog.
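To make these quantities concrete, here's a minimal Python sketch that estimates the power (1 - β) of an A/B test, assuming a two-sided two-proportion z-test; the conversion rates and sample size are made-up figures for illustration, not numbers from this post.

```python
from scipy.stats import norm

def ab_test_power(p_control, p_variant, n_per_arm, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided two-proportion z-test.

    All inputs here are illustrative assumptions.
    """
    z_crit = norm.ppf(1 - alpha / 2)  # critical z for the chosen alpha
    # Standard error of the difference in conversion rates (unpooled)
    se = (p_control * (1 - p_control) / n_per_arm
          + p_variant * (1 - p_variant) / n_per_arm) ** 0.5
    # How far the true difference sits beyond the decision threshold,
    # in standard-error units (the far tail is negligible and ignored)
    z_effect = abs(p_variant - p_control) / se
    return norm.cdf(z_effect - z_crit)

# e.g. a 5% baseline conversion rate, hoping to detect a lift to 5.5%
print(round(ab_test_power(0.05, 0.055, n_per_arm=20_000), 3))  # ~0.61
```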

These risks can be minimized through careful planning in your test design.

To reduce Type 1 errors - falsely rejecting the null hypothesis and calling a winner when the results were actually flat - it is crucial to choose an appropriate significance level and stick to it. Being cautious when interpreting results, and considering what the findings actually mean, will also help. Different companies use different significance levels for their testing, depending on how cautious or ambitious they want to be with their testing program.  If there are millions of dollars at risk per year, or developing a new site or design will cost months of work, then adopting a higher significance level (90% or above) may be the order of the day.  Conversely, if you're a smaller operator with less traffic, or you're testing a change that can easily be unpicked if things don't go as expected, then a lower significance level (80%, say) may be enough.
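As a sketch of what "choose a significance level and stick to it" looks like in practice, here's a simple pooled two-proportion z-test in Python that only declares a winner when the p-value clears a pre-agreed alpha (0.10 here, i.e. 90% confidence); the visitor and conversion counts are invented for illustration.

```python
from scipy.stats import norm

def decide_winner(conv_a, n_a, conv_b, n_b, alpha=0.10):
    """Pooled two-proportion z-test; call a winner only if p < alpha.

    alpha=0.10 corresponds to the 90% confidence level discussed above;
    the traffic and conversion numbers below are made up.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))   # two-sided p-value
    return p_value, p_value < alpha

p, winner = decide_winner(conv_a=1_000, n_a=20_000,
                          conv_b=1_100, n_b=20_000)
print(f"p = {p:.3f}, winner: {winner}")   # p ~ 0.025, winner: True
```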

It's worth saying at this point that human beings are lousy at understanding and interpreting probabilities - and that's true in general, not just in testing.  Confidence levels and probabilities are related, but they are not directly interchangeable.  The difference in confidence between 90% and 80% is not the same as the difference between 80% and 70%, and it becomes harder and harder to increase a confidence level as you approach 100%.  After all, can you really say something is 100% certain to happen when you've only taken a sample (even if it's a really large sample)?  On the other hand, it's easy - to the point of inevitable - for a small sample to give you a 50% confidence level.  What did you prove?  That a coin is equally likely to give you heads or tails?
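To put some numbers on that non-linearity, here's a quick look (purely illustrative, using the normal approximation that underlies most A/B test calculators) at the critical z-value each two-sided confidence level demands - the steps get bigger the closer you get to 100%.

```python
from scipy.stats import norm

# Two-sided critical z-values: each extra slice of confidence costs more
# than the last, which is why closing in on 100% gets so expensive
for conf in (0.70, 0.80, 0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)
    print(f"{conf:.0%} confidence -> z = {z:.2f}")
# 70% -> 1.04, 80% -> 1.28, 90% -> 1.64, 95% -> 1.96, 99% -> 2.58
```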


Type 2 errors can be minimised by increasing a test's statistical power, and (unsurprisingly) the most direct way to do that is to use a larger sample size.  The sample size determines the degree of sampling error, which in turn sets the ability to detect differences in a hypothesis test.  A larger sample size makes it more likely that a real difference will be captured by the statistical test - in other words, it increases the test's power.
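To show the sample size side of this, here's the standard two-proportion sample size formula sketched in Python; the 5% to 5.5% lift, 95% significance and 80% power are illustrative assumptions, not recommendations.

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.80):
    """Rough visitors needed per arm to detect p_control -> p_variant."""
    z_alpha = norm.ppf(1 - alpha / 2)   # significance requirement
    z_beta = norm.ppf(power)            # power requirement (1 - beta)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    n = (z_alpha + z_beta) ** 2 * variance / (p_variant - p_control) ** 2
    return ceil(n)

print(sample_size_per_arm(0.05, 0.055))   # ~31,000 visitors per arm
```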

Practically speaking, Type 1 and Type 2 errors (false positives and false negatives) are an inherent feature of A/B testing, and the best way to minimize them is to have a pre-agreed minimum sample size and a pre-determined confidence level that everyone (business teams, marketing, the testing team) has signed up to.  Otherwise, there'll be discussions and debates afterwards about what's a winner, what's confident, what's significant and what's actually worth implementing.