
Thursday, 24 June 2021

How long should I run my test for?

A question I've been facing more and more often recently is "How long can you run this test for?", and its close neighbour, "Could you have run it for longer?"

Different testing programs have different requirements:  in fact, different tests have different requirements.  The first test flight of the helicopter Ingenuity on Mars lasted 39.1 seconds, straight up and down.  The Wright Brothers' first flight lasted 12 seconds, and covered 120 feet.  Which was the more informative test?  Which should have run longer?

There are various ideas around testing, but the main principle is this:  test for long enough to get enough data to prove or disprove your hypothesis.  If your hypothesis is weak, you may never get enough data.  If you're looking for a straightforward winner/loser, then make sure you understand the concept of confidence and significance.
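If you're after a straightforward winner/loser call on conversion rate, the standard tool is a two-proportion z-test.  Here's a minimal sketch in Python, using only the standard library; the visitor and order counts are invented, purely to show the mechanics:

from math import sqrt, erf

def two_proportion_z_test(orders_a, visitors_a, orders_b, visitors_b):
    # z-score and two-sided p-value for the difference in conversion rate
    # between recipe A (control) and recipe B (test)
    p_a = orders_a / visitors_a
    p_b = orders_b / visitors_b
    p_pool = (orders_a + orders_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return z, p_value

# Example: 10,000 visitors per recipe; 500 orders on control, 580 on the test recipe.
z, p = two_proportion_z_test(500, 10_000, 580, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")

With these made-up numbers the result is roughly z = 2.5, p = 0.01 - comfortably past the usual 95% confidence bar.  With 500 orders against 560, it wouldn't be.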

What is enough data?  It could be 100 orders.  It could be clicks on a banner: the first test recipe to reach 100 clicks - or 1,000, or 10,000 - is the winner (assuming it has a large enough lead over the other recipes).
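Here's a minimal sketch of that 'first to N clicks, with a clear lead' rule, again in Python.  The 1,000-click target and the 10% lead margin are assumptions chosen for illustration, not recommendations:

def pick_winner(clicks, target=1_000, min_lead=1.10):
    # clicks: dict of recipe name -> click count so far.
    # Declare a winner once a recipe reaches the target AND leads the
    # runner-up by at least min_lead (1.10 = a 10% lead); otherwise keep running.
    ranked = sorted(clicks.items(), key=lambda kv: kv[1], reverse=True)
    leader, runner_up = ranked[0], ranked[1]
    if leader[1] >= target and leader[1] >= min_lead * runner_up[1]:
        return leader[0]
    return None

print(pick_winner({"control": 940, "recipe_b": 1_050}))   # recipe_b: target reached, clear lead
print(pick_winner({"control": 990, "recipe_b": 1_020}))   # None: target reached, but the lead is too small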

An important limitation to consider is this:  what happens if your test recipe is losing?  Losing money; losing leads; losing quotes; losing video views.  Can you keep running a test just to get enough data to show why it's losing?  Testing suddenly becomes an expensive business when each extra day is costing you revenue.  One of the key advantages of testing over 'launch it and see' is the ability to switch the test off if it loses; how much of that advantage do you want to give up just to get more data on your test recipe?
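It's worth putting a rough number on that cost.  A back-of-envelope sketch, with every figure below invented for illustration:

daily_visitors  = 20_000   # visitors entering the test each day
test_share      = 0.50     # fraction of traffic sent to the test recipe
conv_control    = 0.050    # control conversion rate
conv_test       = 0.045    # test recipe conversion rate (losing by 10% relative)
avg_order_value = 80.00    # revenue per order

lost_orders  = daily_visitors * test_share * (conv_control - conv_test)
lost_revenue = lost_orders * avg_order_value
print(f"~{lost_orders:.0f} orders, ~£{lost_revenue:,.0f} lost per day")

On these figures the test recipe costs around 50 orders and £4,000 a day, so 'one more week of data' is a £28,000 decision.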

Maybe your test recipe started badly.  After all, many do:  the change of experience from the normal site design to your new, all-improved, management-funded, executive-endorsed design is going to come as a shock to your loyal customers, and it's no surprise when your test recipe takes a nose-dive in performance for a few days.  Or weeks.  But how long can you give your design before you have to admit that it's not just the shock of the new design (sometimes called 'confidence sickness'), but that there are aspects of the new design that need to be changed before it will reach parity with your current site?  A week?  Two weeks?  A month?  Looking at the data over time will help here.  How was performance in week 1?  Week 2?  Week 3?  It's possible for a test to recover, although if the initial drop was severe then the overall picture may never recover.  But if you find that the fourth week was actually flat (for new and return visitors alike), then you've found the point where users have adjusted to your new design.

If, however, the weekly gaps are widening, or staying the same, then it's time to pack up and call it a day.
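A simple week-by-week view makes that call much easier than staring at the cumulative totals.  Here's a minimal sketch in Python, with the weekly conversion rates invented purely for illustration:

weeks = [
    # (week, control conversion rate, test recipe conversion rate)
    (1, 0.050, 0.042),
    (2, 0.051, 0.046),
    (3, 0.049, 0.047),
    (4, 0.050, 0.050),
]

for week, control, test in weeks:
    lift = (test - control) / control * 100
    print(f"Week {week}: lift {lift:+.1f}%")

Here the gap narrows each week and closes by week 4 - the pattern you're hoping to see.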

Let's not forget that you probably have other tests in your pipeline which are waiting for the traffic that you're using on your test.  How long can they wait until launch?

So, how long should you run your test for?  As long as it takes to get the data you need, and maybe longer if you can, unless it's:
- suffering from confidence sickness (in which case, keep it running)
- losing badly, and consistently (unless you're prepared to pay for your test data)
- losing and holding up your testing pipeline

Wright Brothers Picture:

"Released to Public: Wilber and Orville Wright with Flyer II at Huffman Prairie, 1904 (NASA GPN-2002-000126)" by pingnews.com is marked with CC PDM 1.0