A/B testing is great, and very easy to do these days. Tools are getting better and better, so people rely on them more and more, and critical thinking becomes much less common.
It’s not fair to just blame the tools of course. It’s very human to try to (over)simplify everything. Now the internet is flooded with A/B testing posts and case studies full of bullshit data, imaginary wins. Be wary when you read any testing case study, or whenever you hear someone say “we tested that”.
We’re all learning about A/B testing. It’s like anything else – the more you do it, the better you get at it. So it’s only natural that every optimizer (including myself) has made a ton of testing mistakes in the past. Some mistakes are more common than others, but there’s one that is the most prevalent: ending the test too soon.
Don’t stop the test just when you reach 95% confidence (or higher)
This is the first rule, and very important. It’s human to scream “yeah!”, stop the test, and roll the treatment out live. Many who do so discover later (if they bother to check) that even though their test showed something like a +20% uplift, it didn’t have any impact on the business. Because there was no actual lift – it was imaginary.
Consider this: One thousand A/A tests (two identical pages tested against each other) were run.
- 771 experiments out of 1,000 reached 90% significance at some point
- 531 experiments out of 1,000 reached 95% significance at some point
Quote from the experimenter:
This means if you’ve run 1.000 experiments and didn’t control for repeat testing error in any way, a rate of successful positive experiments up to 25% might be explained by a false positive rate. But you’ll see a temporary significant effect in around half of your experiments!
So if you stop your test as soon as you see significance, you would call a winner in roughly half of the tests where there is no real difference at all. A coin toss. That totally kills the idea of testing in the first place.
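Once he altered the experiment to determine the needed sample size in advance, and to evaluate significance only once that sample was reached, only 51 experiments out of 1,000 were significant at 95%. That is about the 5% false positive rate you would expect at that confidence level. So by fixing the sample size up front we went from 531 winning tests to 51 winning tests.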
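To see the effect for yourself, here is a minimal simulation sketch. It is not the experimenter's original code: the conversion rate (10%), the number of visitors per arm (2,000), the peeking interval (every 50 visitors) and the plain two-proportion z-test are all assumptions picked to keep the example small, so the exact counts will differ from the figures quoted above.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using the pooled conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation observed yet, nothing to test
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se


def simulate_aa(n_experiments=200, n_visitors=2000, check_every=50,
                rate=0.10, z_crit=1.96, seed=1):
    """Run A/A tests; count 'wins' when peeking vs. testing once at the end."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        ever_significant = False
        for visitor in range(1, n_visitors + 1):
            # Both arms share the same true rate, so any "win" is pure noise.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if visitor % check_every == 0:
                if abs(z_score(conv_a, visitor, conv_b, visitor)) > z_crit:
                    ever_significant = True
        peeking_hits += ever_significant
        # Fixed-sample rule: look once, after the planned sample is collected.
        fixed_hits += abs(z_score(conv_a, n_visitors, conv_b, n_visitors)) > z_crit
    return peeking_hits, fixed_hits


peeking, fixed = simulate_aa()
print(f"significant at some peek:       {peeking} of 200")
print(f"significant at the planned end: {fixed} of 200")
```

The only difference between the two counts is when you allow yourself to look. Same data, same test, very different number of "winners".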
You can run this experiment yourself here.
How to pre-determine the needed sample size?
There are many great tools out there for that, like this one. Or here’s how you would do it with Evan Miller’s tool:
In this case we told the tool that we have a 3% conversion rate, and want to detect at least 10% uplift. The tool tells us that we need 51,486 visitors per variation before we can look at the statistical significance levels and statistical power.
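If you prefer to script this, here is a rough sketch using a common textbook formula for comparing two proportions. The 5% significance level and 80% power are my assumptions here, and `sample_size_per_variation` is a made-up helper name. Calculators differ in the exact variance formula they use, so expect a number in the same ballpark as the figure above rather than an identical one.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# 3% baseline conversion rate, smallest lift worth detecting: +10% relative
n = sample_size_per_variation(baseline=0.03, relative_lift=0.10)
print(n)
```

Notice how quickly the requirement grows as the lift you want to detect shrinks: halving the detectable lift roughly quadruples the sample you need.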
Magic numbers don’t exist
What about the rules like X amount of conversions per variation?
Even though you might come across statements like “you need 100 conversions per variation to end the test” – there is no magical traffic or conversion number. It’s slightly more complex than that.
Andrew Anderson, Head of Optimization at Malwarebytes
It is never about how many conversions, it is about having enough data to validate based on representative samples and representative behavior.
100 conversions is possible in only the most remote cases and with an incredibly high delta in behavior, but only if other requirements like behavior over time, consistency, and normal distribution take place. Even then it has a really high chance of a type I error, false positive.
Anytime you see X number of conversions it is a pretty glaring sign that the person talking doesn’t understand the statistics at all.
And – if 100 conversions was the magic number, then big sites could end their tests just in minutes! That’s silly.
If you have a site that does 100,000 transactions per day, then 100 conversions can’t possibly be representative of overall traffic.
So this leads to the next thing you need to take into account – the representativeness of your sample.
How representative is the traffic in the test?
By running tests you include a sample of visitors in an experiment. You need to make sure that the sample is representative of your overall, regular traffic, so that the sample behaves just as your real buyers behave.
Some want to suddenly increase the sample size by sending a bunch of atypical traffic to the experiment. If your traffic is low, should you blast your email list, or temporarily buy traffic to get a large enough sample size for the test?
In most cases you’d be falling victim to the selection effect – you wrongly assume some portion of the traffic represents the totality of the traffic. You might increase conversion for that segment, but don’t confuse that with an increase across segments.
Your test should run for 1 or, better yet, 2 business cycles, so it includes everything that is going on:
- every day of the week (and tested one week at a time as your daily traffic can vary a lot),
- various different traffic sources (unless you want to personalize the experience for a dedicated source),
- your blog post and newsletter publishing schedule,
- people who visited your site, thought about it, and then came back 10 days later to buy it,
- any external event that might affect purchasing (e.g. pay day)
and so on.
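One way to turn this into a plan is to take the sample size you need, work out how many days of traffic that is, and then round up to full weeks so every weekday is covered equally. A small sketch, assuming traffic is split evenly across variations; the function name and the two-week minimum are my own illustrative choices, so adjust them to your business cycle.

```python
import math


def planned_test_days(required_per_variation, n_variations, daily_visitors, min_weeks=2):
    """Days to run a test: enough traffic for the sample, rounded up to full weeks."""
    total_needed = required_per_variation * n_variations
    days_for_sample = math.ceil(total_needed / daily_visitors)
    # Never stop mid-week, and never run shorter than the minimum number of weeks.
    weeks = max(min_weeks, math.ceil(days_for_sample / 7))
    return weeks * 7


print(planned_test_days(51_486, 2, 4_000))  # 28
print(planned_test_days(1_000, 2, 4_000))   # 14
```

In the second example the sample would be collected in a single day, but the test still runs two full weeks, because one day of traffic is not a representative sample.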
Lukas Vermeer, Data Scientist at Booking.com
What matters much, much more than the exact number of visitors in your experiment is the representativeness of the sample, the size of the effect and your initial test intent.
If your sample is not a good representation of your overall traffic, then your results are not either. If your effect size is very large, then you need only a few visitors to detect. If you intended to run your test for a month, and you ran it for a month, and the difference is significant, then it’s frikkin’ significant.
Don’t waste your time looking for magic numbers: this is Science, not magic.
Be wary of statistical significance numbers (even if it’s 99%) when the sample size is small
So you ran a test where B beat A, and it was an impressive lift – perhaps +30%, +50% or even +100%. And then you look at the absolute numbers – and see that the sample size was something like 425 visitors. If B was 100% better, it could be 21 vs 42 conversions.
So when we punch the numbers into a calculator, we can definitely see how this could be significant.
BUT – hold your horses. Calculating statistical significance is an exercise in algebra; it’s not telling you what the reality is.
The thing is that since the sample size is so tiny (only 425 visitors), the result is prone to change dramatically if you keep the experiment going and increase the sample: the lift either vanishes or becomes much smaller (regression toward the mean). I typically ignore test results that have fewer than 250-350 conversions per variation, since I’ve seen time and again that those numbers will change if you keep the test running and the sample size gets bigger.
Anyone who has experience of running hundreds of tests can tell you that. A lot of the “early wins” disappear as you test longer, and increase the sample size.
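Here is a quick sketch of what the calculator does with the 21 vs 42 example, plus the part most calculators don't show you: how wide the plausible range for the real difference is. I'm assuming 425 visitors in each variation (the example above doesn't say whether 425 is the total or per variation) and a plain two-proportion z-test with a normal-approximation interval.

```python
import math
from statistics import NormalDist


def compare(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (p-value, interval low, interval high) for the rate difference B - A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    # Significance test: pooled standard error under "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))
    # Interval estimate: unpooled standard error of the observed difference.
    se_diff = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_value, diff - z * se_diff, diff + z * se_diff


p_value, ci_low, ci_high = compare(21, 425, 42, 425)
print(f"p-value: {p_value:.4f}")
print(f"95% interval for the absolute difference: {ci_low:.3f} to {ci_high:.3f}")
```

Under these assumptions the p-value comes out below 0.01, so the calculator happily reports 99% significance. But the interval for the difference runs from about 1.4 to about 8.4 percentage points. The data can't tell a modest improvement apart from a spectacular one, which is why the "+100%" tends to shrink once more data comes in.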
I run most of my tests for at least 4 full weeks (even if the needed sample size is reached much earlier) – unless I get proof first that the numbers stabilize sooner (2 or 3 weeks) for a given site.
With low traffic, you need bigger wins to run a test per month, but…
Many sites have low traffic and low total monthly transaction count. So in order to call a test within 30 days, you need a big lift. Kyle Rush from Optimizely explains it eloquently here.
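To get a feel for how big "big" is, you can turn the sample size question around: given the traffic you can collect in a month, what is the smallest lift you could reasonably detect? A rough sketch, assuming a 3% baseline conversion rate, 5% significance and 80% power. It uses a simplified approximation that applies the baseline variance to both variations, so treat the output as an order-of-magnitude guide, not a precise threshold.

```python
import math
from statistics import NormalDist


def min_detectable_relative_lift(baseline, visitors_per_variation, alpha=0.05, power=0.80):
    """Approximate smallest relative lift detectable with the given sample."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Simplification: both arms are assumed to have the baseline variance.
    absolute = (z_alpha + z_beta) * math.sqrt(
        2 * baseline * (1 - baseline) / visitors_per_variation
    )
    return absolute / baseline


for monthly in (5_000, 50_000, 500_000):
    lift = min_detectable_relative_lift(0.03, monthly)
    print(f"{monthly:>7} visitors per variation per month: ~{lift:.0%} relative lift needed")
```

With 5,000 visitors per variation the required lift lands somewhere around 30%, while a site with a hundred times the traffic can pick up lifts ten times smaller.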
If you have bigger wins (e.g. +50%), you definitely can get by with smaller sample sizes. But it would be naive to think that smaller sites somehow can get bigger wins more easily than large sites. Everyone wants big wins. So saying “I’m going to swing big” is quite meaningless.
The only true tidbit here is that in order to get a more radical lift, you also need to test a more radical change. You can’t expect a large win when you just change the call to action.
Also, keep in mind: testing is not a mandatory component of optimization. You can also improve without testing.
Without seeing absolute numbers, be very suspicious
Most A/B testing case studies only publish relative increases. We got a 20% lift! 30% more signups! That’s very good, we want to know the relative difference. But can we trust these claims? Without knowing the absolute numbers, we can’t.
There are many reasons why someone doesn’t want to publish absolute numbers (fear of humiliation, fear of competition, overzealous legal department etc). I get it. There are a lot of case studies I’d like to publish, but my clients won’t allow it.
But the point remains – unless you can see the test duration, total sample size and conversion count per variation, you should remain skeptical. There’s a high chance they didn’t do it right, and the lift is imaginary.
Before you can declare a test “cooked”, make sure there’s adequate sample size and test duration (to ensure good representativeness). Only then look at confidence levels.