Long live A/B testing
A/B testing is the backbone of tech companies. They try, fail, and quickly ship what works while discarding what does not. They trust the results as if they came straight from the C-level: the A/B test decides, and everyone obeys. I am the guy working in data science telling you that maybe you should listen a little less to the “sample size calculator” and the “paired t-test result”.
The more you do something, the more confident you become doing it. The more you automate your A/B testing, the more you feel like “this works like a charm and I get 1% uplift, this pays for my salary, right?”
So what goes wrong in practice? How is it that you shipped 100 feature changes with a 1% uplift each, yet your final metric moved by only 3% after a year of tough work? Let’s have a look at why your successful 2-week A/B test went wrong.
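To see how large that gap really is, a quick back-of-the-envelope calculation: if each of the 100 changes truly delivered an independent, persistent 1% uplift, the compounded effect would be about 1.01^100 ≈ 2.7x, not 1.03x.

```python
# If 100 independent changes each truly delivered a persistent 1% uplift,
# the compounded effect would be far larger than the 3% observed.
expected = 1.01 ** 100  # roughly 2.7x, i.e. ~170% uplift
observed = 1.03         # the 3% actually measured after a year

print(f"expected compounded uplift: {expected - 1:.0%}")
print(f"observed uplift: {observed - 1:.0%}")
```

Most of the "missing" uplift is exactly what the rest of this post is about: effects that were real for two weeks but did not persist, or were never real in the first place.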
Customer-facing new features have a natural novelty effect; other features may trigger a competitor-reaction effect. Novelty generates clicks, and then it fades away. With a novelty effect, you will typically get the following graph.
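A minimal sketch of why this bites, under made-up numbers: assume a novelty uplift that starts at 5% and decays exponentially with a (hypothetical) 2-week half-life. The 2-week experiment measures a healthy average uplift, while the uplift actually realised over the following year is close to zero.

```python
def uplift(day, initial=0.05, half_life=14):
    """Hypothetical novelty effect: uplift decays exponentially with time."""
    return initial * 0.5 ** (day / half_life)

# Average uplift observed during the 2-week experiment window
test_window = sum(uplift(d) for d in range(14)) / 14
# Average uplift realised over the following year
long_run = sum(uplift(d) for d in range(14, 379)) / 365

print(f"measured in the 2-week test: {test_window:.2%}")
print(f"realised over the next year: {long_run:.2%}")
```

With these assumptions the test reports roughly a 3–4% uplift while the long-run effect is a fraction of a percent, which is exactly the shape of the graph above.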
Competitors’ reactions take a bit more time, but will eventually make your feature “the norm”, and the impact diminishes. In some systems only the final outcome matters; in that case the declining effect is never observed. It can also be that the effect declines long after the real impact was measured.
Too many A/B tests running
Have you ever heard “we have a great traffic-splitting system that can run hundreds of experiments concurrently”? Well, it works in many cases, until you have a lot of experiments and one is not fully randomised. In practice, most traffic splitting is done on a part of the population with some mutual exclusion. But imagine the case where an ML algorithm is serving recommendations and you are simultaneously experimenting on the service-selection menu right above those recommendations.
Your two experiments are deemed mostly mutually exclusive (no interference), so their populations are allowed to overlap. Unfortunately, the ML-model experiment performed a lot better on the most loyal users but the same on the rest of the users, and somehow your menu experiment performed worse…
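Here is a small simulation of that failure mode, with entirely made-up numbers: the concurrent ML experiment lifts conversion for loyal users only, the menu change has zero true effect, and the randomisation is slightly imperfect, so ML-treated loyal users are over-represented in one menu arm. The menu arm that received fewer of them looks worse, even though the menu change did nothing.

```python
import random

random.seed(42)

# Hypothetical population: 20% loyal users, 80% casual users.
# The concurrent ML-recommendation experiment lifts conversion for
# loyal users only; the menu experiment has ZERO true effect.
def conversion(loyal, ml_treated):
    base = 0.10 if loyal else 0.05
    return base + (0.04 if loyal and ml_treated else 0.0)

def run_menu_experiment(n=200_000, imbalance=0.15):
    arms = {"A": [], "B": []}
    for _ in range(n):
        loyal = random.random() < 0.20
        ml_treated = random.random() < 0.50
        # Imperfect randomisation: ML-treated loyal users are slightly
        # over-represented in menu arm A instead of a clean 50/50 split.
        p_arm_a = 0.50 + (imbalance if loyal and ml_treated else 0.0)
        arm = "A" if random.random() < p_arm_a else "B"
        arms[arm].append(conversion(loyal, ml_treated))
    return {k: sum(v) / len(v) for k, v in arms.items()}

rates = run_menu_experiment()
print(f"menu arm A: {rates['A']:.4f}, menu arm B: {rates['B']:.4f}")
# Arm B looks worse even though the menu change does nothing at all.
```

The gap between the arms here is entirely an artifact of the overlapping ML experiment leaking into the menu test's randomisation, which is exactly the scenario described above.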