4/20/2022
Published in Towards Data Science.
Data validation issues can quickly deflate an A/B testing program, and Sample Ratio Mismatch (SRM) is one of the most damaging. When the control and variation groups deviate from the expected 50/50 split (e.g. by ~7%), should we stop the test? How sure can we be that a problem actually exists? What about false positives, or the traffic volume needed to run a reliable SRM check?
This article uses Python simulations (rather than theory alone) to explore these questions. It simulates traffic being randomly assigned to A and B (A/A-style experiments), runs the Chi-squared SRM check from a previous TDS article, and aggregates results across thousands of runs at different traffic volumes.
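The core loop of that simulation looks roughly like the sketch below. It is illustrative rather than the article's exact code: the function names, run counts, and the alpha threshold are assumptions. Each of N users is assigned to A or B with a fair coin flip, a Chi-squared goodness-of-fit test compares the observed counts against a 50/50 split, and the process is repeated to estimate how often SRM gets flagged when no real mismatch exists.

```python
import numpy as np
from scipy.stats import chisquare

def srm_p_value(n_a, n_b):
    """Chi-squared goodness-of-fit test of observed counts against a 50/50 split."""
    total = n_a + n_b
    return chisquare([n_a, n_b], f_exp=[total / 2, total / 2]).pvalue

def false_positive_rate(n_users, n_runs=10_000, alpha=0.01, seed=42):
    """Simulate A/A-style assignment n_runs times and count spurious SRM flags."""
    rng = np.random.default_rng(seed)
    flags = 0
    for _ in range(n_runs):
        n_a = rng.binomial(n_users, 0.5)   # fair coin-flip assignment
        n_b = n_users - n_a
        if srm_p_value(n_a, n_b) < alpha:  # flagged despite no real mismatch
            flags += 1
    return flags / n_runs

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} users: Type I rate ~ {false_positive_rate(n):.3f}")
```

With p < 0.01 as the threshold, the flag rate should hover near 1% at every traffic level, which is the pattern the article reports.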
Findings include: the Type I (false positive) rate stays around 1% when using p < 0.01, regardless of traffic volume; at low traffic (e.g. 1k–2k users) the average and maximum differences between groups are larger, while the Chi-squared test becomes more sensitive as traffic grows (e.g. a difference of less than 5% is enough to trigger an SRM flag at 10k+ users). On continuous monitoring: checking for SRM repeatedly as traffic accumulates inflates the chance of a false positive at the test level (e.g. ~17% in the corrected simulation), so the advice is to monitor for SRM early and often but to watch for trends rather than react to a single check. The piece includes code, result tables, and plots (e.g. SRM metrics vs. traffic, standard deviation vs. volume).
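To see why repeated looks inflate the test-level false-positive rate, a sketch along these lines can be used (again illustrative: the number of interim checks, the total traffic, and the run count are assumptions, not the article's exact setup). A single simulated A/A test is checked for SRM at many points as traffic accumulates, and the test counts as a false positive if any one look dips below the threshold.

```python
import numpy as np
from scipy.stats import chisquare

def continuous_monitoring_fp_rate(total_users=50_000, n_checks=50,
                                  n_runs=1_000, alpha=0.01, seed=7):
    """Share of A/A tests flagged for SRM at least once across repeated looks."""
    rng = np.random.default_rng(seed)
    checkpoints = np.linspace(total_users / n_checks, total_users,
                              n_checks, dtype=int)
    flagged = 0
    for _ in range(n_runs):
        # 0 = assigned to A, 1 = assigned to B; cumulative sum gives B's running count
        cum_b = np.cumsum(rng.integers(0, 2, size=total_users))
        for n in checkpoints:
            n_b = int(cum_b[n - 1])
            n_a = n - n_b
            p = chisquare([n_a, n_b], f_exp=[n / 2, n / 2]).pvalue
            if p < alpha:          # any single look below alpha flags the whole test
                flagged += 1
                break
    return flagged / n_runs

print(continuous_monitoring_fp_rate())
```

Even though each individual look uses alpha = 0.01, the chance that at least one look flags SRM is much higher, which is the mechanism behind the elevated test-level rate the article describes and the reason to watch for a sustained trend rather than act on a single check.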