A/B Testing Mistakes — 20 Most Common Statistical Errors
A comprehensive reference of the 20 most common A/B testing mistakes organized by category, with severity ratings, real-world examples, and the correct approach for each one.
By Michael Lip · Updated April 2026
Methodology
Mistakes were identified by analyzing top-voted A/B testing questions on StackOverflow and stats.stackexchange.com via the StackExchange API v2.3, cross-referenced with published case studies from Google, Microsoft (ExP team), Booking.com, and Optimizely. Severity is rated Critical (invalidates results), High (significantly biases results), or Medium (reduces reliability). Each mistake includes a concrete example and the statistically correct alternative.
| # | Mistake | Category | Severity | Example | Correct Approach |
|---|---|---|---|---|---|
| 1 | Peeking and stopping early | Statistical | Critical | Stopping test after 3 days because p < 0.05 | Use sequential testing (SPRT) or run to pre-calculated sample size |
| 2 | No sample size calculation | Design | Critical | Running test until "it feels like enough data" | Calculate required n using power formula before launch |
| 3 | Multiple comparisons without correction | Statistical | Critical | Testing 20 metrics, declaring winner on any p < 0.05 | Apply Bonferroni or Benjamini-Hochberg correction |
| 4 | Using one-tailed test inappropriately | Statistical | High | One-tailed test to "make it easier to reach significance" | Use two-tailed unless you truly cannot act on decreases |
| 5 | Ignoring minimum detectable effect | Design | High | Expecting to detect 1% lift with 500 daily visitors | Set realistic MDE based on traffic and business impact |
| 6 | Not running full business cycles | Design | High | Running Mon-Thu and missing weekend traffic patterns | Always run at least 7 days (full week); ideally 2-4 weeks |
| 7 | Changing test mid-flight | Design | Critical | Modifying variant B's copy halfway through the test | Stop current test, start a new one with changes |
| 8 | Non-random assignment | Design | Critical | Assigning mobile to A and desktop to B | Use proper randomization at user level with persistent bucketing |
| 9 | Survivorship bias in analysis | Analysis | High | Only analyzing users who completed checkout | Use intent-to-treat: analyze all assigned users |
| 10 | Ignoring novelty effect | Analysis | High | Declaring winner after 1 week of an exciting new design | Run 2-4 weeks; segment new vs. returning users |
| 11 | Simpson's Paradox | Analysis | High | B wins in every segment but loses in aggregate | Check randomization balance; analyze segments and aggregate |
| 12 | Testing too many variants | Design | High | Running A/B/C/D/E with 1,000 daily visitors | Limit variants; n per variant must meet sample size requirement |
| 13 | Confusing statistical and practical significance | Analysis | Medium | p=0.04 for a 0.01% conversion lift — "significant!" | Always evaluate effect size and business impact alongside p-value |
| 14 | Not tracking guardrail metrics | Design | High | Conversion up 5% but page load time up 40% | Define guardrail metrics (latency, errors, revenue) before launch |
| 15 | Segment post-hoc fishing | Analysis | High | Slicing by 50 segments to find one where B wins | Pre-register segments of interest; correct for multiplicity |
| 16 | Wrong unit of randomization | Design | Critical | Randomizing per pageview, user sees both A and B | Randomize per user with consistent hashing or cookie-based bucketing (see the sketch after the table) |
| 17 | Ignoring interaction effects | Analysis | Medium | Running two tests simultaneously that affect the same page | Use factorial design or ensure non-overlapping test surfaces |
| 18 | Using conversion rate as only metric | Analysis | Medium | Conversion up but average order value down 20% | Track revenue per visitor as primary metric; conversion as secondary |
| 19 | Not validating tracking implementation | Design | High | A/A test shows 15% difference due to broken tracking | Run A/A test first to validate instrumentation and randomization |
| 20 | Declaring no effect from underpowered test | Statistical | High | "No significant difference" with 200 total conversions | Distinguish "no effect" from "unable to detect effect"; report CI |
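The fix for #16 (wrong unit of randomization) is deterministic user-level bucketing. A minimal sketch in Python, assuming hash-based assignment with the experiment name as a salt; `assign_variant`, the 50/50 split, and the identifiers are illustrative, not any particular platform's API:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically bucket a user so they see the same variant on every pageview.

    Salting with the experiment name keeps assignments stable within a test
    and independent across tests.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Same user, same experiment -> same bucket, no matter how many pageviews.
assert assign_variant("user-123", "checkout-copy") == assign_variant("user-123", "checkout-copy")
```

The same helper doubles as a quick instrumentation check (#19): hashing a large list of user IDs offline should split them roughly 50/50 before any real traffic is sent.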
Key Insights
6 of 20 mistakes are Critical — they invalidate results entirely. Peeking, no sample size calc, multiple comparisons, mid-flight changes, non-random assignment, and wrong randomization unit are not minor errors. They make your test results meaningless regardless of sample size or duration.
Design mistakes are preventable with a pre-test checklist. 9 of the 20 mistakes are design errors that occur before the test even starts. A simple checklist (sample size calculated? randomization unit correct? guardrails defined? A/A test passed?) prevents most of them.
Analysis mistakes are subtler but equally dangerous. Simpson's Paradox, survivorship bias, and segment fishing can all reverse the true direction of an effect. These require statistical literacy to detect — automated tools can flag some but not all of them.
The peeking problem is the #1 mistake in practice. Virtually every team that manually monitors A/B tests will peek. The solution is either automated stopping rules (sequential testing) or strict organizational discipline to wait for the predetermined sample size.
Frequently Asked Questions
What is the most common A/B testing mistake?
Peeking at results and stopping early when a "significant" result appears. This inflates the false positive rate from 5% to as high as 30%. The p-value is only valid at the pre-determined sample size. Use sequential testing (SPRT) or alpha-spending functions if you need to monitor continuously.
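If you want to see the inflation directly, simulate an A/A comparison (no true difference) and test for significance after every day of traffic. A rough sketch; the 5% baseline, 1,000 users per arm per day, and 20 daily peeks are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def two_prop_p(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for a difference in proportions."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

def peeker_declares_winner(base_rate=0.05, daily_n=1000, days=20, alpha=0.05):
    """Return True if an A/A test (no real effect) ever looks 'significant' to a daily peeker."""
    conv_a = conv_b = n = 0
    for _ in range(days):
        conv_a += rng.binomial(daily_n, base_rate)
        conv_b += rng.binomial(daily_n, base_rate)
        n += daily_n
        if two_prop_p(conv_a, n, conv_b, n) < alpha:
            return True  # the peeking analyst stops here and ships a "winner"
    return False

fp_rate = np.mean([peeker_declares_winner() for _ in range(2000)])
print(f"False positive rate with daily peeking: {fp_rate:.0%}")  # well above the nominal 5%
```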
How does the multiple comparisons problem affect A/B tests?
Testing multiple metrics or variants without correction increases false positives. With 20 metrics at alpha=0.05, expect about one false positive by chance even if nothing actually changed (roughly a 64% chance of at least one). Apply Bonferroni (alpha/k) or Benjamini-Hochberg correction. Always pre-specify your primary metric before launching the test.
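A sketch of both corrections on a set of per-metric p-values (the numbers are invented for illustration); statsmodels' `multipletests` does the bookkeeping if it is available:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 20 secondary metrics in a single experiment.
p_values = [0.0004, 0.001, 0.006, 0.008, 0.03, 0.07, 0.09, 0.12, 0.18, 0.22,
            0.27, 0.33, 0.41, 0.46, 0.52, 0.58, 0.64, 0.71, 0.83, 0.95]

# Bonferroni: compare each p-value against alpha / k (controls family-wise error, most conservative).
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate, less conservative.
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Naive 'significant' metrics:  ", sum(p < 0.05 for p in p_values))  # typically the largest count
print("Benjamini-Hochberg rejections:", sum(reject_bh))
print("Bonferroni rejections:        ", sum(reject_bonf))                 # typically the smallest count
```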
Why is it wrong to run an A/B test without calculating sample size first?
Without pre-calculated sample size, you have no power guarantee. An underpowered test (30% power) misses 70% of real improvements, leading to false "no effect" conclusions. The formula requires baseline rate, MDE, alpha, and desired power (typically 80%).
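Spelled out in code, the formula for a two-proportion test looks like the sketch below; the 4% baseline and 10% relative MDE are illustrative inputs, not recommendations:

```python
import math
from scipy.stats import norm

def required_n_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided two-proportion z-test.

    baseline:      control conversion rate, e.g. 0.04 for 4%
    relative_mde:  minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 10% relative lift on a 4% baseline needs close to 40,000 users per arm,
# which is the reality check behind mistake #5 (ignoring the MDE).
print(required_n_per_variant(baseline=0.04, relative_mde=0.10))
```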
What is Simpson's Paradox in A/B testing?
Simpson's Paradox occurs when a trend in segmented data reverses in the aggregate. Variant B might win among both mobile and desktop users, yet A wins overall — because mobile users (lower conversion) were disproportionately assigned to B. Check randomization balance and analyze both segments and aggregate.
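A tiny fabricated dataset makes the reversal concrete (all numbers below are invented purely to reproduce the paradox):

```python
import pandas as pd

# B beats A within every segment, yet loses overall, because mobile users
# (who convert less) were disproportionately assigned to B.
df = pd.DataFrame({
    "variant":     ["A", "B", "A", "B"],
    "segment":     ["mobile", "mobile", "desktop", "desktop"],
    "users":       [200, 800, 800, 200],
    "conversions": [8, 40, 80, 22],
})

per_segment = df.assign(rate=df.conversions / df.users)
print(per_segment[["segment", "variant", "rate"]])   # B wins mobile (5% vs 4%) and desktop (11% vs 10%)

overall = df.groupby("variant")[["users", "conversions"]].sum()
overall["rate"] = overall.conversions / overall.users
print(overall[["rate"]])                              # A wins in aggregate (8.8% vs 6.2%)

# The giveaway: the segment mix per variant is nowhere near balanced,
# which a sample ratio / randomization check would have flagged.
print(df.pivot_table(index="segment", columns="variant", values="users"))
```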
How do I avoid the novelty effect in A/B testing?
Run tests for at least 2-4 weeks to let novelty wear off. Segment new vs. returning users. Analyze daily conversion time series — if variant advantage decreases over time, novelty is a factor. Consider a holdback test after implementation to confirm the effect persists.
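One way to operationalize the time-series check is to regress daily lift on day number; a persistently negative slope suggests the advantage is decaying novelty rather than a durable effect. A minimal sketch on simulated data (every number here is invented; a real pipeline would use the experiment's daily rollup):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
days = np.arange(21)

# Simulated daily conversion rates: variant B starts with a ~10% lift that decays
# toward zero, mimicking a novelty effect; noise stands in for day-to-day variation.
rate_a = 0.050 + rng.normal(0, 0.001, size=days.size)
rate_b = 0.050 * (1 + 0.10 * np.exp(-days / 5)) + rng.normal(0, 0.001, size=days.size)
daily_lift = rate_b / rate_a - 1

# A clearly negative, significant slope means B's advantage is shrinking over time.
trend = linregress(days, daily_lift)
print(f"Daily lift slope: {trend.slope:+.4f} per day (p = {trend.pvalue:.3f})")
```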