Original Research

A/B Testing Mistakes — 20 Most Common Statistical Errors

A comprehensive reference of the 20 most common A/B testing mistakes organized by category, with severity ratings, real-world examples, and the correct approach for each one.

By Michael Lip · Updated April 2026

Methodology

Mistakes were identified by analyzing top-voted A/B testing questions on StackOverflow and stats.stackexchange.com via the StackExchange API v2.3, cross-referenced with published case studies from Google, Microsoft (ExP team), Booking.com, and Optimizely. Severity is rated Critical (invalidates results), High (significantly biases results), or Medium (reduces reliability). Each mistake includes a concrete example and the statistically correct alternative.

1. Peeking and stopping early (Statistical, Critical)
   Example: Stopping a test after 3 days because p < 0.05
   Correct approach: Use sequential testing (SPRT) or run to the pre-calculated sample size

2. No sample size calculation (Design, Critical)
   Example: Running the test until "it feels like enough data"
   Correct approach: Calculate the required n with a power formula before launch

3. Multiple comparisons without correction (Statistical, Critical)
   Example: Testing 20 metrics and declaring a winner on any p < 0.05
   Correct approach: Apply a Bonferroni or Benjamini-Hochberg correction

4. Using a one-tailed test inappropriately (Statistical, High)
   Example: Choosing a one-tailed test to "make it easier to reach significance"
   Correct approach: Use a two-tailed test unless you truly cannot act on decreases

5. Ignoring minimum detectable effect (Design, High)
   Example: Expecting to detect a 1% lift with 500 daily visitors
   Correct approach: Set a realistic MDE based on traffic and business impact

6. Not running full business cycles (Design, High)
   Example: Running Mon-Thu and missing weekend traffic patterns
   Correct approach: Always run at least 7 days (a full week); ideally 2-4 weeks

7. Changing the test mid-flight (Design, Critical)
   Example: Modifying variant B's copy halfway through the test
   Correct approach: Stop the current test and start a new one with the changes

8. Non-random assignment (Design, Critical)
   Example: Assigning mobile users to A and desktop users to B
   Correct approach: Randomize properly at the user level with persistent bucketing

9. Survivorship bias in analysis (Analysis, High)
   Example: Only analyzing users who completed checkout
   Correct approach: Use intent-to-treat: analyze all assigned users

10. Ignoring the novelty effect (Analysis, High)
    Example: Declaring a winner after 1 week of an exciting new design
    Correct approach: Run 2-4 weeks; segment new vs. returning users

11. Simpson's Paradox (Analysis, High)
    Example: B wins in every segment but loses in aggregate
    Correct approach: Check randomization balance; analyze both segments and the aggregate

12. Testing too many variants (Design, High)
    Example: Running A/B/C/D/E with 1,000 daily visitors
    Correct approach: Limit variants; n per variant must meet the sample size requirement

13. Confusing statistical and practical significance (Analysis, Medium)
    Example: p = 0.04 for a 0.01% conversion lift, declared "significant!"
    Correct approach: Always evaluate effect size and business impact alongside the p-value

14. Not tracking guardrail metrics (Design, High)
    Example: Conversion up 5% but page load time up 40%
    Correct approach: Define guardrail metrics (latency, errors, revenue) before launch

15. Post-hoc segment fishing (Analysis, High)
    Example: Slicing by 50 segments to find one where B wins
    Correct approach: Pre-register segments of interest; correct for multiplicity

16. Wrong unit of randomization (Design, Critical)
    Example: Randomizing per pageview, so a user sees both A and B
    Correct approach: Randomize per user with consistent hashing or cookie-based bucketing

17. Ignoring interaction effects (Analysis, Medium)
    Example: Running two tests simultaneously that affect the same page
    Correct approach: Use a factorial design or ensure non-overlapping test surfaces

18. Using conversion rate as the only metric (Analysis, Medium)
    Example: Conversion up but average order value down 20%
    Correct approach: Track revenue per visitor as the primary metric; conversion as secondary

19. Not validating the tracking implementation (Design, High)
    Example: An A/A test shows a 15% difference due to broken tracking
    Correct approach: Run an A/A test first to validate instrumentation and randomization

20. Declaring no effect from an underpowered test (Statistical, High)
    Example: "No significant difference" with 200 total conversions
    Correct approach: Distinguish "no effect" from "unable to detect an effect"; report a confidence interval
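Two of the Critical entries above (non-random assignment, #8, and wrong unit of randomization, #16) share the same fix: deterministic, user-level bucketing. A minimal sketch of consistent hashing, assuming a string user ID and an experiment name (both hypothetical names for illustration):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministic user-level bucketing: the same user always gets the same variant."""
    key = f"{experiment}:{user_id}".encode()
    # hashlib digests are stable across processes and machines;
    # Python's built-in hash() is salted per process and must not be used here.
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Including the experiment name in the hash key means the same user can land in different buckets across different experiments, which avoids correlated assignment between concurrent tests.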

Key Insights

6 of 20 mistakes are Critical — they invalidate results entirely. Peeking, no sample size calc, multiple comparisons, mid-flight changes, non-random assignment, and wrong randomization unit are not minor errors. They make your test results meaningless regardless of sample size or duration.

Design mistakes are preventable with a pre-test checklist. 9 of the 20 mistakes are design errors that occur before the test even starts. A simple checklist (sample size calculated? randomization unit correct? guardrails defined? A/A test passed?) prevents most of them.

Analysis mistakes are subtler but equally dangerous. Simpson's Paradox, survivorship bias, and segment fishing can all reverse the true direction of an effect. These require statistical literacy to detect — automated tools can flag some but not all of them.

The peeking problem is the #1 mistake in practice. Virtually every team that manually monitors A/B tests will peek. The solution is either automated stopping rules (sequential testing) or strict organizational discipline to wait for the predetermined sample size.

Frequently Asked Questions

What is the most common A/B testing mistake?

Peeking at results and stopping early when a "significant" result appears. This inflates the false positive rate from 5% to as high as 30%. The p-value is only valid at the pre-determined sample size. Use sequential testing (SPRT) or alpha-spending functions if you need to monitor continuously.
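The inflation is easy to demonstrate by simulation. The sketch below runs repeated A/A tests (no true effect) with ten interim z-tests per experiment; the baseline rate and check counts are illustrative assumptions, not figures from the case studies above:

```python
import random

def peeking_false_positive_rate(n_sims=2000, n_per_arm=1000, checks=10,
                                z_crit=1.96, seed=7):
    """Fraction of A/A simulations where ANY interim check crossed significance."""
    rng = random.Random(seed)
    p = 0.10                     # assumed baseline conversion rate (illustrative)
    step = n_per_arm // checks   # users added per arm between checks
    false_pos = 0
    for _ in range(n_sims):
        a = b = n = 0
        for _ in range(checks):
            for _ in range(step):
                a += rng.random() < p    # both arms drawn from the SAME rate
                b += rng.random() < p
            n += step
            pooled = (a + b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            # Two-proportion pooled z-test at this interim look
            if se > 0 and abs(a / n - b / n) / se > z_crit:
                false_pos += 1
                break
        # Sims that never crossed the threshold count as true negatives
    return false_pos / n_sims
```

With ten looks, the any-look false positive rate lands well above the nominal 5%, which is exactly why sequential designs adjust the threshold at each look.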

How does the multiple comparisons problem affect A/B tests?

Testing multiple metrics or variants without correction increases false positives. With 20 metrics at alpha=0.05, expect 1 false positive by chance. Apply Bonferroni (alpha/k) or Benjamini-Hochberg correction. Always pre-specify your primary metric before launching the test.
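Both corrections are simple enough to implement directly. A minimal sketch (the p-values in the usage test are made up for illustration):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p < alpha / k (controls family-wise error rate)."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up (controls false discovery rate): sort p-values ascending,
    find the largest rank i with p_(i) <= (i / k) * alpha, reject all below it."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / k * alpha:
            cutoff = rank
    reject = [False] * k
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            reject[i] = True
    return reject
```

Bonferroni is stricter (it divides alpha across all k tests), while Benjamini-Hochberg trades a small tolerated rate of false discoveries for more power, which is usually the better fit for secondary metrics.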

Why is it wrong to run an A/B test without calculating sample size first?

Without pre-calculated sample size, you have no power guarantee. An underpowered test (30% power) misses 70% of real improvements, leading to false "no effect" conclusions. The formula requires baseline rate, MDE, alpha, and desired power (typically 80%).
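That formula can be sketched with the standard two-proportion power calculation; `sample_size_per_arm` and its parameters are names chosen here for illustration:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Required n per arm for a two-sided two-proportion z-test.
    baseline: control conversion rate; mde: absolute lift to detect."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)
```

For example, detecting an absolute lift from 10% to 12% at 80% power requires roughly 3,800 users per arm, which makes concrete why a 1% lift with 500 daily visitors (mistake #5) is unrealistic.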

What is Simpson's Paradox in A/B testing?

Simpson's Paradox occurs when a trend in segmented data reverses in the aggregate. Variant B might win among both mobile and desktop users, yet A wins overall — because mobile users (lower conversion) were disproportionately assigned to B. Check randomization balance and analyze both segments and aggregate.
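The reversal is easiest to see with numbers. The counts below are hypothetical, chosen so that B wins in both segments while low-converting mobile users are disproportionately assigned to B:

```python
def cr(conversions, users):
    """Conversion rate for a (conversions, users) pair."""
    return conversions / users

# (conversions, users) per segment and variant -- illustrative, not real data
mobile_a,  mobile_b  = (200, 10_000),   (1_800, 80_000)
desktop_a, desktop_b = (8_000, 80_000), (1_100, 10_000)

# B wins in each segment...
print(cr(*mobile_a), cr(*mobile_b))      # mobile: A 2.0% vs. B 2.25%
print(cr(*desktop_a), cr(*desktop_b))    # desktop: A 10.0% vs. B 11.0%

# ...yet A wins in aggregate, because 80,000 of B's 90,000 users are mobile
agg_a = cr(200 + 8_000, 90_000)
agg_b = cr(1_800 + 1_100, 90_000)
print(agg_a, agg_b)                      # aggregate: A ~9.1% vs. B ~3.2%
```

The imbalance itself is the warning sign: with proper user-level randomization, the mobile/desktop mix should be nearly identical between arms.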

How do I avoid the novelty effect in A/B testing?

Run tests for at least 2-4 weeks to let novelty wear off. Segment new vs. returning users. Analyze daily conversion time series — if variant advantage decreases over time, novelty is a factor. Consider a holdback test after implementation to confirm the effect persists.
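One simple version of the time-series check is a least-squares slope on the daily lift: a clearly negative slope suggests the variant's advantage is decaying as novelty wears off. A minimal sketch (the lift values in the test are invented for illustration):

```python
def lift_trend(daily_lifts):
    """Least-squares slope of daily lift (B minus A) against day index.
    A clearly negative slope is consistent with novelty decay."""
    n = len(daily_lifts)
    x_mean = (n - 1) / 2
    y_mean = sum(daily_lifts) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_lifts))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den
```

A slope near zero over several weeks is the reassuring case; confirming with a post-launch holdback, as the answer above suggests, is still the stronger check.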