A/B Testing Mistakes — 20 Most Common Statistical Errors
A comprehensive reference of the 20 most common A/B testing mistakes organized by category, with severity ratings, real-world examples, and the correct approach for each one.
By Michael Lip · Updated April 2026
Methodology
Mistakes were identified by analyzing top-voted A/B testing questions on StackOverflow and stats.stackexchange.com via the StackExchange API v2.3, cross-referenced with published case studies from Google, Microsoft (ExP team), Booking.com, and Optimizely. Severity is rated Critical (invalidates results), High (significantly biases results), or Medium (reduces reliability). Each mistake includes a concrete example and the statistically correct alternative.
| # | Mistake | Category | Severity | Example | Correct Approach |
|---|---|---|---|---|---|
| 1 | Peeking and stopping early | Statistical | Critical | Stopping test after 3 days because p < 0.05 | Use sequential testing (SPRT) or run to pre-calculated sample size |
| 2 | No sample size calculation | Design | Critical | Running test until "it feels like enough data" | Calculate required n using power formula before launch |
| 3 | Multiple comparisons without correction | Statistical | Critical | Testing 20 metrics, declaring winner on any p < 0.05 | Apply Bonferroni or Benjamini-Hochberg correction |
| 4 | Using one-tailed test inappropriately | Statistical | High | One-tailed test to "make it easier to reach significance" | Use two-tailed unless you truly cannot act on decreases |
| 5 | Ignoring minimum detectable effect | Design | High | Expecting to detect 1% lift with 500 daily visitors | Set realistic MDE based on traffic and business impact |
| 6 | Not running full business cycles | Design | High | Running Mon-Thu and missing weekend traffic patterns | Always run at least 7 days (full week); ideally 2-4 weeks |
| 7 | Changing test mid-flight | Design | Critical | Modifying variant B's copy halfway through the test | Stop current test, start a new one with changes |
| 8 | Non-random assignment | Design | Critical | Assigning mobile to A and desktop to B | Use proper randomization at user level with persistent bucketing |
| 9 | Survivorship bias in analysis | Analysis | High | Only analyzing users who completed checkout | Use intent-to-treat: analyze all assigned users |
| 10 | Ignoring novelty effect | Analysis | High | Declaring winner after 1 week of an exciting new design | Run 2-4 weeks; segment new vs. returning users |
| 11 | Simpson's Paradox | Analysis | High | B wins in every segment but loses in aggregate | Check randomization balance; analyze segments and aggregate |
| 12 | Testing too many variants | Design | High | Running A/B/C/D/E with 1,000 daily visitors | Limit variants; n per variant must meet sample size requirement |
| 13 | Confusing statistical and practical significance | Analysis | Medium | p=0.04 for a 0.01% conversion lift — "significant!" | Always evaluate effect size and business impact alongside p-value |
| 14 | Not tracking guardrail metrics | Design | High | Conversion up 5% but page load time up 40% | Define guardrail metrics (latency, errors, revenue) before launch |
| 15 | Segment post-hoc fishing | Analysis | High | Slicing by 50 segments to find one where B wins | Pre-register segments of interest; correct for multiplicity |
| 16 | Wrong unit of randomization | Design | Critical | Randomizing per pageview, user sees both A and B | Randomize per user with consistent hashing or cookie-based bucketing (see the sketch after the table) |
| 17 | Ignoring interaction effects | Analysis | Medium | Running two tests simultaneously that affect the same page | Use factorial design or ensure non-overlapping test surfaces |
| 18 | Using conversion rate as only metric | Analysis | Medium | Conversion up but average order value down 20% | Track revenue per visitor as primary metric; conversion as secondary |
| 19 | Not validating tracking implementation | Design | High | A/A test shows 15% difference due to broken tracking | Run A/A test first to validate instrumentation and randomization |
| 20 | Declaring no effect from underpowered test | Statistical | High | "No significant difference" with 200 total conversions | Distinguish "no effect" from "unable to detect effect"; report CI |
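The fix for #16 (wrong unit of randomization) is deterministic user-level bucketing. A minimal sketch in Python, assuming hash-based assignment with the experiment name as a salt; `assign_variant`, the 50/50 split, and the identifiers are illustrative, not any particular platform's API:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically bucket a user so they see the same variant on every pageview.

    Salting with the experiment name keeps assignments stable within a test
    and independent across tests.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Same user, same experiment -> same bucket, no matter how many pageviews.
assert assign_variant("user-123", "checkout-copy") == assign_variant("user-123", "checkout-copy")
```

The same helper doubles as a quick instrumentation check (#19): hashing a large list of user IDs offline should split them roughly 50/50 before any real traffic is sent.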
Key Insights
6 of 20 mistakes are Critical — they invalidate results entirely. Peeking, no sample size calc, multiple comparisons, mid-flight changes, non-random assignment, and wrong randomization unit are not minor errors. They make your test results meaningless regardless of sample size or duration.
Design mistakes are preventable with a pre-test checklist. 9 of the 20 mistakes are design errors that occur before the test even starts. A simple checklist (sample size calculated? randomization unit correct? guardrails defined? A/A test passed?) prevents most of them.
Analysis mistakes are subtler but equally dangerous. Simpson's Paradox, survivorship bias, and segment fishing can all reverse the true direction of an effect. These require statistical literacy to detect — automated tools can flag some but not all of them.
The peeking problem is the #1 mistake in practice. Virtually every team that manually monitors A/B tests will peek. The solution is either automated stopping rules (sequential testing) or strict organizational discipline to wait for the predetermined sample size.
Frequently Asked Questions
What is the most common A/B testing mistake?
Peeking at results and stopping early when a "significant" result appears. This inflates the false positive rate from 5% to as high as 30%. The p-value is only valid at the pre-determined sample size. Use sequential testing (SPRT) or alpha-spending functions if you need to monitor continuously.
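If you want to see the inflation directly, simulate an A/A comparison (no true difference) and test for significance after every day of traffic. A rough sketch; the 5% baseline, 1,000 users per arm per day, and 20 daily peeks are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def two_prop_p(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for a difference in proportions."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

def peeker_declares_winner(base_rate=0.05, daily_n=1000, days=20, alpha=0.05):
    """Return True if an A/A test (no real effect) ever looks 'significant' to a daily peeker."""
    conv_a = conv_b = n = 0
    for _ in range(days):
        conv_a += rng.binomial(daily_n, base_rate)
        conv_b += rng.binomial(daily_n, base_rate)
        n += daily_n
        if two_prop_p(conv_a, n, conv_b, n) < alpha:
            return True  # the peeking analyst stops here and ships a "winner"
    return False

fp_rate = np.mean([peeker_declares_winner() for _ in range(2000)])
print(f"False positive rate with daily peeking: {fp_rate:.0%}")  # well above the nominal 5%
```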
How does the multiple comparisons problem affect A/B tests?
Testing multiple metrics or variants without correction increases false positives. With 20 metrics at alpha=0.05, expect about one false positive by chance even if nothing actually changed (roughly a 64% chance of at least one). Apply Bonferroni (alpha/k) or Benjamini-Hochberg correction. Always pre-specify your primary metric before launching the test.
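A sketch of both corrections on a set of per-metric p-values (the numbers are invented for illustration); statsmodels' `multipletests` does the bookkeeping if it is available:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 20 secondary metrics in a single experiment.
p_values = [0.0004, 0.001, 0.006, 0.008, 0.03, 0.07, 0.09, 0.12, 0.18, 0.22,
            0.27, 0.33, 0.41, 0.46, 0.52, 0.58, 0.64, 0.71, 0.83, 0.95]

# Bonferroni: compare each p-value against alpha / k (controls family-wise error, most conservative).
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate, less conservative.
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Naive 'significant' metrics:  ", sum(p < 0.05 for p in p_values))  # typically the largest count
print("Benjamini-Hochberg rejections:", sum(reject_bh))
print("Bonferroni rejections:        ", sum(reject_bonf))                 # typically the smallest count
```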
Why is it wrong to run an A/B test without calculating sample size first?
Without pre-calculated sample size, you have no power guarantee. An underpowered test (30% power) misses 70% of real improvements, leading to false "no effect" conclusions. The formula requires baseline rate, MDE, alpha, and desired power (typically 80%).
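Spelled out in code, the formula for a two-proportion test looks like the sketch below; the 4% baseline and 10% relative MDE are illustrative inputs, not recommendations:

```python
import math
from scipy.stats import norm

def required_n_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided two-proportion z-test.

    baseline:      control conversion rate, e.g. 0.04 for 4%
    relative_mde:  minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 10% relative lift on a 4% baseline needs close to 40,000 users per arm,
# which is the reality check behind mistake #5 (ignoring the MDE).
print(required_n_per_variant(baseline=0.04, relative_mde=0.10))
```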
What is Simpson's Paradox in A/B testing?
Simpson's Paradox occurs when a trend in segmented data reverses in the aggregate. Variant B might win among both mobile and desktop users, yet A wins overall — because mobile users (lower conversion) were disproportionately assigned to B. Check randomization balance and analyze both segments and aggregate.
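A tiny fabricated dataset makes the reversal concrete (all numbers below are invented purely to reproduce the paradox):

```python
import pandas as pd

# B beats A within every segment, yet loses overall, because mobile users
# (who convert less) were disproportionately assigned to B.
df = pd.DataFrame({
    "variant":     ["A", "B", "A", "B"],
    "segment":     ["mobile", "mobile", "desktop", "desktop"],
    "users":       [200, 800, 800, 200],
    "conversions": [8, 40, 80, 22],
})

per_segment = df.assign(rate=df.conversions / df.users)
print(per_segment[["segment", "variant", "rate"]])   # B wins mobile (5% vs 4%) and desktop (11% vs 10%)

overall = df.groupby("variant")[["users", "conversions"]].sum()
overall["rate"] = overall.conversions / overall.users
print(overall[["rate"]])                              # A wins in aggregate (8.8% vs 6.2%)

# The giveaway: the segment mix per variant is nowhere near balanced,
# which a sample ratio / randomization check would have flagged.
print(df.pivot_table(index="segment", columns="variant", values="users"))
```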
How do I avoid the novelty effect in A/B testing?
Run tests for at least 2-4 weeks to let novelty wear off. Segment new vs. returning users. Analyze daily conversion time series — if variant advantage decreases over time, novelty is a factor. Consider a holdback test after implementation to confirm the effect persists.
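One way to operationalize the time-series check is to regress daily lift on day number; a persistently negative slope suggests the advantage is decaying novelty rather than a durable effect. A minimal sketch on simulated data (every number here is invented; a real pipeline would use the experiment's daily rollup):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
days = np.arange(21)

# Simulated daily conversion rates: variant B starts with a ~10% lift that decays
# toward zero, mimicking a novelty effect; noise stands in for day-to-day variation.
rate_a = 0.050 + rng.normal(0, 0.001, size=days.size)
rate_b = 0.050 * (1 + 0.10 * np.exp(-days / 5)) + rng.normal(0, 0.001, size=days.size)
daily_lift = rate_b / rate_a - 1

# A clearly negative, significant slope means B's advantage is shrinking over time.
trend = linregress(days, daily_lift)
print(f"Daily lift slope: {trend.slope:+.4f} per day (p = {trend.pvalue:.3f})")
```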