A/B Test Duration Estimator
Calculate the minimum runtime for your A/B test based on daily traffic, baseline conversion rate, minimum detectable effect, and statistical power. Includes early stopping analysis with sequential testing boundaries, alpha spending functions, and a peeking penalty calculator that shows how checking results early inflates your false positive rate. All computation runs in your browser.
How A/B Test Duration Is Calculated
The minimum duration for an A/B test is determined by the required sample size, which depends on four key inputs: the baseline conversion rate, the minimum detectable effect (MDE), the significance level (alpha), and the statistical power (1 minus beta). The sample size formula for a two-proportion z-test is n = (z_{alpha/2} + z_{beta})^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2 per variant, where p1 is the baseline rate, p2 is p1 * (1 + MDE/100) for a relative MDE, and the z-values correspond to the standard normal quantiles for the chosen alpha and power levels.
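As a rough illustration, the formula above can be written in a few lines of Python. This is a minimal sketch, not the tool's own implementation: the function name, its parameter names, and the default alpha and power are assumptions made for the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde_pct, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = p1 * (1 + relative_mde_pct / 100)         # relative MDE, as in the text
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{beta}, with power = 1 - beta
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return ceil(n)                                 # round up to whole visitors


# Example: 5% baseline, 10% relative lift, alpha = 0.05, 80% power.
n = sample_size_per_variant(baseline=0.05, relative_mde_pct=10)
print(n)  # on the order of 31,000 visitors per variant
```

Because the difference p2 - p1 is squared in the denominator, halving the MDE roughly quadruples the required sample.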
Once you have the per-variant sample size, dividing by the daily traffic per variant gives the minimum number of days. However, this raw number should always be rounded up to complete full weeks. Day-of-week effects are real and substantial in most businesses: weekend traffic often differs dramatically from weekday traffic in both volume and behavior. A test that runs from Monday to Friday captures a biased sample. Always run in full seven-day increments to capture the complete weekly cycle, and consider extending to two or three full weeks if your traffic has strong weekly seasonality.
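The rounding step can be sketched as follows. The helper name and its arguments are hypothetical, and the sketch assumes traffic is constant from day to day, which real traffic rarely is.

```python
from math import ceil


def duration_in_full_weeks(n_per_variant, daily_visitors_per_variant):
    """Return (raw_days, days_rounded_up_to_full_weeks)."""
    raw_days = ceil(n_per_variant / daily_visitors_per_variant)
    full_weeks = ceil(raw_days / 7)  # never stop in the middle of a weekly cycle
    return raw_days, full_weeks * 7


raw, planned = duration_in_full_weeks(31231, 2000)
print(raw, planned)  # 16 raw days become a 21-day run
```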
The Danger of Peeking at Results
One of the most common mistakes in A/B testing is repeatedly checking whether results are significant before the planned sample size is reached. This practice, known as peeking or optional stopping, dramatically inflates the false positive rate. The intuition is straightforward: under the null hypothesis, the z-statistic follows a random walk. Given enough time, a random walk will eventually cross any fixed boundary. Each peek gives the random walk another chance to cross the significance threshold by chance.
The mathematics confirm this intuition. If you peek k times during a test at a nominal alpha of 0.05, the actual false positive rate is bounded above by 1 - (1 - alpha)^k, the rate you would see if every look were an independent test (the cruder union bound is k * alpha). For 10 peeks, this gives an upper bound of approximately 0.40, eight times the nominal rate. In practice, the actual inflation is less severe because the peeks are positively correlated (the z-statistic does not reset between checks), but it is still large: with equally spaced looks, numerical and simulation studies put the effective alpha at roughly 0.14 for 5 looks and 0.19 for 10, and daily peeking over a 30-day test pushes it above 0.25, depending on the traffic accumulation pattern.
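A small Monte Carlo experiment makes the inflation visible. The sketch below is an illustration under simplifying assumptions: data arrive in equal-sized batches, each batch contributes one standard normal increment under the null hypothesis, and the test is declared significant the first time |z| exceeds the fixed two-sided critical value. The function name, simulation count, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulated_false_positive_rate(looks, alpha=0.05, simulations=4000, seed=0):
    """Share of null experiments flagged significant at any of `looks` peeks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one more batch of null data
            z = running_sum / sqrt(look)        # z-statistic on all data so far
            if abs(z) > z_crit:                 # a peek that "looks significant"
                false_positives += 1
                break
    return false_positives / simulations


print(simulated_false_positive_rate(looks=1))   # close to the nominal 0.05
print(simulated_false_positive_rate(looks=10))  # several times the nominal rate
```

With a single look the estimate stays near 0.05; with ten looks it lands well above that, yet below the independent-looks bound of about 0.40, which is the effect of correlation between peeks.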
The peeking penalty calculator in this tool computes both the union bound, k * alpha (worst case), and the Sidak figure, 1 - (1 - alpha)^k (which assumes independent looks, so it is still conservative for correlated peeks). It also shows the Bonferroni-corrected alpha per peek, alpha / k, that you would need to use at each check in order to maintain the desired overall false positive rate. For most practical purposes, the solution is simpler: determine your sample size in advance, commit to the full test duration, and only analyze results after the planned sample is collected.
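The quantities named above can be sketched with their usual textbook definitions: the union bound k * alpha, the independent-looks (Sidak) rate 1 - (1 - alpha)^k, and the Bonferroni per-peek alpha, alpha / k. The function and key names are hypothetical, and the tool's own calculator may define or label these differently.

```python
def peeking_penalty(peeks, alpha=0.05):
    """Upper bounds on the inflated error rate, plus corrected per-peek alphas."""
    return {
        # Worst case: error probabilities simply add up (capped at 1).
        "union_bound": min(1.0, peeks * alpha),
        # Rate if every peek were an independent test.
        "sidak_rate": 1 - (1 - alpha) ** peeks,
        # Per-peek alpha that keeps the overall rate at or below alpha.
        "bonferroni_alpha_per_peek": alpha / peeks,
        # Slightly less strict per-peek alpha under the independence assumption.
        "sidak_alpha_per_peek": 1 - (1 - alpha) ** (1 / peeks),
    }


for name, value in peeking_penalty(peeks=10).items():
    print(f"{name}: {value:.4f}")
```

Both corrections treat the peeks as unrelated tests, so they cost more power than the sequential boundaries discussed in the next section.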
Sequential Testing: Principled Early Stopping
Sequential testing methods provide a mathematically rigorous framework for looking at your data during the test without inflating the false positive rate. The core idea is to replace the single fixed threshold with an adjusted, stricter-than-nominal significance boundary at each interim analysis, carefully budgeting the total alpha across all looks. The two most common approaches are the O'Brien-Fleming and Pocock group sequential boundaries.
O'Brien-Fleming boundaries start extremely conservative and become progressively more lenient. At the first interim look (say, 20% of the planned sample), the critical z-value might be 4.56, making early stopping virtually impossible unless the effect is enormous. By the final analysis, the critical z-value is approximately 2.04 for a 5-look design, barely more than the standard 1.96. This means you lose almost no power compared to a fixed-sample test. The O'Brien-Fleming approach is ideal when you expect moderate effects and want the option to stop early only for surprisingly large effects.
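The shape of these boundaries is easy to reproduce: for equally spaced looks, the critical value at look k of K is the final critical value multiplied by sqrt(K / k). The sketch below takes the final value 2.04 from the text as an input rather than deriving it, since computing that constant from scratch requires numerical integration over the joint distribution of the interim statistics. The function name is an assumption.

```python
from math import sqrt


def obrien_fleming_boundaries(final_critical_z, looks):
    """Critical z at each of `looks` equally spaced analyses."""
    return [final_critical_z * sqrt(looks / k) for k in range(1, looks + 1)]


bounds = obrien_fleming_boundaries(final_critical_z=2.04, looks=5)
print([round(z, 2) for z in bounds])  # roughly 4.56, 3.23, 2.63, 2.28, 2.04
```

The first value matches the 4.56 quoted above for a look at 20% of the planned sample.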
Pocock boundaries use the same critical z-value at every interim look. For a 5-look design at alpha = 0.05, each look uses z = 2.41. This makes early stopping easier than O'Brien-Fleming but requires a larger total sample size (typically 15-30% more) to maintain equivalent power. The Pocock approach is appropriate when you believe the effect could be very large and want maximum early stopping flexibility.
The alpha spending function generalizes these approaches. The O'Brien-Fleming spending function is alpha(t) = 2 - 2 * Phi(z_{alpha/2} / sqrt(t)), where t is the information fraction (proportion of planned sample collected) and Phi is the standard normal CDF. The Pocock spending function is alpha(t) = alpha * ln(1 + (e - 1) * t). The Lan-DeMets approach allows you to define any spending function and derive the appropriate boundaries, even when the interim analyses are not equally spaced.
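Both spending functions can be evaluated directly. This sketch only computes the cumulative alpha spent at a given information fraction t in (0, 1]; turning that into critical values at each look is a further numerical step that is not shown. Function names are assumptions.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative alpha at information fraction t."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return 2 - 2 * NormalDist().cdf(z / sqrt(t))


def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: cumulative alpha at information fraction t."""
    return alpha * log(1 + (e - 1) * t)


for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t={t:.1f}  OBF={obf_spending(t):.5f}  Pocock={pocock_spending(t):.5f}")
```

Both functions reach the full alpha at t = 1, but the O'Brien-Fleming form spends almost nothing at t = 0.2 while the Pocock form has already spent more than a quarter of the budget.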
Calendar Planning for A/B Tests
Beyond the raw number of days, practical test planning requires calendar awareness. Avoid starting tests on holidays or during promotional periods that alter normal behavior. If your business has strong seasonal patterns (e-commerce around Black Friday, tax software in March), tests conducted during atypical periods may not generalize. The calendar view in this tool shows your recommended run period starting from tomorrow, aligned to full weeks, so you can immediately identify whether the test window overlaps with any known disruptions.
For multi-variant tests (A/B/C/D), the traffic is split across more variants, so each variant receives fewer daily visitors and the test takes proportionally longer. With 4 variants and fixed total daily traffic, the test takes twice as long as a simple A/B test. This is one reason to be judicious about the number of variants: each additional variant significantly extends the required duration. If you must test many variants, consider a multi-armed bandit approach or a preliminary screening phase followed by a confirmatory A/B test of the top candidates.
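The scaling can be checked with a short calculation. The sketch assumes an even traffic split and the same per-variant sample size regardless of the number of variants (no multiple-comparison correction, which would lengthen the test further). The function name and the example figures are hypothetical.

```python
from math import ceil


def days_needed(n_per_variant, total_daily_visitors, variants):
    """Days to reach n_per_variant when traffic is split evenly across variants."""
    daily_per_variant = total_daily_visitors / variants
    return ceil(n_per_variant / daily_per_variant)


ab_days = days_needed(30000, total_daily_visitors=4000, variants=2)
abcd_days = days_needed(30000, total_daily_visitors=4000, variants=4)
print(ab_days, abcd_days)  # 15 days for A/B, 30 days for A/B/C/D
```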
Interpreting the Results
The sample size calculator gives you a minimum requirement, not a sufficient one. Several factors can change the actual duration needed. First, real-world traffic fluctuates: weekends and holidays can reduce daily visitors, extending the calendar time to reach the planned sample. Second, if users return multiple times during the test, you have fewer independent observations than raw pageviews suggest. Third, the sample size formula assumes a two-sided test; a one-sided test of a directional hypothesis requires a smaller sample, which is the one adjustment here that shortens rather than lengthens the test. Fourth, if you are running multiple tests simultaneously on the same traffic, the interference between tests can bias results.
A useful rule of thumb: take the calculator's minimum duration, round up to the next full week, and add one additional week as a buffer. If the result is fewer than 14 days, run for at least 14 days regardless. Two full weeks capture two complete weekly cycles and provide enough data to detect day-of-week interactions. If your sample size requirement is very large (exceeding three months of traffic), reconsider whether the MDE is realistic or whether you should focus on larger changes that are easier to detect.
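The rule of thumb translates into a few lines of arithmetic. The function and parameter names are assumptions, and "round up to the next full week" is read here as rounding up to a whole number of weeks.

```python
from math import ceil


def recommended_run_days(minimum_days, buffer_weeks=1, floor_days=14):
    """Round up to full weeks, add a buffer, and enforce a minimum run length."""
    full_weeks = ceil(minimum_days / 7)
    planned_days = (full_weeks + buffer_weeks) * 7
    return max(planned_days, floor_days)


print(recommended_run_days(5))   # 14: one week plus one buffer week
print(recommended_run_days(16))  # 28: three weeks plus one buffer week
```

With the default one-week buffer the 14-day floor is always met; it only takes effect if the buffer is set to zero.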
Frequently Asked Questions
How long should I run my A/B test?
The minimum duration depends on your daily traffic, baseline conversion rate, minimum detectable effect, and desired power. Compute the per-variant sample size n = (z_{alpha/2} + z_{beta})^2 * (p1*(1-p1) + p2*(1-p2)) / delta^2, where delta = p2 - p1 is the absolute difference you want to detect, then divide n by the daily visitors per variant to get the number of days. Always round up to full weeks to capture day-of-week effects.
What is early stopping in A/B testing and why is it dangerous?
Early stopping means ending a test before the planned sample size because interim results look significant. This inflates false positive rates far beyond the nominal alpha. With daily peeking at alpha = 0.05 over 30 days, the actual false positive rate can exceed 20%. Use sequential testing boundaries (O'Brien-Fleming or Pocock) for principled early stopping.
What is an alpha spending function?
An alpha spending function controls how the total Type I error budget is distributed across interim analyses. O'Brien-Fleming spends very little alpha early, preserving power. Pocock distributes alpha evenly. The Lan-DeMets approach generalizes this with a continuous function alpha(t) where t is the information fraction.
How does peeking at A/B test results inflate false positives?
Each peek gives the test statistic's random walk another chance to cross the significance boundary by chance. With 10 peeks at alpha = 0.05, the actual false positive rate rises well above the nominal 5% (roughly 19% with equally spaced looks) and is bounded above by about 40%. The peeking penalty calculator quantifies this inflation and shows the corrected alpha you need per peek.
Should I use O'Brien-Fleming or Pocock boundaries for my sequential test?
O'Brien-Fleming is preferred for most A/B tests. It preserves nearly all power of a fixed-sample test and the final critical value is close to 1.96. Pocock uses equal boundaries, making early stopping easier but requiring 15-30% more total samples. Choose Pocock only when you strongly expect a large effect.
Related A/B Testing Tools
- Sample Size Calculator — Quick sample size estimation for A/B tests
- Conversion Rate Calculator — Conversion rates with confidence intervals
- Power Analysis Calculator — Determine required sample size with power curves
- How Long to Run an A/B Test — Deep-dive guide on test duration planning
About the Author
Built by Michael Lip — Solo developer with 10+ years experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ Chrome extensions and the Zovo developer tools collection.