A/B Test Duration Estimator

Q: How long should I run my A/B test?

The minimum duration depends on your daily traffic, baseline conversion rate, minimum detectable effect (MDE), significance level (alpha), and desired statistical power. The formula derives from the sample size requirement n = (z_{alpha/2} + z_{beta})^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2 per variant, divided by your daily traffic per variant. For example, with 1,000 daily visitors split evenly across two variants, a 5% baseline rate, a 10% relative MDE, alpha = 0.05, and 80% power, you need approximately 31,000 visitors per variant, or about 62 days at 500 visitors per variant per day. Always run for at least one full weekly cycle to capture day-of-week effects.

Q: What is early stopping in A/B testing and why is it dangerous?

Early stopping means ending an A/B test before the predetermined sample size is reached, typically because the results look significant at an interim analysis. This is dangerous because it inflates the false positive rate far beyond the nominal alpha level. If you check your test 10 times at equally spaced intervals at alpha = 0.05, the cumulative false positive rate is roughly 19%, and it keeps climbing with further checks. Sequential testing methods such as O'Brien-Fleming or Pocock boundaries provide a principled framework for early stopping: they control the overall Type I error rate by applying stricter-than-nominal significance thresholds at the interim analyses.

Q: How does peeking at A/B test results inflate false positives?

Each time you check a running test for significance, you create an additional opportunity for a random fluctuation to cross the significance threshold. Under the null hypothesis, the test statistic behaves like a random walk, and the probability that it ever crosses a fixed boundary increases with the number of checks. With daily peeking over 30 days at alpha = 0.05, the true false positive rate can exceed 25%, depending on the traffic pattern. The peeking penalty calculator quantifies this inflation so you can see how much your effective alpha increases with each additional look.

Q: Should I use O'Brien-Fleming or Pocock boundaries for my sequential test?

O'Brien-Fleming boundaries are preferred for most A/B testing scenarios because they preserve nearly all the statistical power of a fixed-sample test while still allowing early stopping for very large effects. The final-analysis critical value is close to 1.96, so you lose almost nothing if the test runs to completion. Pocock boundaries use equal critical values at each look, making early stopping easier but requiring a 15-30% larger total sample size to maintain the same power. Choose Pocock only when you strongly expect a large effect and want maximum flexibility to stop early. For routine optimization tests with expected small effects, O'Brien-Fleming is almost always the better choice.

Calculate the minimum runtime for your A/B test based on daily traffic, baseline conversion rate, minimum detectable effect, and statistical power. Includes early stopping analysis with sequential testing boundaries, alpha spending functions, and a peeking penalty calculator that shows how checking results early inflates your false positive rate. All computation runs in your browser.

Test Parameters

Daily Visitors (total across all variants)

Baseline Conversion Rate (%)

Minimum Detectable Effect (% relative)

Number of Variants (including control)

Significance Level (α)

Statistical Power (1 - β)

Minimum Test Duration

Required Sample Size (per variant)

Total Sample Size (all variants)

Minimum Duration

Recommended Duration (full weeks)

Absolute Effect Size

Detectable Range

Recommended Run Period

Green days show your minimum test window. Start date is tomorrow. Full-week alignment is recommended.

Sequential Testing Boundaries (Early Stopping)

Critical z-values at each interim look. A test statistic exceeding the boundary allows early stopping while controlling the overall false positive rate.

Sequential Boundary Plot

Peeking Penalty Calculator

See how checking your test results early inflates the false positive rate above the nominal α.

Number of Times You Peek

Nominal α Per Peek

Alpha Spending Function

How the total α budget is distributed across interim analyses under each spending approach.

O'Brien-Fleming spends almost no alpha early and nearly all at the final look, preserving maximum power. Pocock spends alpha more evenly, enabling earlier stopping but requiring a larger total sample. A linear spending function, alpha(t) = alpha * t, used within the Lan-DeMets framework, provides a middle ground.

How A/B Test Duration Is Calculated

The minimum duration for an A/B test is determined by the required sample size, which depends on four key inputs: the baseline conversion rate, the minimum detectable effect (MDE), the significance level (alpha), and the statistical power (1 minus beta). The sample size formula for a two-proportion z-test is n = (z_{alpha/2} + z_{beta})^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2 per variant, where p1 is the baseline rate, p2 is p1 * (1 + MDE/100) for a relative MDE, and the z-values correspond to the standard normal quantiles for the chosen alpha and power levels.
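As a rough illustration, the formula above can be evaluated with Python's standard library. This is a minimal sketch, not the tool's actual implementation; the function name `sample_size_per_variant` and its parameter names are assumptions made for the example.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion z-test (hypothetical helper)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # relative MDE, e.g. 0.10 for 10%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{beta}, the quantile for the chosen power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# 5% baseline, 10% relative MDE, alpha = 0.05, 80% power
print(sample_size_per_variant(baseline=0.05, relative_mde=0.10))
```

With these inputs the result lands a little above 31,000 visitors per variant.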

Once you have the per-variant sample size, dividing by the daily traffic per variant gives the minimum number of days. However, this raw number should always be rounded up to complete full weeks. Day-of-week effects are real and substantial in most businesses: weekend traffic often differs dramatically from weekday traffic in both volume and behavior. A test that runs from Monday to Friday captures a biased sample. Always run in full seven-day increments to capture the complete weekly cycle, and consider extending to two or three full weeks if your traffic has strong weekly seasonality.
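The conversion from sample size to calendar time can be sketched as follows. The helper name `duration_days` is hypothetical, and the example assumes traffic is split evenly across variants and stays constant from day to day.

```python
import math

def duration_days(n_per_variant, daily_visitors_total, variants=2):
    """Return (minimum days, days rounded up to full weeks). Hypothetical helper."""
    per_variant_daily = daily_visitors_total / variants  # assumes an even split
    min_days = math.ceil(n_per_variant / per_variant_daily)
    full_week_days = math.ceil(min_days / 7) * 7         # round up to whole weeks
    return min_days, full_week_days

print(duration_days(31_000, 1_000))  # (62, 63)
```

Here 62 days of raw runtime becomes 63 days, or nine full weeks, once the weekly cycle is respected.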

The Danger of Peeking at Results

One of the most common mistakes in A/B testing is repeatedly checking whether results are significant before the planned sample size is reached. This practice, known as peeking or optional stopping, dramatically inflates the false positive rate. The intuition is straightforward: under the null hypothesis, the z-statistic follows a random walk. Given enough time, a random walk will eventually cross any fixed boundary. Each peek gives the random walk another chance to cross the significance threshold by chance.

The mathematics confirm this intuition. If you peek k times during a test at a nominal alpha of 0.05, the actual false positive rate is bounded above by 1 - (1 - alpha)^k, the rate you would see if every check were independent (the cruder union bound is k * alpha). For 10 peeks, 1 - (1 - alpha)^k is approximately 0.40, eight times the nominal rate. In practice, the inflation is less severe because the peeks are correlated (the z-statistic does not reset between checks): with equally spaced looks, the effective alpha is about 0.14 after 5 peeks and about 0.19 after 10, and daily peeking over a 30-day test pushes it above 0.25. The exact figure depends on the traffic accumulation pattern.
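The effect of correlated peeks can be estimated with a small Monte Carlo simulation. The sketch below is not the tool's code: it assumes equally sized batches of data between looks, models each batch under the null as a standard normal increment, and uses a hypothetical function name and a fixed seed for repeatability.

```python
import math
import random

def peeking_false_positive_rate(looks, z_crit=1.96, sims=20_000, seed=0):
    """Share of null experiments flagged 'significant' at any of `looks` equally spaced checks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        total = 0.0
        for k in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)            # one batch of data under the null
            if abs(total / math.sqrt(k)) > z_crit:  # z-statistic on all data so far
                hits += 1
                break                               # the experimenter stops at the first "win"
    return hits / sims

for looks in (1, 5, 10):
    print(looks, round(peeking_false_positive_rate(looks), 3))
```

The estimated rate should sit near the nominal 0.05 for a single look and rise with each additional look, while staying below the independence bound 1 - (1 - alpha)^k.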

The peeking penalty calculator in this tool computes both the union bound (worst case) and the Sidak formula (which assumes independent checks and therefore still overstates the inflation for correlated peeks). It also shows the Bonferroni-corrected alpha per peek that you would need to use at each check in order to maintain the desired overall false positive rate. For most practical purposes, the solution is simpler: determine your sample size in advance, commit to the full test duration, and only analyze results after the planned sample is collected.
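The bounds and per-peek corrections mentioned above are simple to compute by hand. In this sketch the union (Bonferroni) bound is taken as k * alpha and the Sidak formula as 1 - (1 - alpha)^k, which are the standard textbook definitions; the tool's internals may differ, and the function and key names are hypothetical.

```python
def peeking_bounds(k, alpha=0.05):
    """Upper bounds on inflated alpha and per-peek corrected thresholds (hypothetical helper)."""
    return {
        "union_bound": min(1.0, k * alpha),              # worst case, no assumptions
        "sidak_bound": 1 - (1 - alpha) ** k,             # assumes independent checks
        "bonferroni_alpha_per_peek": alpha / k,          # threshold to use at each peek
        "sidak_alpha_per_peek": 1 - (1 - alpha) ** (1 / k),
    }

for name, value in peeking_bounds(10).items():
    print(f"{name}: {value:.4f}")
```

For 10 peeks at alpha = 0.05 the union bound is 0.50, the Sidak bound is about 0.40, and the Bonferroni-corrected threshold is 0.005 per peek.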

Sequential Testing: Principled Early Stopping

Sequential testing methods provide a mathematically rigorous framework for looking at your data during the test without inflating the false positive rate. The core idea is to replace the single fixed threshold with a set of adjusted significance boundaries, one per interim analysis, chosen so that the total alpha spent across all looks stays within budget. The two most common approaches are the O'Brien-Fleming and Pocock group sequential boundaries.

O'Brien-Fleming boundaries start extremely conservative and become progressively more lenient. At the first interim look (say, 20% of the planned sample), the critical z-value might be 4.56, making early stopping virtually impossible unless the effect is enormous. By the final analysis, the critical z-value is approximately 2.04 for a 5-look design, barely more than the standard 1.96. This means you lose almost no power compared to a fixed-sample test. The O'Brien-Fleming approach is ideal when you expect moderate effects and want the option to stop early only for surprisingly large effects.
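The shape of these boundaries can be reproduced from the final critical value, since the classic O'Brien-Fleming boundary at look k of K is the final value divided by sqrt(k / K). The sketch below takes the final value of 2.04 from the text as given; deriving that constant requires numerical integration that is not shown, and the function name is hypothetical.

```python
import math

def obrien_fleming_boundaries(looks, final_z):
    """Critical z at each equally spaced look: final_z / sqrt(k / looks)."""
    return [final_z / math.sqrt(k / looks) for k in range(1, looks + 1)]

for k, z in enumerate(obrien_fleming_boundaries(5, 2.04), start=1):
    print(f"look {k}: z = {z:.2f}")
# look 1: z = 4.56
# look 2: z = 3.23
# look 3: z = 2.63
# look 4: z = 2.28
# look 5: z = 2.04
```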

Pocock boundaries use the same critical z-value at every interim look. For a 5-look design at alpha = 0.05, each look uses z = 2.41. This makes early stopping easier than O'Brien-Fleming but requires a larger total sample size (typically 15-30% more) to maintain equivalent power. The Pocock approach is appropriate when you believe the effect could be very large and want maximum early stopping flexibility.

The alpha spending function generalizes these approaches. The O'Brien-Fleming spending function is alpha(t) = 2 - 2 * Phi(z_{alpha/2} / sqrt(t)), where t is the information fraction (proportion of planned sample collected) and Phi is the standard normal CDF. The Pocock spending function is alpha(t) = alpha * ln(1 + (e - 1) * t). The Lan-DeMets approach allows you to define any spending function and derive the appropriate boundaries, even when the interim analyses are not equally spaced.
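Both spending functions can be evaluated directly from the formulas above. This is an illustrative sketch with hypothetical function names; it reports cumulative alpha spent at each information fraction and does not derive the corresponding boundaries.

```python
import math
from statistics import NormalDist

def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: 2 - 2 * Phi(z_{alpha/2} / sqrt(t)), for 0 < t <= 1."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 - 2 * NormalDist().cdf(z / math.sqrt(t))

def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t), for 0 <= t <= 1."""
    return alpha * math.log(1 + (math.e - 1) * t)

for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t={t:.1f}  OBF={obf_spending(t):.5f}  Pocock={pocock_spending(t):.5f}")
```

Both functions reach the full alpha at t = 1, but at t = 0.2 the O'Brien-Fleming function has spent almost nothing while the Pocock function has already spent more than a quarter of the budget.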

Calendar Planning for A/B Tests

Beyond the raw number of days, practical test planning requires calendar awareness. Avoid starting tests on holidays or during promotional periods that alter normal behavior. If your business has strong seasonal patterns (e-commerce around Black Friday, tax software in March), tests conducted during atypical periods may not generalize. The calendar view in this tool shows your recommended run period starting from tomorrow, aligned to full weeks, so you can immediately identify whether the test window overlaps with any known disruptions.

For multi-variant tests (A/B/C/D), the traffic is split across more variants, so each variant receives fewer daily visitors and the test takes proportionally longer. With 4 variants and fixed total daily traffic, the test takes twice as long as a simple A/B test. This is one reason to be judicious about the number of variants: each additional variant significantly extends the required duration. If you must test many variants, consider a multi-armed bandit approach or a preliminary screening phase followed by a confirmatory A/B test of the top candidates.
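The scaling with the number of variants can be checked with a few lines of arithmetic. This sketch holds the per-variant sample size fixed and assumes an even traffic split; it does not apply any multiple-comparison correction, which would lengthen the test further. The function name is hypothetical.

```python
import math

def days_needed(n_per_variant, daily_visitors_total, variants):
    """Days to reach n_per_variant when traffic is split evenly across variants."""
    return math.ceil(n_per_variant * variants / daily_visitors_total)

ab = days_needed(31_000, 1_000, variants=2)
abcd = days_needed(31_000, 1_000, variants=4)
print(ab, abcd)  # 62 124
```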

Interpreting the Results

The sample size calculator gives you a minimum requirement, not a sufficient one. Several factors can change the duration you actually need or the reliability of the result. First, real-world traffic fluctuates: weekends and holidays reduce daily visitors, extending the calendar time to reach the planned sample. Second, if users return multiple times during the test, you have fewer independent observations than raw pageviews suggest. Third, the sample size formula assumes a two-sided test; if you are testing a directional hypothesis, a one-sided test requires a smaller sample, which shortens the test. Fourth, if you are running multiple tests simultaneously on the same traffic, interference between tests can bias results regardless of how long you run.

A useful rule of thumb: take the calculator's minimum duration, round up to the next full week, and add one additional week as a buffer. If the result is fewer than 14 days, run for at least 14 days regardless. Two full weeks captures two complete weekly cycles and provides enough data to detect day-of-week interactions. If your sample size requirement is very large (exceeding three months of traffic), reconsider whether the MDE is realistic or whether you should focus on larger changes that are easier to detect.
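The rule of thumb translates into a short calculation. The sketch below is one reading of it, with a hypothetical function name: a minimum that is already an exact multiple of seven days is treated as a full week, then one buffer week is added and a 14-day floor is applied.

```python
import math

def recommended_days(min_days):
    """Round up to full weeks, add one buffer week, and never go below 14 days."""
    weeks = math.ceil(min_days / 7) + 1
    return max(weeks * 7, 14)

print(recommended_days(3))   # 14
print(recommended_days(14))  # 21
print(recommended_days(62))  # 70
```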

Last updated: May 25, 2026

Frequently Asked Questions

How long should I run my A/B test?

The minimum duration depends on your daily traffic, baseline conversion rate, minimum detectable effect, significance level, and desired power. Use the formula n = (z_{alpha/2} + z_{beta})^2 * (p1*(1-p1) + p2*(1-p2)) / delta^2 per variant, where delta = p2 - p1 is the absolute difference you want to detect, then divide by daily visitors per variant. Always round up to full weeks to capture day-of-week effects.

What is early stopping in A/B testing and why is it dangerous?

Early stopping means ending a test before the planned sample size because interim results look significant. This inflates false positive rates far beyond the nominal alpha. With daily peeking at alpha = 0.05 over 30 days, the actual false positive rate can exceed 20%. Use sequential testing boundaries (O'Brien-Fleming or Pocock) for principled early stopping.

What is an alpha spending function?

An alpha spending function controls how the total Type I error budget is distributed across interim analyses. O'Brien-Fleming spends very little alpha early, preserving power. Pocock distributes alpha more evenly across looks. The Lan-DeMets approach generalizes this with a continuous function alpha(t), where t is the information fraction.

How does peeking at A/B test results inflate false positives?

Each peek gives the test statistic's random walk another chance to cross the significance boundary by chance. With 10 equally spaced peeks at alpha = 0.05, the actual false positive rate is about 19%, against an upper bound of about 40% if the peeks were independent. The peeking penalty calculator quantifies this inflation and shows the corrected alpha you need per peek.

Should I use O'Brien-Fleming or Pocock boundaries for my sequential test?

O'Brien-Fleming is preferred for most A/B tests. It preserves nearly all power of a fixed-sample test and the final critical value is close to 1.96. Pocock uses equal boundaries, making early stopping easier but requiring 15-30% more total samples. Choose Pocock only when you strongly expect a large effect.

Related A/B Testing Tools

Sample Size Calculator — Quick sample size estimation for A/B tests
Conversion Rate Calculator — Conversion rates with confidence intervals
Power Analysis Calculator — Determine required sample size with power curves
How Long to Run an A/B Test — Deep-dive guide on test duration planning

About the Author

Built by Michael Lip — Solo developer with 10+ years experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ Chrome extensions and the Zovo developer tools collection.