Sequential Testing Tool

Generate O'Brien-Fleming stopping boundaries for your A/B test. Enter your test parameters and optionally paste daily results to see whether you can stop today. All computation is client-side — no data leaves your browser.

Test Parameters

Live Results — Enter Cumulative Data Per Day

Enter cumulative conversions and visitors for each day since test start. Leave blank for days not yet reached. The chart will overlay your actual test statistic vs the stopping boundaries.

Test Plan Summary

Required sample size (per variant)
Estimated test duration
Interim look schedule
Inflation factor (vs fixed horizon)
Adjusted α (spending at each look)

Sequential Boundaries: Z-Statistic vs O'Brien-Fleming Thresholds

Day-by-Day Stopping Schedule

Stop the test on a given day only if your cumulative Z-statistic exceeds the upper boundary (variant wins) or falls below the lower boundary (variant clearly inferior). Otherwise continue.

What Is Sequential Testing and Why Does It Matter?

In traditional fixed-horizon A/B testing, you decide the sample size in advance and look at the results exactly once, after collecting all the data. This is methodologically clean but impractical: most teams peek at results continuously, which inflates the type I error rate far above the nominal α. Applying a p < 0.05 threshold at every look gives a false positive rate of roughly 14% with 5 looks and about 25% with 20 looks, and it climbs past 30% as the number of looks keeps growing. Sequential testing solves this by adjusting the stopping boundaries based on how many times you plan to look at the data.
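The inflation from peeking can be checked with a small simulation. The sketch below is illustrative only and is not the tool's own code: it assumes equally sized batches of data between looks, models the cumulative Z-statistic under no true effect as a scaled sum of independent standard normal increments, and uses a hypothetical function name.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks, alpha=0.05, sims=20000, seed=1):
    """Share of no-effect tests where |Z| exceeds the fixed critical value at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(sims):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Under the null, each batch adds an independent N(0, 1) increment.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / k ** 0.5  # cumulative Z-statistic after k batches
            if abs(z) > z_crit:
                false_positives += 1
                break
    return false_positives / sims


for looks in (1, 5, 20):
    print(f"{looks:>2} looks: false positive rate = {peeking_false_positive_rate(looks):.3f}")
```

With a single look the rate should sit near the nominal 5%, and it should rise steadily as the number of looks increases.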

O'Brien-Fleming boundaries are among the most widely used sequential testing approaches. They apply a stringent threshold early in the test, when little data has been collected, and relax the threshold as the test approaches its planned end. The intuition is sound: claiming significance when you have only 10% of your planned data requires overwhelming evidence (a Z-statistic far above 1.96). As you accumulate more data, the boundary approaches the standard critical value. This means you preserve most of the statistical power of the fixed-horizon test while allowing early stopping when the evidence is compelling.

O'Brien-Fleming boundaries can be computed directly or through the Lan-DeMets alpha-spending approach, in which the cumulative alpha spent at each interim look is a function of the information fraction (data collected so far divided by the planned total). The boundary at information fraction t is approximately b(t) = zα/2 / √t, where zα/2 is the standard normal critical value (1.96 for a two-sided α of 0.05). This form is a close approximation: exact designs replace zα/2 with a slightly larger constant that depends on the number of looks, chosen so that the overall type I error across all looks equals the target α.
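As a minimal sketch of the approximate formula b(t) = zα/2 / √t, the snippet below evaluates the boundary at five equally spaced information fractions. The function name is hypothetical, and the result is the approximation described above, not an exact group-sequential design.

```python
from statistics import NormalDist


def obf_boundary(t, alpha=0.05):
    """Approximate two-sided O'Brien-Fleming boundary at information fraction t."""
    if not 0 < t <= 1:
        raise ValueError("information fraction must be in (0, 1]")
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return z_crit / t ** 0.5


for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t = {t:.1f}  stop if |Z| > {obf_boundary(t):.3f}")
```

The threshold is above 4 at 20% of the data and falls to the usual critical value at the final look.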

How to Use Sequential Boundaries in Practice

Once you generate the boundary schedule, use it as your decision framework for the entire test. Before launching, print or save the table. Each day (or at each planned interim look), compute your cumulative test statistic: Z = (p̂B − p̂A) / SE, where SE = √(p̂(1−p̂)(1/nA + 1/nB)) and p̂ is the pooled proportion. If |Z| exceeds the day's boundary, you can stop the test.
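The cumulative Z-statistic above can be computed from four numbers per look. This is a minimal sketch of that formula with hypothetical argument names (cumulative conversions and visitors for control A and variant B); the counts in the usage line are made up for illustration.

```python
from math import sqrt


def pooled_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion Z-statistic with a pooled standard error."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; no conversions or all conversions")
    return (p_b - p_a) / se


# Illustrative cumulative counts: 100/1000 for A, 130/1000 for B.
print(f"Z = {pooled_z(100, 1000, 130, 1000):.2f}")
```

A positive value favors variant B and a negative value favors control A.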

If the test statistic crosses the upper boundary (positive side), variant B is significantly better: stop and ship. If it crosses the lower boundary (negative side of the two-sided test), variant A is significantly better: stop and keep control. If neither boundary has been crossed before the final look, apply the final look's boundary, which is close to the standard zα/2, as the last test. Crossing it rejects the null hypothesis; otherwise the test ends without a significant result.
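The decision rule can be written as a small function. This is a sketch with hypothetical names and return strings, assuming a symmetric two-sided boundary and that the caller flags the final look.

```python
def decide(z, boundary, is_final_look=False):
    """Map a cumulative Z-statistic and the look's boundary to an action."""
    if boundary <= 0:
        raise ValueError("boundary must be positive")
    if z > boundary:
        return "stop: ship variant B"
    if z < -boundary:
        return "stop: keep control A"
    if is_final_look:
        return "stop: no significant difference"
    return "continue"


print(decide(2.10, 4.38))                      # early look, evidence not yet strong enough
print(decide(2.10, 1.96, is_final_look=True))  # final look
```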

The key discipline: only check results at the pre-specified look times. If you scheduled 5 looks at days 7, 14, 21, 28, and 35, do not check on days 10, 12, and 15 just because you are curious. Ad hoc peeking undermines the statistical validity even with O'Brien-Fleming boundaries, because the boundaries are computed for the specific schedule you chose, not for arbitrary peeks.

Sample Size Inflation With Sequential Testing

Sequential testing requires a slightly larger maximum sample than fixed-horizon testing because the boundaries are more conservative early on. The inflation factor depends on the number of planned looks. For exact O'Brien-Fleming designs at α = 0.05 and 80% to 90% power, the required maximum sample with 5 looks is approximately 2% to 3% larger than the fixed-horizon sample. With 10 looks, the inflation is roughly 3% to 4%. This is a very modest cost for the benefit of potentially stopping weeks early if the effect is strong.
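To make the arithmetic concrete, the sketch below applies an inflation factor to a fixed-horizon sample size. Everything here is an assumption for illustration: the fixed-horizon formula is the common unpooled two-proportion approximation (the tool may use a different one), the function names are hypothetical, and the 1.03 factor and 10% to 11% conversion rates are example inputs.

```python
from math import ceil
from statistics import NormalDist


def fixed_horizon_n(p_control, p_variant, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion test (textbook approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p_variant - p_control) ** 2


def sequential_max_n(n_fixed, inflation_factor):
    """Maximum per-variant sample after applying the sequential inflation factor."""
    return ceil(n_fixed * inflation_factor)


n_fixed = fixed_horizon_n(0.10, 0.11)  # assumed baseline 10%, variant 11%
print("fixed horizon:", ceil(n_fixed))
print("sequential maximum (factor 1.03, assumed):", sequential_max_n(n_fixed, 1.03))
```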

The expected sample size under the alternative hypothesis (when a true effect at least as large as the MDE exists) is substantially smaller than the maximum, because the test often stops early when the effect is real. For a 5-look O'Brien-Fleming design, the expected sample given a true effect of 1.5× the MDE is roughly 50% to 70% of the maximum. This is the practical efficiency gain: when your variant is meaningfully better, you stop much sooner on average than a fixed-horizon test would require.
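The expected stopping point can be estimated by simulation. The sketch below is a rough model, not the tool's method: it assumes 5 equally spaced looks, uses the tabulated O'Brien-Fleming constant 2.040 for two-sided α = 0.05 (an assumption taken from standard group-sequential tables), and describes the true effect by a hypothetical `drift` parameter, the expected Z-statistic at the final look. A drift of about 2.8 corresponds to an effect equal to the MDE at 80% power, ignoring the small sample inflation.

```python
import random


def expected_sample_fraction(drift, looks=5, c=2.040, sims=20000, seed=7):
    """Average fraction of the maximum sample used before the test stops."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(sims):
        b = 0.0  # B(t) = Z(t) * sqrt(t), which has independent increments
        for k in range(1, looks + 1):
            t = k / looks
            # Each look adds an increment with mean drift/looks and variance 1/looks.
            b += rng.gauss(drift / looks, (1 / looks) ** 0.5)
            z = b / t ** 0.5
            if abs(z) >= c / t ** 0.5 or k == looks:
                total += t
                break
    return total / sims


for label, drift in (("no effect", 0.0), ("effect = MDE", 2.8), ("effect = 1.5 x MDE", 4.2)):
    print(f"{label:<20} expected fraction of max sample = {expected_sample_fraction(drift):.2f}")
```

With no effect the test almost always runs to the end, and the average stopping point should move earlier as the true effect grows.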

Futility Stopping

Beyond the efficacy boundaries (stop because the variant is clearly better or worse), some sequential designs also include futility boundaries: stop early because the effect is so small that reaching significance by the planned end date is very unlikely. Futility stopping does not control the type I error rate the way efficacy stopping does. It is a practical decision to reallocate resources when a test is clearly going nowhere. The tool currently displays only efficacy boundaries. You can still read a rough futility signal off the chart: if the test statistic is still close to zero halfway through, crossing a boundary by the planned end is unlikely.

Why O'Brien-Fleming Is Preferred Over Other Spending Functions

Several alternatives exist, including Pocock boundaries and the Kim-DeMets power family of spending functions, both usable within the Lan-DeMets alpha-spending framework. Pocock boundaries are constant across looks (the same threshold at every interim point), which sounds appealing but forces the final analysis to clear a critical value well above 1.96 (about 2.41 with 5 looks), costing substantial power. O'Brien-Fleming is preferred because it is very conservative early (when you should not be stopping on noise) and near-standard at the final look (preserving most power). The final-look O'Brien-Fleming critical value at α = 0.05 with 5 looks is approximately 2.04 versus 1.960 for fixed-horizon, a small cost.
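The contrast between the two boundary shapes is easy to tabulate. The sketch below assumes 5 equally spaced looks and uses the constants commonly tabulated for two-sided α = 0.05 in group-sequential references (2.040 for O'Brien-Fleming, 2.413 for Pocock); these values and the function names are assumptions for illustration and are not taken from the tool.

```python
from math import sqrt

LOOKS = 5
OBF_CONSTANT = 2.040     # assumed tabulated value, two-sided alpha = 0.05, 5 looks
POCOCK_CONSTANT = 2.413  # assumed tabulated value, two-sided alpha = 0.05, 5 looks


def obf_thresholds(looks=LOOKS, c=OBF_CONSTANT):
    """O'Brien-Fleming: threshold shrinks as c / sqrt(information fraction)."""
    return [c / sqrt(k / looks) for k in range(1, looks + 1)]


def pocock_thresholds(looks=LOOKS, c=POCOCK_CONSTANT):
    """Pocock: the same threshold at every look."""
    return [c] * looks


print("look  O'Brien-Fleming  Pocock")
for k, (obf, poc) in enumerate(zip(obf_thresholds(), pocock_thresholds()), start=1):
    print(f"{k:>4}  {obf:>15.3f}  {poc:>6.3f}")
```

Pocock is easier to cross early, while O'Brien-Fleming leaves the final threshold much closer to 1.96.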

Frequently Asked Questions

Why can't I just check results every day and stop when p < 0.05?

Checking a standard p < 0.05 test repeatedly inflates your false positive rate far above 5%. Simulations show that peeking at results daily for 20 days produces a false positive on roughly 25% of tests even when there is no real effect. Sequential boundaries like O'Brien-Fleming hold the overall false positive rate at the target α by requiring stronger evidence early in the test, when the estimate is noisiest.

What is an information fraction?

Information fraction is the ratio of data collected so far to the planned maximum sample. At day 7 of a 35-day test with equal daily traffic, information fraction = 7/35 = 0.20. The O'Brien-Fleming boundary at each look scales with 1/√t, where t is the information fraction, so earlier looks require more extreme Z-statistics to trigger stopping, because less data means more uncertainty about the true effect.
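A short sketch of the calculation, using the 35-day example above. The function name is hypothetical; when daily traffic is uneven, passing cumulative visitors and the planned maximum sample in place of days gives the same ratio.

```python
def information_fraction(collected, planned_max):
    """Fraction of the planned maximum data collected so far, capped at 1."""
    if planned_max <= 0:
        raise ValueError("planned_max must be positive")
    return min(collected / planned_max, 1.0)


look_days = [7, 14, 21, 28, 35]
for day in look_days:
    t = information_fraction(day, 35)
    # The boundary scales with 1 / sqrt(t) relative to the final-look threshold.
    print(f"day {day:>2}: t = {t:.2f}, boundary multiplier = {1 / t ** 0.5:.3f}")
```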

Does sequential testing require more total data?

The maximum sample is slightly larger (roughly 1–4% depending on the number of looks). However, when a true effect exists, the expected sample size is substantially smaller because the test often stops early. If you want to be able to stop at day 7 of a 35-day test, you need to plan for a slightly larger maximum than a fixed-horizon design, but you will on average stop well before that maximum when your variant is genuinely better.

What happens if I don't stop at a look point?

Not stopping when the boundary is crossed is fine statistically — you simply continue the test. The error-rate guarantees still hold. What you should not do is make a business decision (ship the variant) without formally stopping the test, then continue collecting data. If you cross the boundary and decide to ship, record that day as the stopping point. Do not keep running the test hoping for more data.

Can I use this for non-conversion metrics like revenue per visitor?

O'Brien-Fleming boundaries apply to any test statistic that asymptotically follows a normal distribution — including t-tests for continuous metrics like revenue per visitor. The Z-statistic you compute must correspond to the same type of test (t-statistic or Z-statistic depending on sample size and variance assumptions). The boundaries generated here are on the Z-scale, which is appropriate for large samples where the t and Z distributions converge.
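For a continuous metric such as revenue per visitor, one common large-sample statistic is the difference in means divided by an unpooled (Welch-style) standard error. The sketch below is an assumption about how such a statistic could be computed and is not taken from the tool; the function name and the toy data are hypothetical, and real tests need far larger samples for the normal approximation to hold.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference_z(values_a, values_b):
    """Large-sample Z-statistic for the difference in means (B minus A)."""
    if len(values_a) < 2 or len(values_b) < 2:
        raise ValueError("need at least two observations per variant")
    # Unpooled standard error: each variant contributes its own sample variance.
    se = sqrt(variance(values_a) / len(values_a) + variance(values_b) / len(values_b))
    if se == 0:
        raise ValueError("standard error is zero")
    return (mean(values_b) - mean(values_a)) / se


# Toy revenue-per-visitor values, for illustration only.
revenue_a = [1.0, 2.0, 3.0, 4.0]
revenue_b = [2.0, 3.0, 4.0, 5.0]
print(f"Z = {mean_difference_z(revenue_a, revenue_b):.3f}")
```

The resulting value is compared against the same Z-scale boundaries as the conversion-rate statistic.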

About the Author

Built by Michael Lip — Solo developer with 10+ years experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ Chrome extensions and the Zovo developer tools collection.
