Statistical Power Analysis Calculator
Calculate the required sample size for your A/B test with interactive power curves. Supports one-tailed and two-tailed tests, Cohen's h effect size, and multiple significance levels. All computation runs in your browser.
Understanding Statistical Power Analysis for A/B Testing
Statistical power analysis is the cornerstone of proper experiment design. Before launching any A/B test, you must answer a fundamental question: how many visitors do I need to detect a meaningful difference? Running a test without adequate sample size wastes time and resources, producing inconclusive results that cannot reliably guide decisions. Power analysis provides the mathematical framework to answer this question with precision.
The concept of statistical power was formalized by Jerzy Neyman and Egon Pearson in the 1930s as part of their hypothesis testing framework. Power is defined as 1 minus beta, where beta is the Type II error rate — the probability of failing to reject a false null hypothesis. In A/B testing terms, power is the probability that your test will detect a real improvement in conversion rate when one exists. The industry standard is 80% power, though some organizations use 90% for high-stakes experiments.
Power analysis involves four interconnected parameters: sample size (n), effect size (delta), significance level (alpha), and power (1 - beta). Given any three, you can solve for the fourth. In practice, most teams fix alpha at 0.05, power at 0.80, and the minimum detectable effect (MDE) at whatever is business-relevant, then solve for the required sample size. This calculator automates that process and visualizes the relationship between these parameters through interactive power curves.
The Mathematics Behind the Calculator
The sample size formula for comparing two proportions is derived from the normal approximation to the binomial distribution. For a two-tailed test, the formula is:
n = (Z_{alpha/2} + Z_{beta})^2 * [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2
Where p1 is the baseline conversion rate, p2 is the expected variant rate (p1 * (1 + MDE/100)), Z_{alpha/2} is the critical value for the chosen significance level (1.96 for alpha = 0.05), and Z_{beta} is the critical value for the desired power (0.84 for 80% power). For a one-tailed test, Z_{alpha/2} is replaced with Z_{alpha} (1.645 for alpha = 0.05), which reduces the required sample size by approximately 20%.
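As a minimal sketch of the formula above, the following Python function computes the per-variant sample size. The function name and argument names are illustrative assumptions, not the calculator's actual code, and the z-values come from the standard library's NormalDist rather than a lookup table.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p1, mde_pct, alpha=0.05, power=0.80, two_tailed=True):
    """Visitors needed per variant under the normal approximation."""
    p2 = p1 * (1 + mde_pct / 100)              # expected variant rate
    tail = alpha / 2 if two_tailed else alpha  # split alpha only for two-tailed tests
    z_alpha = NormalDist().inv_cdf(1 - tail)   # about 1.96 for two-tailed alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)       # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Example: 5% baseline, 25% relative MDE
print(sample_size_per_variant(0.05, 25))
print(sample_size_per_variant(0.05, 25, two_tailed=False))
```

The result is rounded up because a fractional visitor cannot be collected; the one-tailed call should return a noticeably smaller number than the two-tailed one.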
The formula reveals several important relationships. Sample size is inversely proportional to the square of the absolute difference, so halving the MDE roughly quadruples the required traffic. For a fixed relative MDE, sample size grows as the baseline conversion rate falls, because the absolute difference p2 - p1 shrinks faster than the per-observation variance p(1-p); for a fixed absolute difference, the requirement peaks near a 50% baseline, where that variance is largest. And sample size is directly proportional to (Z_{alpha/2} + Z_{beta})^2, meaning that increasing power from 80% to 90% adds about 34% more traffic.
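The effect of alpha, power, and the number of tails can be checked by looking only at the multiplier (Z_{alpha/2} + Z_{beta})^2. The sketch below uses a hypothetical helper, z_factor, and compares a few settings against the common default of two-tailed alpha = 0.05 with 80% power.

```python
from statistics import NormalDist


def z_factor(alpha, power, two_tailed=True):
    """The (Z_alpha + Z_beta)^2 multiplier in the sample size formula."""
    tail = alpha / 2 if two_tailed else alpha
    z = NormalDist()
    return (z.inv_cdf(1 - tail) + z.inv_cdf(power)) ** 2


base = z_factor(0.05, 0.80)

# Traffic relative to the baseline setting
print(round(z_factor(0.05, 0.90) / base, 2))                    # 1.34 (90% power)
print(round(z_factor(0.05, 0.80, two_tailed=False) / base, 2))  # 0.79 (one-tailed)
print(round(z_factor(0.01, 0.80) / base, 2))                    # 1.49 (alpha = 0.01)
```

Because everything else in the formula is unchanged, these ratios translate directly into relative traffic requirements.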
Cohen's h: A Better Effect Size Measure
Raw differences between proportions can be misleading because the same absolute difference has different statistical properties at different baseline rates. Cohen's h addresses this by applying the arcsine transformation:
h = 2 * arcsin(sqrt(p2)) - 2 * arcsin(sqrt(p1))
This transformation stabilizes the variance, making h comparable across different baseline rates. Jacob Cohen proposed conventional benchmarks: h = 0.2 is a small effect, h = 0.5 is medium, and h = 0.8 is large. In A/B testing, most experiments target effects at the small end of this scale (h between 0.05 and 0.3), often below Cohen's "small" benchmark.
The practical value of Cohen's h is in comparing experiments across different contexts. A test on a checkout page with a 70% completion rate and a test on a sign-up page with a 3% conversion rate may both have meaningful business impact, but the raw percentage point differences are not directly comparable. Cohen's h normalizes these differences, allowing you to assess whether one experiment has a fundamentally larger effect than another.
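To make the comparison concrete, here is a small sketch of Cohen's h applied to the same one-percentage-point lift at two very different baselines. The function name cohens_h and the example rates are assumptions chosen for illustration.

```python
from math import asin, sqrt


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))


# Same absolute lift (one percentage point) at two baselines
signup = cohens_h(0.03, 0.04)
checkout = cohens_h(0.70, 0.71)

print(f"sign-up  3% -> 4%:   h = {signup:.3f}")
print(f"checkout 70% -> 71%: h = {checkout:.3f}")
```

The sign-up lift yields the larger h, which reflects that a one-point change is a bigger shift relative to the variability at a 3% baseline than at a 70% baseline.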
One-Tailed vs. Two-Tailed Tests
The choice between one-tailed and two-tailed tests affects both the required sample size and the interpretation of results. A two-tailed test detects differences in either direction — it can identify both improvements and degradations. A one-tailed test only detects differences in one pre-specified direction, typically improvement.
Two-tailed tests are the standard recommendation for A/B testing because they protect against shipping changes that actually harm performance. If you run a one-tailed test looking for improvement, you cannot statistically detect a degradation. This is dangerous in practice: many A/B tests reveal that the variant is worse than the control, and this information is valuable for avoiding harmful changes.
The trade-off is sample size. One-tailed tests require approximately 20% fewer visitors to achieve the same power for the same effect size, because the critical value at alpha = 0.05 is 1.645 for a one-tailed test versus 1.96 for a two-tailed test. For teams with severe traffic constraints, a one-tailed test may be acceptable if the business context truly makes one direction irrelevant. For example, when testing whether a price increase reduces conversions, a rise in conversions would not change the decision on its own, so only the downward direction needs to be detected.
Interpreting Power Curves
The power curve shows how statistical power changes as a function of sample size, holding effect size and significance level constant. The curve has a characteristic S-shape: power starts near alpha (the false positive rate) at very small sample sizes, rises steeply through the mid-range, and asymptotically approaches 1.0 at large sample sizes.
The inflection point of the power curve — where it rises most steeply — corresponds to the sample size range where additional data has the highest marginal value. Below this point, the test is severely underpowered and additional data provides modest improvement. Above this point, returns diminish rapidly. The optimal sample size usually falls slightly above the inflection point, at the conventionally chosen 80% power level.
Reading the power chart in this calculator, the vertical dashed line marks your calculated required sample size, and the horizontal dashed line marks your target power level. The intersection of these lines confirms the calculation. You can visually assess how much additional traffic would be needed to reach 90% or 95% power, or how much power you would sacrifice by running with fewer visitors.
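The points on a power curve can be approximated with the same normal approximation used for the sample size formula. The sketch below is one way to compute them, not necessarily how this calculator draws its chart; power_at is a hypothetical helper, and 5,330 is roughly the per-variant sample size the formula gives for a 5% baseline, a 25% relative MDE, and 80% power.

```python
from math import sqrt
from statistics import NormalDist


def power_at(n, p1, p2, alpha=0.05, two_tailed=True):
    """Approximate power with n visitors per variant."""
    z = NormalDist()
    tail = alpha / 2 if two_tailed else alpha
    z_crit = z.inv_cdf(1 - tail)
    # Expected z-statistic when the true rates are p1 and p2
    shift = abs(p2 - p1) * sqrt(n) / sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    power = z.cdf(shift - z_crit)
    if two_tailed:
        # Small chance of rejecting in the opposite direction
        power += z.cdf(-shift - z_crit)
    return power


for n in (500, 2000, 5330, 10000, 20000):
    print(n, round(power_at(n, 0.05, 0.0625), 3))
```

At very small n the two-tailed value sits close to alpha, and at the formula's required sample size it lands close to the 80% target, matching the shape described above.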
Common Pitfalls in Power Analysis
The most frequent mistake is choosing an unrealistically small MDE to justify a short test. If your test can only run for two weeks and you have 500 daily visitors per variant, the achievable sample size is 7,000 per variant. At a 5% baseline, this only gives 80% power to detect a relative MDE of about 22% or larger. Forcing a 5% MDE into this constraint requires about 122,000 visitors per variant by the formula above, which is roughly a 35-week test. Accept the MDE your traffic supports or do not run the test.
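One way to find the MDE your traffic supports is to invert the sample size formula numerically. The sketch below does this with a simple bisection for a two-tailed test; required_n and achievable_mde are hypothetical names, and the approach assumes the same normal approximation as the rest of this page.

```python
from statistics import NormalDist


def required_n(p1, rel_mde, alpha=0.05, power=0.80):
    """Per-variant sample size for a relative MDE given as a fraction (0.25 = 25%)."""
    p2 = p1 * (1 + rel_mde)
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_sum ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def achievable_mde(n_available, p1, alpha=0.05, power=0.80):
    """Smallest relative MDE detectable with n_available visitors per variant."""
    lo, hi = 1e-6, (1 - p1) / p1 * 0.999  # upper bound keeps p2 below 1
    if required_n(p1, hi, alpha, power) > n_available:
        raise ValueError("sample too small for any lift at this baseline")
    for _ in range(100):
        mid = (lo + hi) / 2
        # required_n falls as the MDE grows, so bisection converges
        if required_n(p1, mid, alpha, power) > n_available:
            lo = mid
        else:
            hi = mid
    return hi


mde = achievable_mde(7000, 0.05)
print(f"Achievable relative MDE: {mde:.1%}")
```

Running this before committing to a test duration makes the trade-off explicit: the traffic you have fixes the smallest effect you can reliably detect.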
Another common error is ignoring the multiple comparisons problem. If you test 5 independent metrics simultaneously, your family-wise Type I error rate (the chance of at least one false positive) is approximately 1 - (1 - 0.05)^5 = 22.6%, not 5%. Power analysis should account for the Bonferroni correction or a similar adjustment, which effectively reduces alpha and increases the required sample size. This calculator focuses on single-metric power analysis; adjust your alpha input manually for multiple comparisons.
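The arithmetic behind this can be sketched in a few lines. The helper names are illustrative, and the family-wise calculation assumes the metrics are independent, which real metrics often are not.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test alpha that keeps the family-wise rate at or below alpha."""
    return alpha / m


m = 5
print(round(familywise_error_rate(0.05, m), 3))      # 0.226
adjusted = bonferroni_alpha(0.05, m)
print(round(adjusted, 4))                            # 0.01
print(round(familywise_error_rate(adjusted, m), 3))  # 0.049
```

The adjusted value (0.01 here) is what you would enter as the significance level when sizing a test with five metrics.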
Finally, power analysis assumes fixed sample sizes and no early stopping. If you plan to peek at results before the test concludes, you need sequential testing methods (group sequential designs or always-valid p-values) that adjust the significance threshold over time. Using a fixed-sample power calculation with interim analyses inflates the actual false positive rate substantially.
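A small simulation illustrates the inflation. This is an idealized sketch: it assumes equal-sized batches, no true effect, and a z-statistic that is exactly normal, and it stops at the first look that crosses the fixed-sample threshold. The function name and defaults are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate_with_peeking(looks=5, alpha=0.05, sims=20000, seed=0):
    """Share of null experiments declared significant at any interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(sims):
        running = 0.0
        for k in range(1, looks + 1):
            running += rng.gauss(0.0, 1.0)       # one more batch of data, no real effect
            if abs(running / sqrt(k)) > z_crit:  # z-statistic on all data so far
                hits += 1
                break                            # stop at the first "significant" look
    return hits / sims


print(false_positive_rate_with_peeking(looks=1))  # close to the nominal 0.05
print(false_positive_rate_with_peeking(looks=5))  # well above 0.05
```

With a single look the rate stays near alpha; with five looks at an unadjusted threshold it is several times higher, which is the problem sequential methods are designed to correct.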
Practical Guidelines for Choosing Parameters
For significance level, 0.05 is the industry default. Use 0.01 for high-stakes tests where false positives are costly (pricing changes, core product modifications). Use 0.10 for exploratory tests where you are willing to accept more false positives in exchange for detecting smaller effects with less traffic.
For power, 0.80 is standard. Use 0.90 when the cost of missing a real effect is high — for example, when testing a change that took months to develop and cannot easily be re-tested. Use 0.70 only for early-stage screening tests where you plan to follow up with a confirmatory test.
For MDE, think in terms of business impact. Calculate the smallest improvement that would justify shipping the change, accounting for implementation cost, maintenance burden, and opportunity cost. If a 2% relative lift in conversion rate translates to $500/month in additional revenue but the implementation costs $10,000, the break-even point is 20 months — probably not worth it. Set your MDE at the threshold where the economics become clearly favorable.
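The break-even arithmetic is simple enough to script when comparing several candidate MDEs. A minimal sketch, with a hypothetical helper name and the figures from the example above:

```python
def break_even_months(implementation_cost, monthly_gain):
    """Months of additional revenue needed to recover the implementation cost."""
    return implementation_cost / monthly_gain


print(break_even_months(10_000, 500))  # 20.0
```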
Frequently Asked Questions
What is statistical power in A/B testing?
Statistical power is the probability that your A/B test will correctly detect a real difference between variants when one exists. A power of 80% means that if variant B truly outperforms variant A, your test will detect it 80% of the time. The remaining 20% is the false negative rate (Type II error). Power depends on sample size, effect size, significance level, and metric variance. Use this calculator to determine the sample size needed for your target power.
What is Cohen's h and why does it matter for A/B testing?
Cohen's h is an effect size measure for comparing two proportions, calculated as h = 2 * arcsin(sqrt(p2)) - 2 * arcsin(sqrt(p1)). It normalizes the difference so that effects at different baseline rates are comparable. Values of 0.2, 0.5, and 0.8 represent small, medium, and large effects. Most A/B tests target effects in the 0.05 to 0.3 range.
Should I use a one-tailed or two-tailed test for my A/B test?
Use a two-tailed test in most cases. It detects both improvements and degradations, protecting you from shipping changes that hurt performance. One-tailed tests require about 20% less traffic but cannot detect harmful effects. Only use one-tailed when you genuinely do not care about detecting a negative impact.
Why does power increase with sample size?
Larger samples reduce the standard error of the estimated conversion rate proportionally to 1/sqrt(n), narrowing the sampling distribution and making it easier to distinguish real effects from noise. The relationship follows an S-curve — power rises steeply in the mid-range and flattens near 100%. Doubling your sample size does not double your power.
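The 1/sqrt(n) scaling is easy to verify directly. In this short sketch, standard_error is an illustrative helper for a single conversion rate estimate:

```python
from math import sqrt


def standard_error(p, n):
    """Standard error of an estimated conversion rate p from n visitors."""
    return sqrt(p * (1 - p) / n)


# Each 4x increase in sample size halves the standard error
for n in (1000, 4000, 16000):
    print(n, round(standard_error(0.05, n), 5))
```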
What is the relationship between significance level and power?
They are inversely related at fixed sample size. Stricter alpha (e.g., 0.01 instead of 0.05) reduces false positives but also reduces power. To maintain 80% power at alpha = 0.01, you need roughly 50% more traffic than at alpha = 0.05. This calculator lets you compare power curves at different significance levels to find the right balance.
Related A/B Testing Tools
- Sample Size Calculator — Quick sample size estimation for A/B tests
- Bayesian A/B Testing Guide — Learn the Beta-Binomial method for A/B testing
- Bayesian A/B Test Calculator — Posterior distribution visualization and probability of winning
- P-Value Calculator — Calculate statistical significance for your test results
About the Author
Built by Michael Lip, a solo developer with 10+ years of experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ Chrome extensions and the Zovo developer tools collection.