Statistical Significance Explained
Explore p-values interactively with distribution overlap visualization, challenge your understanding with a 10-question misconceptions quiz, and calculate effect sizes with Cohen's d, h, and w. All computation runs in your browser.
P-Value Interactive Explorer
Adjust sample size, effect size, and alpha to see how two distributions overlap and where the p-value region falls.
P-Value Misconceptions Quiz
Test your understanding of statistical significance. 10 questions on common p-value myths and misinterpretations.
Confidence Interval Visualizer
Enter your sample data and visualize confidence intervals at different levels. See how sample size and variance affect interval width.
Effect Size Calculator
Calculate Cohen's d (means), Cohen's h (proportions), and Cohen's w (chi-square) with magnitude interpretation.
Cohen's d: standardized mean difference between two groups.
What Statistical Significance Really Means
Statistical significance is one of the most important and most misunderstood concepts in data-driven decision making. At its core, it is a statement about the compatibility of observed data with a null hypothesis. When a result is "statistically significant at the 0.05 level," it means that if the null hypothesis were true, there would be less than a 5% probability of observing data at least as extreme as what was collected. That is the entire claim. It says nothing about the probability that the null hypothesis is true, nothing about the size of the effect, and nothing about whether the result matters in practice.
The concept was formalized in the early 20th century by Ronald Fisher, Jerzy Neyman, and Egon Pearson, though their frameworks were fundamentally different. Fisher viewed the p-value as a continuous measure of evidence against the null hypothesis, while Neyman and Pearson developed the hypothesis testing framework with fixed error rates (Type I and Type II errors). Modern practice awkwardly merges both approaches, using Fisher's p-values within Neyman-Pearson's decision framework, which creates confusion that persists to this day.
The Anatomy of a P-Value
A p-value is calculated in three steps. First, you define a null hypothesis, typically that there is no effect or no difference. Second, you compute a test statistic that summarizes how far your observed data deviate from what the null hypothesis predicts. Third, you calculate the probability of observing a test statistic at least as extreme as yours, assuming the null hypothesis is true. This probability is the p-value.
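The three steps can be sketched in a few lines of Python. This is a minimal illustration, assuming a two-sided z-test on a difference in means with a known, common standard deviation and equal group sizes; the function name and the example numbers are hypothetical and are not taken from the tool itself.

```python
import math
from statistics import NormalDist


def two_sample_z_p_value(mean_a, mean_b, sd, n_per_group):
    """Two-sided p-value for a difference in means (known common SD assumed)."""
    # Step 1: the null hypothesis says the true difference in means is zero.
    # Step 2: the test statistic is the observed difference divided by its
    # standard error under that null hypothesis.
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = (mean_b - mean_a) / standard_error
    # Step 3: probability of a statistic at least as extreme as |z|,
    # computed under the null (standard normal) distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = two_sample_z_p_value(mean_a=100.0, mean_b=103.0, sd=15.0, n_per_group=200)
print(f"z = {z:.3f}, p = {p:.4f}")  # z = 2.000, p = 0.0455
```

With real data the standard deviation is usually estimated, in which case a t-test would be the more appropriate choice than this z-test sketch.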
The p-value explorer above visualizes this process. The gray curve represents the null hypothesis distribution, which is the sampling distribution of the test statistic if there is truly no effect. The teal shaded region represents values more extreme than the critical value. When the observed test statistic falls in this region, the p-value is below alpha and the result is deemed statistically significant. The orange curve represents the alternative hypothesis, showing where the test statistic would likely fall if the true effect equals the Cohen's d you selected. The overlap between these curves determines statistical power: power is the share of the orange curve that lies beyond the critical value.
Understanding this geometry is essential. When you increase the sample size, both distributions become narrower (lower standard error), which separates them further and increases power. When you increase the effect size, the alternative distribution shifts further from the null, again increasing separation. When you decrease alpha, the critical value moves outward, making significance harder to achieve but reducing false positives.
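The same geometry can be expressed numerically. The sketch below assumes a two-sided, two-sample z-test with equal group sizes and works on the standardized scale, where the curves keep a fixed width and a larger sample instead pushes the alternative curve further from zero (an equivalent view of the narrowing described above). The function name is hypothetical.

```python
import math
from statistics import NormalDist


def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for effect size d."""
    std_normal = NormalDist()
    # Critical value: where the rejection region starts on each side.
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Under the alternative, the z statistic is centred at d * sqrt(n / 2).
    shift = d * math.sqrt(n_per_group / 2.0)
    # Power is the area of the alternative curve beyond either critical value.
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


for n in (20, 64, 200):
    print(f"n = {n:>3} per group, d = 0.5: power = {two_sample_power(0.5, n):.3f}")
print(f"stricter alpha: power = {two_sample_power(0.5, 64, alpha=0.01):.3f}")
```

Running it shows power rising with sample size and falling when alpha is lowered, matching the three relationships in the paragraph above.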
Ten Things P-Values Do NOT Tell You
The misconceptions quiz in this tool targets the ten most common misinterpretations of p-values, drawn from research by Goodman (2008), Greenland et al. (2016), and the American Statistical Association's 2016 statement on p-values. Here is a summary of what p-values do not mean:
First, the p-value is not the probability that the null hypothesis is true. This is the most pervasive error. The p-value is P(data at least this extreme | H0), not P(H0 | data). Converting from one to the other requires Bayes' theorem and a prior probability for H0, which the frequentist framework does not provide. Second, 1 minus the p-value is not the probability that the alternative hypothesis is true. Third, a p-value of 0.05 does not mean there is a 5% chance the result is a false positive. The probability that a significant result is a false positive depends on the base rate of true effects and the power of the study, not just alpha.
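To see why the third point holds, a small Bayes' theorem calculation helps. This sketch assumes you can put a number on the base rate of true effects (here called `prior_true_effect`, a hypothetical parameter) and treats "significant" as "p below alpha"; it is an illustration of the reasoning, not a feature of the tool.

```python
def false_positive_risk(prior_true_effect, power, alpha=0.05):
    """Share of significant results that are false positives (Bayes' theorem)."""
    # Significant results that come from true nulls.
    false_positives = alpha * (1.0 - prior_true_effect)
    # Significant results that come from real effects.
    true_positives = power * prior_true_effect
    return false_positives / (false_positives + true_positives)


# Assumed scenario: 10% of tested ideas have a real effect, power is 80%.
risk = false_positive_risk(prior_true_effect=0.10, power=0.80, alpha=0.05)
print(f"False positive risk: {risk:.2f}")  # False positive risk: 0.36
```

Under these assumed inputs, more than a third of significant results would be false positives even though alpha is 0.05.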
Fourth, a non-significant p-value does not mean there is no effect. It means the data are compatible with the null hypothesis, but they may also be compatible with meaningful effects, especially if the study was underpowered. Fifth, a significant p-value does not mean the effect is large or important. Sixth, two studies with p = 0.04 and p = 0.06 do not provide meaningfully different evidence. Seventh, the p-value does not measure the reliability or replicability of a finding. Eighth, "highly significant" results (p = 0.001) are not necessarily more important than marginally significant ones (p = 0.04). Ninth, p-values from different studies are not directly comparable without accounting for power and effect sizes. Tenth, the p-value does not account for multiple comparisons, data dredging, or selective reporting unless explicitly adjusted.
Confidence Intervals: The Better Communicator
Confidence intervals convey the same statistical information as p-values but in a more interpretable format. A 95% confidence interval provides a range of parameter values that are compatible with the observed data. If the interval excludes the null hypothesis value, the result is significant at the 0.05 level. But the interval also communicates the precision of the estimate and the range of plausible effect sizes, something a p-value alone cannot do.
The confidence interval visualizer in this tool shows how the interval width depends on sample size, variability, and confidence level. The relationship is governed by the formula: CI = mean ± z * (SD / sqrt(n)). Doubling the sample size narrows the interval by a factor of sqrt(2), roughly 29%. Increasing from 95% to 99% confidence widens the interval by about 31% because the critical z-value increases from 1.96 to 2.576.
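The formula and both width comparisons can be checked directly. This is a minimal sketch of the normal-approximation interval from the formula above; the function names and the example inputs are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Normal-approximation interval: mean ± z * (sd / sqrt(n))."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


def interval_width(interval):
    low, high = interval
    return high - low


base = interval_width(z_confidence_interval(50.0, 10.0, n=100))
doubled_n = interval_width(z_confidence_interval(50.0, 10.0, n=200))
higher_level = interval_width(z_confidence_interval(50.0, 10.0, n=100, confidence=0.99))

print(f"Doubling n shrinks the width by {1 - doubled_n / base:.1%}")
print(f"Going from 95% to 99% widens it by {higher_level / base - 1:.1%}")
```

For small samples with an estimated standard deviation, a t critical value would replace z and give a somewhat wider interval.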
A common misinterpretation of confidence intervals mirrors the p-value confusion: a 95% CI does not mean there is a 95% probability the true parameter lies within the interval. It means that if you repeated the experiment many times, 95% of the constructed intervals would contain the true parameter. Any single interval either contains the true value or it does not. This frequentist interpretation is admittedly awkward, which is why Bayesian credible intervals, which do provide direct probability statements about parameters, are gaining popularity.
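The repeated-experiment interpretation is easy to simulate. The sketch below draws many samples from a normal population with a known mean and standard deviation (both assumed values), builds a 95% interval from each, and counts how often the interval contains the true mean. The function name, defaults, and seed are hypothetical.

```python
import math
import random
from statistics import NormalDist


def coverage_rate(true_mean=50.0, true_sd=10.0, n=30,
                  confidence=0.95, trials=5000, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # The population SD is treated as known, matching the z-based formula.
    margin = z * true_sd / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, true_sd) for _ in range(n)) / n
        # Each individual interval either contains the true mean or it does not.
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / trials


print(f"Intervals containing the true mean: {coverage_rate():.1%}")
```

The printed share should land close to 95%, which is a property of the procedure across repetitions rather than of any one interval.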
Effect Sizes: The Missing Piece
Effect sizes answer the question that p-values cannot: how large is the observed difference? Cohen's d, h, and w are standardized effect size measures that allow comparisons across different scales and contexts. Cohen's d divides the mean difference by the pooled standard deviation, yielding a scale-free measure. A Cohen's d of 0.5 means the two groups differ by half a standard deviation, regardless of whether you are measuring response time in milliseconds or revenue in dollars.
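As a rough sketch of the calculation described above, Cohen's d with a pooled standard deviation can be written as follows. The function name and sample data are made up for illustration; the second call rescales the data to show that d does not depend on the measurement unit.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / math.sqrt(pooled_var)


seconds_a = [1.0, 2.0, 3.0, 4.0, 5.0]
seconds_b = [2.0, 3.0, 4.0, 5.0, 6.0]
millis_a = [x * 1000 for x in seconds_a]
millis_b = [x * 1000 for x in seconds_b]

print(f"d in seconds:      {cohens_d(seconds_a, seconds_b):.3f}")
print(f"d in milliseconds: {cohens_d(millis_a, millis_b):.3f}")
```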
Cohen's h applies the arcsine transformation to proportions before computing the difference. This is necessary because the variance of a proportion depends on the proportion itself, so the same absolute difference (e.g., 3 percentage points) has different statistical properties at a 5% baseline versus a 50% baseline. Cohen's h normalizes this, making effects comparable across different baseline rates. In A/B testing, this is particularly valuable when comparing experiments across pages with very different conversion rates.
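A short sketch makes the baseline dependence concrete. It uses the standard definition h = 2·asin(sqrt(p2)) − 2·asin(sqrt(p1)); the function name and the conversion rates are illustrative assumptions.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    # The arcsine square-root transform stabilizes the variance of a proportion.
    return 2.0 * math.asin(math.sqrt(p2)) - 2.0 * math.asin(math.sqrt(p1))


# The same 3-percentage-point lift at two very different baselines.
h_low_baseline = cohens_h(0.05, 0.08)
h_mid_baseline = cohens_h(0.50, 0.53)

print(f"5% -> 8%:   h = {h_low_baseline:.3f}")
print(f"50% -> 53%: h = {h_mid_baseline:.3f}")
```

The lift from a 5% baseline yields roughly twice the h of the same absolute lift from a 50% baseline.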
Cohen's w is the effect size for chi-square tests. It is defined as the square root of the chi-square statistic divided by the sample size: w = sqrt(chi-square / n). Like d and h, it has conventional thresholds, though the cutoffs differ: 0.1 (small), 0.3 (medium), and 0.5 (large) for w, compared with 0.2, 0.5, and 0.8 for d and h. The effect size calculator in this tool computes all three measures and provides magnitude interpretations based on Cohen's 1988 benchmarks, while noting that these benchmarks are domain-dependent and should be calibrated to your specific field.
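The definition of w can be sketched for a contingency table of observed counts. This example computes the Pearson chi-square statistic by hand, without a continuity correction, and then applies w = sqrt(chi-square / n). The function names and the counts are hypothetical.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square for a table of observed counts (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cohens_w(table):
    chi2, n = chi_square_statistic(table)
    return math.sqrt(chi2 / n)


# Rows: variant A, variant B. Columns: converted, did not convert.
observed = [[30, 70], [50, 50]]
print(f"Cohen's w = {cohens_w(observed):.3f}")
```

For this table the result falls between the small (0.1) and medium (0.3) benchmarks.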
Practical Recommendations for A/B Testers
Report p-values alongside confidence intervals and effect sizes. Never make decisions based on p-values alone. Pre-register your hypothesis, sample size, and analysis plan before running the test. Use power analysis to ensure your test can detect the minimum effect size that matters to your business. Adjust for multiple comparisons when testing multiple metrics or variants. Consider Bayesian methods when you need direct probability statements about your hypotheses. And always remember that statistical significance alone does not establish practical significance.
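As one possible way to adjust for multiple comparisons, the sketch below applies the Bonferroni correction, which multiplies each p-value by the number of tests. Bonferroni is chosen here only because it is the simplest option; the text does not prescribe a specific method, and the function name and p-values are made up.

```python
def bonferroni_adjust(p_values, alpha=0.05):
    """Return (adjusted p-value, still significant?) for each input p-value."""
    m = len(p_values)
    results = []
    for p in p_values:
        # Multiply by the number of tests, capping at 1.0.
        adjusted = min(1.0, p * m)
        results.append((adjusted, adjusted <= alpha))
    return results


# Hypothetical p-values for three metrics measured in the same experiment.
metric_p_values = [0.01, 0.04, 0.03]
for raw, (adjusted, significant) in zip(metric_p_values, bonferroni_adjust(metric_p_values)):
    print(f"raw p = {raw:.2f}, adjusted p = {adjusted:.2f}, significant: {significant}")
```

Bonferroni is conservative; less strict procedures such as Holm's method exist and may suit tests with many metrics.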
Frequently Asked Questions
What does a p-value actually tell you?
A p-value is the probability of observing data at least as extreme as the collected data, assuming the null hypothesis is true. It does NOT tell you the probability that the null hypothesis is true, nor the probability that your result occurred by chance. A p-value of 0.03 means that if there were truly no effect, you would see data this extreme only 3% of the time.
Is a p-value of 0.05 a reliable threshold for significance?
The 0.05 threshold is a convention, not a law of nature. P-values of 0.049 and 0.051 represent nearly identical evidence. Many fields now recommend reporting exact p-values with effect sizes and confidence intervals. Some researchers have proposed an alpha of 0.005 for claims of new discoveries.
What is the difference between statistical significance and practical significance?
Statistical significance means the observed data would be unlikely if there were truly no effect. Practical significance means the effect is large enough to matter. With a large enough sample, trivially small effects become statistically significant. Always pair p-values with effect size measures.
Why do researchers misinterpret p-values so often?
P-value misinterpretation is widespread because the definition requires reasoning about the probability of data given a hypothesis, not the probability of a hypothesis given data (which is what most people want to know). Bayesian methods directly answer the question most people are actually asking.
How do effect sizes like Cohen's d complement p-values?
Effect sizes quantify how large an observed difference is, independent of sample size. Cohen's d of 0.5 means the groups differ by half a standard deviation. Unlike p-values, effect size estimates do not systematically shrink with more data; they only become more precise. Reporting both gives a complete picture: the p-value addresses whether the data are compatible with no effect, and the effect size addresses whether the effect matters.
Related A/B Testing Tools
- P-Value Calculator — Calculate statistical significance for your test results
- Power Analysis Calculator — Determine required sample size with power curves
- Confidence Interval Calculator — Compute CIs for proportions and means
- Chi-Square Test Calculator — Test independence and goodness-of-fit
About the Author
Built by Michael Lip — Solo developer with 10+ years experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ Chrome extensions and the Zovo developer tools collection.