Statistical Significance Explained

Q: Is a p-value of 0.05 a reliable threshold for significance?

The 0.05 threshold is a convention, not a law of nature. Ronald Fisher originally proposed it as a convenient reference point, not a rigid cutoff. P-values of 0.049 and 0.051 represent nearly identical evidence, yet one is labeled 'significant' and the other is not. Many fields are moving toward reporting exact p-values with effect sizes and confidence intervals rather than relying on binary significant/not-significant decisions. Some researchers have proposed a stricter alpha of 0.005 for claims of new discoveries.

Explore p-values interactively with distribution overlap visualization, challenge your understanding with a 10-question misconceptions quiz, and calculate effect sizes with Cohen's d, h, and w. All computation runs in your browser.

P-Value Interactive Explorer

Adjust sample size, effect size, and alpha to see how two distributions overlap and where the p-value region falls.

Inputs: Sample Size (per group), True Effect Size (Cohen's d), Significance Level (alpha), Test Type

Chart: Distribution Overlap, Null vs. Alternative Hypothesis

Outputs: Observed Test Statistic (z), P-Value, Critical Value, Statistical Power, Decision

The shaded teal region is the rejection region, where the test statistic yields a p-value below alpha. The orange curve is the alternative hypothesis distribution. Greater separation between the curves means higher power and, on average, smaller p-values.

P-Value Misconceptions Quiz

Test your understanding of statistical significance. 10 questions on common p-value myths and misinterpretations.

Confidence Interval Visualizer

Enter your sample data and visualize confidence intervals at different levels. See how sample size and variance affect interval width.

Inputs: Sample Mean, Sample Standard Deviation, Sample Size, Null Hypothesis Value

Chart: Confidence Intervals at 90%, 95%, and 99%

Outputs: 90% CI, 95% CI, 99% CI, Standard Error, Margin of Error (95%)

Wider intervals mean more uncertainty. If the null hypothesis value (dashed line) falls outside a confidence interval, the result is significant at the corresponding alpha (for example, 0.05 for the 95% interval).

Effect Size Calculator

Calculate Cohen's d (means), Cohen's h (proportions), and Cohen's w (chi-square) with magnitude interpretation.

Cohen's d: standardized mean difference between two groups.

Inputs: Group 1 Mean, Group 2 Mean, Group 1 SD, Group 2 SD

Outputs: Effect Size, Magnitude, Interpretation

Cohen's d: 0.2 = small, 0.5 = medium, 0.8 = large. Cohen's h: same thresholds via arcsine transform. Cohen's w: 0.1 = small, 0.3 = medium, 0.5 = large.

What Statistical Significance Really Means

Statistical significance is one of the most important and most misunderstood concepts in data-driven decision making. At its core, statistical significance is a statement about the compatibility of observed data with a null hypothesis. When a result is "statistically significant at the 0.05 level," it means that if the null hypothesis were true, there would be less than a 5% probability of observing data at least as extreme as what was collected. That is the entire claim. It says nothing about the probability that the null hypothesis is true, nothing about the size of the effect, and nothing about whether the result matters in practice.

The concept was formalized in the early 20th century by Ronald Fisher, Jerzy Neyman, and Egon Pearson, though their frameworks were fundamentally different. Fisher viewed the p-value as a continuous measure of evidence against the null hypothesis, while Neyman and Pearson developed the hypothesis testing framework with fixed error rates (Type I and Type II errors). Modern practice awkwardly merges both approaches, using Fisher's p-values within Neyman-Pearson's decision framework, which creates confusion that persists to this day.

The Anatomy of a P-Value

A p-value is calculated in three steps. First, you define a null hypothesis, typically that there is no effect or no difference. Second, you compute a test statistic that summarizes how far your observed data deviate from what the null hypothesis predicts. Third, you calculate the probability of observing a test statistic at least as extreme as yours, assuming the null hypothesis is true. This probability is the p-value.
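The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own implementation: it assumes a two-sided z-test for a difference in means with a known standard deviation shared by both groups, and the function name and example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean_a, mean_b, sd, n_per_group):
    """Two-sided z-test for a difference in means (known, shared SD assumed)."""
    # Step 1: the null hypothesis is that the true difference is zero.
    null_difference = 0.0
    # Step 2: the test statistic is the observed difference in standard-error units.
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = ((mean_b - mean_a) - null_difference) / standard_error
    # Step 3: probability of a statistic at least this extreme if the null is true.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: means of 100 and 103, SD of 10, 50 observations per group.
z, p = two_sample_z_test(mean_a=100.0, mean_b=103.0, sd=10.0, n_per_group=50)
print(f"z = {z:.3f}, p = {p:.4f}")  # z = 1.500, p = 0.1336
```

With small samples or an estimated standard deviation, a t-test would be the more appropriate choice; the z-test is used here only because it keeps the arithmetic transparent.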

The p-value explorer above visualizes this process. The gray curve represents the null hypothesis distribution, which is the sampling distribution of the test statistic if there is truly no effect. The teal shaded region represents values more extreme than the critical value. When the observed test statistic falls in this region, the p-value is below alpha and the result is deemed statistically significant. The orange curve represents the alternative hypothesis, showing where the test statistic would likely fall if the true effect equals the Cohen's d you entered. Statistical power is the share of the orange curve that lies in the teal region, so less overlap between the two curves means more power.

Understanding this geometry is essential. When you increase the sample size, both distributions become narrower (lower standard error), which reduces their overlap and increases power. When you increase the effect size, the alternative distribution shifts further from the null, again increasing separation. When you decrease alpha, the critical value moves outward, making significance harder to achieve but lowering the false positive rate when the null hypothesis is true.
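The effect of each control on power can be checked numerically. The sketch below assumes a two-sided, two-sample z-test with equal group sizes and uses the normal approximation; the function name is hypothetical and this is not necessarily how the explorer computes its power figure.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample_z(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (illustrative)."""
    std_normal = NormalDist()
    # Critical value: the boundary of the rejection region under the null.
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Center of the alternative distribution on the z scale.
    shift = d * sqrt(n_per_group / 2.0)
    # Power is the mass of the alternative distribution beyond either critical value.
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

baseline = power_two_sample_z(d=0.5, n_per_group=64)
more_data = power_two_sample_z(d=0.5, n_per_group=128)
bigger_effect = power_two_sample_z(d=0.8, n_per_group=64)
stricter_alpha = power_two_sample_z(d=0.5, n_per_group=64, alpha=0.01)

print(f"baseline:       {baseline:.3f}")
print(f"double n:       {more_data:.3f}")
print(f"larger effect:  {bigger_effect:.3f}")
print(f"alpha = 0.01:   {stricter_alpha:.3f}")
```

Running it shows power rising with sample size and effect size and falling when alpha is tightened, which matches the geometric description.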

Ten Things P-Values Do NOT Tell You

The misconceptions quiz in this tool targets the ten most common misinterpretations of p-values, drawn from research by Goodman (2008), Greenland et al. (2016), and the American Statistical Association's 2016 statement on p-values. Here is a summary of what p-values do not mean:

First, the p-value is not the probability that the null hypothesis is true. This is the most pervasive error. The p-value is P(data at least this extreme | H0), not P(H0 | data). Converting from one to the other requires Bayes' theorem and a prior probability for H0, which the frequentist framework does not provide. Second, 1 minus the p-value is not the probability that the alternative hypothesis is true. Third, a p-value of 0.05 does not mean there is a 5% chance the result is a false positive. The false positive rate depends on the base rate of true effects and the power of the study, not just alpha.

Fourth, a non-significant p-value does not mean there is no effect. It means the data are compatible with the null hypothesis, but they may also be compatible with meaningful effects, especially if the study was underpowered. Fifth, a significant p-value does not mean the effect is large or important. Sixth, two studies with p = 0.04 and p = 0.06 do not provide meaningfully different evidence. Seventh, the p-value does not measure the reliability or replicability of a finding. Eighth, "highly significant" results (p = 0.001) are not necessarily more important than marginally significant ones (p = 0.04). Ninth, p-values from different studies are not directly comparable without accounting for power and effect sizes. Tenth, the p-value does not account for multiple comparisons, data dredging, or selective reporting unless explicitly adjusted.
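The third point, that the share of false positives among significant results depends on the base rate and on power, can be made concrete with a short calculation. This is an illustrative sketch: the function name and the 10% base rate and 80% power are assumed values, not figures from the cited papers.

```python
def false_positive_share(prior_true, power, alpha=0.05):
    """Share of significant results that are false positives (illustrative)."""
    # Tested hypotheses where a real effect exists and the test detects it.
    true_positives = prior_true * power
    # Tested hypotheses with no real effect that still cross the alpha threshold.
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)

# Assumed scenario: 10% of tested ideas have a real effect, tests have 80% power.
share = false_positive_share(prior_true=0.10, power=0.80, alpha=0.05)
print(f"{share:.0%} of significant results are false positives")  # 36%
```

Under these assumptions, more than a third of significant results are false positives even though alpha is 0.05.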

Confidence Intervals: The Better Communicator

Confidence intervals convey the same statistical information as p-values but in a more interpretable format. A 95% confidence interval provides a range of parameter values that are compatible with the observed data. If the interval excludes the null hypothesis value, the result is significant at the 0.05 level. But the interval also communicates the precision of the estimate and the range of plausible effect sizes, something a p-value alone cannot do.

The confidence interval visualizer in this tool shows how the interval width depends on sample size, variability, and confidence level. The relationship is governed by the formula CI = mean ± z × (SD / sqrt(n)). Doubling the sample size divides the interval width by sqrt(2), a reduction of roughly 29%. Increasing from 95% to 99% confidence widens the interval by about 31% because the critical z-value increases from 1.96 to 2.576.
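The formula and both ratios can be verified with a small script. This sketch uses the z-based interval exactly as written in the formula; the function name and sample values are hypothetical, and for small samples a t-based critical value would normally replace z.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(mean, sd, n, level=0.95):
    """z-based confidence interval for a mean: mean +/- z * (sd / sqrt(n))."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin

def width(interval):
    low, high = interval
    return high - low

# Hypothetical sample: mean 50, SD 10, n = 100 (standard error = 1).
for level in (0.90, 0.95, 0.99):
    low, high = z_confidence_interval(50.0, 10.0, 100, level)
    print(f"{level:.0%} CI: [{low:.2f}, {high:.2f}]")

ci_95 = z_confidence_interval(50.0, 10.0, 100, 0.95)
ci_95_double_n = z_confidence_interval(50.0, 10.0, 200, 0.95)
ci_99 = z_confidence_interval(50.0, 10.0, 100, 0.99)

print(f"width ratio after doubling n: {width(ci_95_double_n) / width(ci_95):.3f}")
print(f"width ratio 99% vs 95%:       {width(ci_99) / width(ci_95):.3f}")
```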

A common misinterpretation of confidence intervals mirrors the p-value confusion: a 95% CI does not mean there is a 95% probability the true parameter lies within the interval. It means that if you repeated the experiment many times, 95% of the constructed intervals would contain the true parameter. Any single interval either contains the true value or it does not. This frequentist interpretation is admittedly awkward, which is why Bayesian credible intervals, which do provide direct probability statements about parameters, are gaining popularity.
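The repeated-sampling interpretation is easy to simulate. The sketch below draws many samples from a normal population with a known mean, builds a 95% interval from each, and counts how often the interval contains the true mean. The function name, population parameters, and seed are arbitrary assumptions, and the standard deviation is treated as known so that the z-based interval has exact 95% coverage.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def simulate_coverage(true_mean=10.0, sd=2.0, n=30, level=0.95,
                      trials=2000, seed=1):
    """Fraction of simulated z-intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    margin = z * sd / sqrt(n)  # SD is treated as known in this sketch
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = fmean(sample)
        # Each individual interval either contains the true mean or it does not.
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / trials

coverage = simulate_coverage()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95, which is a property of the procedure over many repetitions rather than of any single interval.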

Effect Sizes: The Missing Piece

Effect sizes answer the question that p-values cannot: how large is the observed difference? Cohen's d, h, and w are standardized effect size measures that allow comparisons across different scales and contexts. Cohen's d divides the mean difference by the pooled standard deviation, yielding a scale-free measure. A Cohen's d of 0.5 means the two groups differ by half a standard deviation, regardless of whether you are measuring response time in milliseconds or revenue in dollars.
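A minimal sketch of the Cohen's d calculation follows. Because the calculator described above takes only two means and two standard deviations, the sketch assumes equal group sizes, so the pooled standard deviation is the root mean square of the two SDs; the function names are hypothetical and the thresholds are Cohen's conventional 0.2, 0.5, and 0.8.

```python
from math import sqrt

def cohens_d(mean_1, mean_2, sd_1, sd_2):
    """Cohen's d with a pooled SD, assuming equal group sizes."""
    pooled_sd = sqrt((sd_1 ** 2 + sd_2 ** 2) / 2.0)
    return (mean_2 - mean_1) / pooled_sd

def magnitude(effect, small=0.2, medium=0.5, large=0.8):
    """Label an effect size using conventional benchmark thresholds."""
    size = abs(effect)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "negligible"

# Hypothetical groups: means of 100 and 105, both with SD 10.
d = cohens_d(mean_1=100.0, mean_2=105.0, sd_1=10.0, sd_2=10.0)
print(f"d = {d:.2f} ({magnitude(d)})")  # d = 0.50 (medium)
```

With unequal group sizes, the pooled SD would instead weight each variance by its degrees of freedom.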

Cohen's h applies the arcsine transformation to proportions before computing the difference. This is necessary because the variance of a proportion depends on the proportion itself, so the same absolute difference (e.g., 3 percentage points) has different statistical properties at a 5% baseline versus a 50% baseline. Cohen's h normalizes this, making effects comparable across different baseline rates. In A/B testing, this is particularly valuable when comparing experiments across pages with very different conversion rates.
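The baseline dependence described above shows up directly in a short example. The sketch below uses the standard definition h = 2·asin(sqrt(p2)) − 2·asin(sqrt(p1)); the function name and conversion rates are hypothetical.

```python
from math import asin, sqrt

def cohens_h(p_1, p_2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2.0 * asin(sqrt(p_2)) - 2.0 * asin(sqrt(p_1))

# The same 3-percentage-point lift at two different baselines.
h_low_baseline = cohens_h(0.05, 0.08)
h_high_baseline = cohens_h(0.50, 0.53)

print(f"5% -> 8%:   h = {h_low_baseline:.3f}")
print(f"50% -> 53%: h = {h_high_baseline:.3f}")
```

The lift from 5% to 8% yields roughly twice the h of the lift from 50% to 53%, even though the absolute difference is identical.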

Cohen's w is the effect size for chi-square tests. It is defined as the square root of the chi-square statistic divided by the sample size: w = sqrt(chi-square / n). Like d and h, it has conventional thresholds: 0.1 (small), 0.3 (medium), and 0.5 (large). The effect size calculator in this tool computes all three measures and provides magnitude interpretations based on Cohen's 1988 benchmarks, while noting that these benchmarks are domain-dependent and should be calibrated to your specific field.
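As an illustration of the definition w = sqrt(chi-square / n), the sketch below computes the Pearson chi-square statistic for a contingency table by hand and converts it to w. The function name and the 2x2 table are hypothetical, and no continuity correction is applied.

```python
from math import sqrt

def cohens_w(table):
    """Cohen's w from a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    return sqrt(chi_square / n)

# Hypothetical A/B result: rows are variants, columns are converted / not converted.
observed_counts = [[30, 70],
                   [50, 50]]
w = cohens_w(observed_counts)
print(f"w = {w:.3f}")  # w = 0.204
```

A value of about 0.2 sits between the conventional small (0.1) and medium (0.3) benchmarks.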

Practical Recommendations for A/B Testers

Report p-values alongside confidence intervals and effect sizes. Never make decisions based on p-values alone. Pre-register your hypothesis, sample size, and analysis plan before running the test. Use power analysis to ensure your test can detect the minimum effect size that matters to your business. Adjust for multiple comparisons when testing multiple metrics or variants. Consider Bayesian methods when you need direct probability statements about your hypotheses. And always remember that statistical significance is a necessary but insufficient condition for practical significance.
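The power-analysis recommendation can be sketched as a sample-size calculation. This is a rough normal-approximation formula for a two-sided, two-sample comparison of means with equal group sizes, n = 2 × ((z_alpha + z_power) / d)²; the function name is hypothetical, and dedicated power-analysis software will give slightly larger numbers because it uses the t distribution.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n to detect effect size d (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)              # quantile for the target power
    return ceil(2.0 * ((z_alpha + z_power) / d) ** 2)

for effect in (0.8, 0.5, 0.2):
    n = sample_size_per_group(effect)
    print(f"d = {effect}: about {n} per group")
```

Halving the minimum detectable effect roughly quadruples the required sample size, which is why the minimum effect that matters to the business should be fixed before the test starts.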

Last updated: May 25, 2026

Frequently Asked Questions

What does a p-value actually tell you?

A p-value is the probability of observing data at least as extreme as the collected data, assuming the null hypothesis is true. It does NOT tell you the probability that the null hypothesis is true, nor the probability that your result occurred by chance. A p-value of 0.03 means that if there were truly no effect, you would see data this extreme only 3% of the time.

Is a p-value of 0.05 a reliable threshold for significance?

The 0.05 threshold is a convention, not a law of nature. P-values of 0.049 and 0.051 represent nearly identical evidence. Many fields now recommend reporting exact p-values with effect sizes and confidence intervals. Some researchers have proposed a stricter alpha of 0.005 for claims of new discoveries.

What is the difference between statistical significance and practical significance?

Statistical significance means the observed data would be unlikely if there were truly no effect. Practical significance means the effect is large enough to matter. With a large enough sample, trivially small effects become statistically significant. Always pair p-values with effect size measures.

Why do researchers misinterpret p-values so often?

P-value misinterpretation is widespread because the definition requires reasoning about the probability of data given a hypothesis, not the probability of a hypothesis given data (which is what most people want to know). Bayesian methods directly answer the question most people are actually asking.

How do effect sizes like Cohen's d complement p-values?

Effect sizes quantify how large an observed difference is on a scale that does not depend on sample size. Cohen's d of 0.5 means the groups differ by half a standard deviation. For a fixed nonzero effect, the p-value shrinks as more data are collected, while the effect size estimate only becomes more precise. Reporting both gives a more complete picture: the p-value addresses whether the data are compatible with no effect, and the effect size addresses whether the effect is large enough to matter.

Related A/B Testing Tools

P-Value Calculator — Calculate statistical significance for your test results
Power Analysis Calculator — Determine required sample size with power curves
Confidence Interval Calculator — Compute CIs for proportions and means
Chi-Square Test Calculator — Test independence and goodness-of-fit

About the Author

Built by Michael Lip — Solo developer with 10+ years experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ Chrome extensions and the Zovo developer tools collection.