P-Value Calculator for A/B Tests

Calculate p-values using the two-proportion Z-test. Understand what your p-value actually means and avoid the most common misinterpretations in A/B testing.

What P-Values Mean in A/B Testing

The p-value quantifies the strength of evidence against the null hypothesis — the assumption that both variants have the same true conversion rate. Formally, it is the probability of observing a test statistic at least as extreme as the one computed from your data, given that the null hypothesis is true. A small p-value (below your pre-set threshold) indicates that the observed data would be unlikely under the null hypothesis, providing grounds to reject it.

Critically, the p-value is not the probability that your hypothesis is true or false. It does not tell you the magnitude of the effect, the practical significance of the result, or the probability that the variant is better than the control. These are separate questions that require additional analysis such as confidence intervals, effect size estimation, or Bayesian posterior probabilities.

The Two-Proportion Z-Test Formula

For comparing two conversion rates, the Z-test statistic is computed as:

Z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

Where p1 and p2 are the observed conversion rates for control and variant, n1 and n2 are the sample sizes, and p_pool is the pooled conversion rate: (x1 + x2) / (n1 + n2), where x1 and x2 are the number of conversions. The pooled estimate is used because under the null hypothesis, both groups share the same true rate.
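The formula above translates directly into code. Here is a minimal Python sketch (the function name and signature are illustrative, not ABWex's API):

```python
import math

def z_statistic(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-proportion Z statistic with a pooled variance estimate.

    x1, x2: conversion counts; n1, n2: sample sizes for control and variant.
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # shared rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

For example, 50 conversions out of 1,000 visitors against 70 out of 1,000 gives a Z-score of about -1.88.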

The Z-score follows a standard normal distribution under the null hypothesis. The p-value is computed by looking up the Z-score in the normal CDF. For a two-tailed test, p = 2 * (1 - Phi(|Z|)), where Phi is the standard normal cumulative distribution function. ABWex uses the Abramowitz and Stegun rational approximation for the normal CDF, which has an error of less than 7.5 * 10^-8.
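A standard form of that rational approximation (Abramowitz and Stegun, formula 26.2.17) can be sketched as follows — this is a generic implementation of the published formula, not ABWex's internal code, and the function names are illustrative:

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF via the Abramowitz & Stegun rational
    approximation 26.2.17 (absolute error below 7.5e-8)."""
    if x < 0:
        return 1.0 - phi(-x)  # symmetry: Phi(-x) = 1 - Phi(x)
    t = 1.0 / (1.0 + 0.2316419 * x)
    poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
           + t * (-1.821255978 + t * 1.330274429))))
    pdf = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return 1.0 - pdf * poly

def p_value_two_tailed(z: float) -> float:
    """p = 2 * (1 - Phi(|Z|))."""
    return 2.0 * (1.0 - phi(abs(z)))
```

A Z-score of 1.96 yields a two-tailed p-value of about 0.05, matching the familiar critical value.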

Common Significance Thresholds

The significance level (alpha) is the threshold below which you reject the null hypothesis. It represents the maximum false positive rate you are willing to accept. The three most common thresholds in practice are:

- alpha = 0.05 — the conventional default for most A/B tests
- alpha = 0.01 — for high-stakes decisions where false positives are costly
- alpha = 0.10 — for low-risk exploratory tests where a higher false positive rate is acceptable

P-Value Misconceptions

The p-value is the single most misunderstood concept in applied statistics. Here are the five most common misconceptions and their corrections:

Misconception 1: The p-value is the probability that the null hypothesis is true. The p-value assumes the null is true and tells you how surprising your data is under that assumption. It says nothing about the probability of the hypothesis itself. That requires Bayesian analysis with a prior distribution.

Misconception 2: A p-value of 0.03 means there is a 97% chance the variant is better. This confuses P(data | null) with P(null | data). The p-value is the former. The latter requires Bayes' theorem and depends on your prior beliefs about the effect size.

Misconception 3: A non-significant result means there is no effect. Absence of evidence is not evidence of absence. A high p-value may simply mean your sample size was too small to detect the effect. Check your statistical power before concluding "no effect."

Misconception 4: A smaller p-value means a larger effect. P-values conflate effect size with sample size. A tiny, meaningless difference can produce a very small p-value with enough data. Always report confidence intervals alongside p-values to communicate effect magnitude.

Misconception 5: You can set your significance threshold after seeing the results. This is called post-hoc threshold adjustment and invalidates the entire frequentist framework. The threshold must be set before collecting data. Choosing alpha = 0.06 because your p-value was 0.058 is not valid statistics.

One-Tailed vs Two-Tailed Tests

A two-tailed test evaluates whether variant B is different from control A in either direction — it could be better or worse. The rejection region is split between both tails of the distribution. A one-tailed test evaluates only one direction — for example, whether B is strictly better than A.

Two-tailed tests are the default in most A/B testing scenarios because you need to know if your change harmed performance. One-tailed tests are appropriate only when a negative result would not change your decision. For example, if you will either ship variant B or keep A, and there is no "B is worse" scenario that matters, a one-tailed test is defensible. When the observed effect is in the hypothesized direction, the one-tailed p-value is exactly half the two-tailed p-value, making it easier to reach significance — which is precisely why you should be cautious about using it.
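The halving relationship is easy to see numerically. A short sketch (using the exact normal CDF via the error function):

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

z = 1.75  # example Z-score favoring the variant
p_one = 1.0 - phi(z)                # one-tailed: H1 is "B is better"
p_two = 2.0 * (1.0 - phi(abs(z)))  # two-tailed: H1 is "B is different"
# p_two is exactly twice p_one when Z lies in the hypothesized direction
```

With Z = 1.75, the one-tailed p is about 0.040 (significant at alpha = 0.05) while the two-tailed p is about 0.080 (not significant), which is exactly why the choice of test must be made before looking at the data.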

Frequently Asked Questions

What does p-value mean in A/B testing?

The p-value is the probability of observing a difference at least as large as your actual result, assuming there is no real difference between variants. A p-value of 0.03 means there is a 3% probability of seeing a result this extreme if both variants have identical true conversion rates. It is NOT the probability that the variant is better — for that, use Bayesian analysis.

What is a good p-value for an A/B test?

The standard threshold is 0.05 (5%). More conservative teams use 0.01 for high-stakes decisions. Some teams use 0.10 for exploratory tests. The key is to set your threshold before running the test, not after seeing results.

What is the difference between one-tailed and two-tailed p-values?

A two-tailed test checks for differences in either direction (better or worse). A one-tailed test checks only one direction. When the observed effect is in the hypothesized direction, the two-tailed p-value is double the one-tailed value. Use two-tailed by default unless a negative result would not change your decision.

Why is my p-value not significant even though the conversion rates look different?

Visual differences can be statistically insignificant with small sample sizes. A difference between 5% and 8% with 100 visitors per variant has p of roughly 0.39. The same difference with 10,000 visitors has p < 0.001. Use the sample size calculator to plan your test properly.
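These numbers can be reproduced with a short script (the helper name is illustrative, not ABWex's API):

```python
import math

def two_tailed_p(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-proportion Z-test p-value (pooled variance, two-tailed)."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))

small = two_tailed_p(5, 100, 8, 100)          # 5% vs 8%, 100 visitors each
large = two_tailed_p(500, 10000, 800, 10000)  # same rates, 10,000 each
```

The identical 3-point gap is far from significance at n = 100 but overwhelming at n = 10,000.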

Can a p-value tell me the probability that B is better than A?

No. This is the most common p-value misconception. A p-value of 0.03 does NOT mean 97% probability that B is better. To get the actual probability that B beats A, use Bayesian analysis with ABWex's Bayesian mode, which directly computes P(B > A).
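The quantity P(B > A) can be estimated generically with Monte Carlo sampling from Beta posteriors. This is a sketch of the general technique under uniform Beta(1, 1) priors, not ABWex's implementation:

```python
import random

def prob_b_beats_a(x_a: int, n_a: int, x_b: int, n_b: int,
                   draws: int = 200_000, seed: int = 42) -> float:
    """Monte Carlo estimate of P(B > A) under independent
    Beta(1 + conversions, 1 + non-conversions) posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + x_a, 1 + n_a - x_a)
        rate_b = rng.betavariate(1 + x_b, 1 + n_b - x_b)
        wins += rate_b > rate_a
    return wins / draws
```

With 50/1,000 conversions for A and 70/1,000 for B, this returns roughly 0.97 — a direct probability statement that no p-value can provide.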

About the Author

Built by Michael Lip — Solo developer with 10+ years experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ Chrome extensions and the Zovo developer tools collection.