Chi-Square Test Calculator
Enter observed frequencies in the table below. The calculator computes expected frequencies, the chi-square statistic, degrees of freedom, and the p-value. Adjust the table size from 2×2 to 5×5. All computation runs in your browser; no data is sent anywhere.
Observed Frequency Table
How the Chi-Square Test of Independence Works
The chi-square test of independence determines whether two categorical variables are associated or independent. The null hypothesis states that the variables are independent — knowing the value of one variable tells you nothing about the distribution of the other. The alternative hypothesis says an association exists. The test statistic measures the discrepancy between what you observed and what you would expect if independence were true.
For each cell in your contingency table, the expected frequency is computed as: E_ij = (row total_i × column total_j) / grand total. This is the count you would expect if the row and column variables were perfectly independent. The chi-square statistic sums the squared standardized deviations across all cells: χ² = ∑ (O_ij − E_ij)² / E_ij. Large values of χ² indicate large departures from independence.
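The two formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's own source: the function names and the 2×2 table (two variants, converted vs. not converted) are assumptions made for the example.

```python
def expected_frequencies(observed):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical data: rows are variants A and B, columns are converted / not converted.
observed = [[30, 70], [50, 50]]
expected = expected_frequencies(observed)  # [[40.0, 60.0], [40.0, 60.0]]
chi2 = chi_square_statistic(observed)      # 25/3, roughly 8.33
```

Both rows have the same total here, so both get the same expected counts; the statistic is driven entirely by how far each observed row sits from that shared expectation.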
The degrees of freedom for an r×c contingency table are (r − 1) × (c − 1). For a 2×2 table, df = 1. For a 3×4 table, df = 6. Under the null hypothesis, the test statistic follows a chi-square distribution with these degrees of freedom, and the p-value is the probability of observing a χ² value as large or larger if independence were true.
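As a rough sketch of how the p-value can be obtained without a statistics library, the upper tail of the chi-square distribution has a closed form for integer degrees of freedom. The function names below are hypothetical, and the approach shown (a recurrence on the regularized upper incomplete gamma function) is one possible method, not necessarily the one this calculator uses.

```python
import math


def degrees_of_freedom(rows, cols):
    """df for an r x c contingency table."""
    return (rows - 1) * (cols - 1)


def chi_square_p_value(chi2, df):
    """Upper-tail probability P(X >= chi2) for a chi-square variable with integer df >= 1."""
    if chi2 <= 0:
        return 1.0
    x = chi2 / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)              # Q(1, x) = exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))   # Q(1/2, x) = erfc(sqrt(x))
    # Recurrence: Q(a + 1, x) = Q(a, x) + x^a * exp(-x) / Gamma(a + 1)
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


df = degrees_of_freedom(3, 4)            # 6
p_value = chi_square_p_value(25 / 3, 1)  # p-value for chi2 = 8.33 in a 2x2 table
```

For very large degrees of freedom a library routine would be more robust, but tables up to 5×5 have at most 16 degrees of freedom, where this recurrence is well behaved.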
Interpreting P-Values and Critical Values
The p-value tells you the probability of obtaining your observed data (or more extreme data) under the assumption of independence. A small p-value means the observed pattern is unlikely to arise by chance alone. The most common significance threshold is α = 0.05: if p < 0.05, you reject the null hypothesis of independence and conclude the variables are associated.
The critical value is the threshold your χ² statistic must exceed to reject H₀. For df = 1 at α = 0.05, the critical value is 3.841. For df = 4 at α = 0.05, it is 9.488. Equivalently, if χ² > critical value, then p < α. Both comparisons give identical decisions because they are two ways of expressing the same calculation.
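To make the link between critical values and p-values concrete, the sketch below recovers a critical value by bisection on the upper-tail probability. The helper names are hypothetical and the closed-form tail function is an assumption chosen for the illustration; printed tables or a statistics library give the same numbers.

```python
import math


def chi_square_sf(chi2, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if chi2 <= 0:
        return 1.0
    x = chi2 / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def critical_value(alpha, df, tol=1e-9):
    """Find the chi2 value whose upper-tail probability equals alpha."""
    lo, hi = 0.0, 1.0
    while chi_square_sf(hi, df) > alpha:   # grow the bracket until it contains the answer
        hi *= 2.0
    while hi - lo > tol:                   # bisection: the tail probability decreases in chi2
        mid = (lo + hi) / 2.0
        if chi_square_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


crit_df1 = critical_value(0.05, 1)   # close to 3.841
crit_df4 = critical_value(0.05, 4)   # close to 9.488

# The two decision rules agree: chi2 > critical value exactly when p < alpha.
chi2 = 25 / 3
same_decision = (chi2 > crit_df1) == (chi_square_sf(chi2, 1) < 0.05)
```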
The bar chart visualizes the magnitude of discrepancy for each cell. Cells where observed counts differ most from expected counts, in either direction, drive the chi-square statistic upward. Identifying which cells contribute most to the overall χ² helps you understand the nature of the association: not just whether one exists, but where it is concentrated.
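A per-cell breakdown like the one the chart displays might be computed as follows. The function name, the dictionary keys, and the 2×3 table are assumptions for illustration; the signed residual (O − E) / √E is included because its sign shows whether a cell is above or below expectation.

```python
import math


def cell_contributions(observed):
    """Return each cell's signed residual and its contribution to chi-square."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    cells = []
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            cells.append({
                "row": i,
                "col": j,
                "residual": (o - e) / math.sqrt(e),   # sign gives the direction
                "contribution": (o - e) ** 2 / e,     # this cell's share of chi-square
            })
    return cells


# Hypothetical 2x3 table.
observed = [[20, 30, 50], [30, 30, 40]]
cells = cell_contributions(observed)
chi2 = sum(c["contribution"] for c in cells)
ranked = sorted(cells, key=lambda c: c["contribution"], reverse=True)
```

In this example the middle column matches its expected counts exactly and contributes nothing, so the association is concentrated in the first and third columns.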
Effect Size: Cramér's V
Statistical significance does not measure practical importance. With a large sample, even a tiny association produces a statistically significant result. Cramér's V addresses this by normalizing the chi-square statistic: V = √(χ² / (N × min(r−1, c−1))). V ranges from 0 (no association) to 1 (perfect association). Conventional benchmarks for effect size are: small ≈ 0.10, medium ≈ 0.30, large ≈ 0.50. These thresholds apply when min(r−1, c−1) = 1, as in a 2×2 table; for larger tables the benchmarks are lower, because Cohen's conventions are defined for w = V × √min(r−1, c−1). For example, with min(r−1, c−1) = 2, a medium effect corresponds to V ≈ 0.21.
In A/B testing contexts, always report both the p-value and Cramér's V. A p-value of 0.001 with V = 0.04 means you have detected a statistically reliable but practically negligible association, perhaps because your sample size is enormous. Conversely, V = 0.35 with p = 0.08 suggests a potentially meaningful pattern that your sample was too small to confirm statistically.
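The sample-size effect is easy to see numerically. In the sketch below (function names and both tables are hypothetical), the second table has the same proportions as the first with every count multiplied by 100: χ² grows a hundredfold while Cramér's V does not move.

```python
import math


def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (o - row_totals[i] * col_totals[j] / n) ** 2 / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(observed)
        for j, o in enumerate(row)
    )


def cramers_v(observed):
    """V = sqrt(chi2 / (N * min(r - 1, c - 1)))."""
    n = sum(sum(row) for row in observed)
    k = min(len(observed) - 1, len(observed[0]) - 1)
    return math.sqrt(chi_square_statistic(observed) / (n * k))


small_sample = [[51, 49], [49, 51]]            # N = 200
large_sample = [[5100, 4900], [4900, 5100]]    # same proportions, N = 20,000

chi2_small = chi_square_statistic(small_sample)   # 0.08
chi2_large = chi_square_statistic(large_sample)   # 8.0, above the df = 1 critical value of 3.841
v_small = cramers_v(small_sample)                 # 0.02
v_large = cramers_v(large_sample)                 # 0.02
```

Only the larger sample crosses the 3.841 threshold, yet both tables describe the same very weak association.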
When the Chi-Square Approximation Fails
The chi-square distribution is an approximation that becomes accurate when expected frequencies are not too small. The standard rule of thumb is that all expected cell frequencies should be 5 or greater. When this condition is violated, the chi-square p-value is unreliable — typically anticonservative, meaning it reports more significance than truly exists. This calculator flags cells with expected frequency below 5 in orange to alert you to this issue.
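A check like the orange flag could be sketched as below. The function name, its return shape, and the example table are assumptions; only the threshold of 5 comes from the rule of thumb above.

```python
def low_expected_cells(observed, threshold=5.0):
    """Return (row, col, expected) for every cell whose expected count is below threshold."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            e = r * c / n
            if e < threshold:
                flagged.append((i, j, e))
    return flagged


# Hypothetical table with one sparse row: N = 100, but the first row has only 10 observations.
observed = [[3, 7], [12, 78]]
flagged = low_expected_cells(observed)  # [(0, 0, 1.5)]
```

Note that the check uses expected counts, not observed ones: a cell can have a small observed count and still be fine if its row and column totals are large.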
When you have a 2×2 table with small expected frequencies, consider applying Yates' continuity correction, using Fisher's exact test, or collecting more data before running the test. For larger tables, collapsing categories with small counts is often appropriate. The key insight is that the chi-square test requires adequate data in every cell, not just in total — a large N does not protect against small expected frequencies in sparse cells.
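For a 2×2 table, both alternatives are short enough to sketch directly. The function names and the example counts are hypothetical. The two-sided Fisher p-value here uses one common convention (summing the probabilities of all tables no more likely than the observed one); other two-sided definitions exist and can give slightly different values.

```python
import math


def yates_chi_square(a, b, c, d):
    """Chi-square for the 2x2 table [[a, b], [c, d]] with Yates' continuity correction."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    cells = ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1))
    chi2 = 0.0
    for o, i, j in cells:
        e = rows[i] * cols[j] / n
        chi2 += max(abs(o - e) - 0.5, 0.0) ** 2 / e   # shrink each deviation by 0.5
    return chi2


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for [[a, b], [c, d]] with all margins held fixed."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so ties with the observed table are counted despite rounding.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical small table: every expected count is 2.
chi2_yates = yates_chi_square(3, 1, 1, 3)        # 0.5
p_fisher = fisher_exact_two_sided(3, 1, 1, 3)    # 34/70, roughly 0.486
```

The uncorrected χ² for this table would be 2.0, so the correction makes a large difference at this sample size.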
Chi-Square Test vs. Z-Test for A/B Tests
For a standard A/B test comparing two conversion rates, the two-sided two-proportion Z-test (with a pooled standard error) and the chi-square test of independence (without continuity correction) give identical results. The chi-square statistic equals the Z-statistic squared: χ² = Z². The p-values are identical. The Z-test is more commonly reported because it supports directional (one-tailed) hypotheses and more directly expresses the difference between proportions. Use the chi-square test when you have more than two variants, when your outcome variable has more than two categories, or when comparing distributions across multiple groups simultaneously.
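The identity χ² = Z² can be checked numerically. The sketch below assumes a pooled standard error for the Z-test and no continuity correction for the chi-square test; the function names and the conversion counts are hypothetical.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z-statistic for the difference of two proportions, using the pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_a / n_a - conv_b / n_b) / se


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square on the equivalent 2x2 table, without continuity correction."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, n_a + n_b - conv_a - conv_b]
    n = n_a + n_b
    return sum(
        (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(2)
        for j in range(2)
    )


# Hypothetical test: 30/100 conversions for A, 50/100 for B.
z = two_proportion_z(30, 100, 50, 100)
chi2 = chi_square_2x2(30, 100, 50, 100)

p_from_z = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
p_from_chi2 = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail for df = 1
```

The sign of `z` keeps the direction of the difference (negative here, since A converts less than B), which is the information the chi-square statistic discards.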
Frequently Asked Questions
What does it mean when p < 0.05 in a chi-square test?
It means that if the two variables were truly independent, you would observe data at least as extreme as yours less than 5% of the time. You reject the null hypothesis of independence and conclude that an association exists between the row and column variables. The threshold 0.05 is a convention; the right threshold depends on how much false-positive risk you are willing to accept for your specific decision.
How are degrees of freedom calculated?
For an r×c contingency table, degrees of freedom = (r − 1) × (c − 1). A 2×2 table has 1 degree of freedom. A 3×3 table has 4. Degrees of freedom reflect how many cell counts can vary freely once the row and column totals are fixed — the remaining cells are determined by the constraint that row and column sums must match the observed marginals.
Why are some expected frequencies flagged in orange?
The chi-square approximation is unreliable when expected frequencies fall below 5. Small expected frequencies cause the p-value to be inaccurate — typically too small (false significance). When this happens, consider collecting more data, merging sparse categories, or using Fisher's exact test for 2×2 tables.
What is Cramér's V and how large should it be?
Cramér's V measures the strength of association, independent of sample size. It ranges from 0 (no association) to 1 (perfect association). Conventional thresholds for 2×2 tables: small = 0.10, medium = 0.30, large = 0.50. Statistical significance (p < 0.05) only tells you that an association is unlikely to be due to chance alone; Cramér's V tells you how strong it is. Both metrics are needed for a complete interpretation.
When should I use chi-square instead of a Z-test?
Use chi-square when you have more than 2 variants, when your outcome has more than 2 categories, or when comparing distributions across multiple groups simultaneously. For a simple 2-variant, binary-outcome A/B test, both methods give identical p-values. The Z-test is preferred when you need directional (one-tailed) inference or want to directly express the difference between proportions.
Related A/B Testing Tools
- P-Value Calculator — Two-proportion Z-test with significance interpretation
- Sample Size Calculator — Determine required visitors before launching your test
- Bayesian A/B Calculator — Posterior distributions and probability of winning
- Confidence Interval Calculator — Wilson and Clopper-Pearson intervals
- Sequential Testing Tool — Stop A/B tests early with O'Brien-Fleming boundaries
About the Author
Built by Michael Lip — Solo developer with 10+ years experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ Chrome extensions and the Zovo developer tools collection.