Every statistic in this calculator is computed from the formulas below - fully open, fully transparent, no black boxes.
Test Statistic
Two-proportion Z-test
The Z-test compares conversion rates between control (rate p₁ = c₁/n₁) and variant (rate p₂ = c₂/n₂), where c is the number of conversions and n the number of visitors in each arm:
Pooled rate: p̂ = (c₁ + c₂) / (n₁ + n₂)
Pooled standard error: SE = √[ p̂·(1−p̂) · (1/n₁ + 1/n₂) ]
Z-statistic: Z = (p₂ − p₁) / SE
Two-sided p-value: p = 2 · (1 − Φ(|Z|))
Where Φ is the standard normal CDF. A p-value below your chosen alpha (typically 0.05) means a difference this large would be unlikely if both variants truly converted at the same rate - evidence that your variant performs differently, not proof of it.
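The four formulas above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not this calculator's actual source; the function name `two_proportion_z_test` and the example counts are made up for the sketch.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(c1, n1, c2, n2):
    """Return (z, two-sided p-value) for control c1/n1 vs variant c2/n2.

    Hypothetical helper for illustration. Assumes the pooled rate is
    strictly between 0 and 1, otherwise the standard error is zero.
    """
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)                      # pooled rate
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # pooled SE
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided
    return z, p_value


# Example with invented counts: 5.0% control vs 5.8% variant.
z, p = two_proportion_z_test(c1=500, n1=10_000, c2=580, n2=10_000)
print(f"Z = {z:.3f}, p = {p:.4f}")
```

With these invented counts the p-value lands below 0.05, so the difference would be flagged as significant at the usual threshold.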
Confidence Intervals
Wilson score interval
For each individual rate, this calculator uses the Wilson score interval - more accurate than the standard Wald interval, especially for small samples or extreme rates near 0% or 100%:
Wilson 95% CI: (p̂ + z²/(2n) ± z·√[p̂(1−p̂)/n + z²/(4n²)]) / (1 + z²/n)
Where z is the critical value for your confidence level (1.96 for 95%, 2.576 for 99%). Wilson intervals never produce nonsensical bounds outside [0, 1] and remain reliable even with as few as 5–10 conversions.
The Wald interval (textbook standard) can give intervals like −2% to 8% for low-traffic variants, which is mathematically meaningless. Wilson fixes this.
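To make the Wilson versus Wald contrast concrete, here is a small sketch that computes both intervals for the same data. The function names and the example (1 conversion out of 30 visitors) are illustrative assumptions, not taken from this calculator.

```python
from math import sqrt
from statistics import NormalDist


def critical_z(confidence=0.95):
    """Two-sided critical value, e.g. about 1.96 for 95%."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def wilson_interval(conversions, n, confidence=0.95):
    """Wilson score interval for a single conversion rate."""
    z = critical_z(confidence)
    p_hat = conversions / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # The clamp only guards against floating-point rounding at 0 or n conversions.
    return max(0.0, centre - half), min(1.0, centre + half)


def wald_interval(conversions, n, confidence=0.95):
    """Textbook Wald interval, shown only for comparison."""
    z = critical_z(confidence)
    p_hat = conversions / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


# Invented low-traffic example: 1 conversion out of 30 visitors.
print("Wald:  ", wald_interval(1, 30))    # lower bound falls below zero
print("Wilson:", wilson_interval(1, 30))  # both bounds stay inside [0, 1]
```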
Statistical Power
Power analysis
Power is the probability of correctly detecting a real effect when one exists. Industry standard: target 80% power. Below that, you risk false negatives (Type II errors) - concluding "no effect" when the variant actually works.
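Achieved power: 1 − β = Φ(|p₂−p₁|/SE − z_{α/2}) + Φ(−|p₂−p₁|/SE − z_{α/2})
Required sample size (n per variant): n = (z_{α/2}·√[2·p₁(1−p₁)] + z_β·√[p₁(1−p₁) + p₂(1−p₂)])² / (p₂−p₁)²
Where z_{α/2} and z_β are the standard normal critical values for the significance level and the target power (about 1.96 and 0.84 for α = 0.05 and 80% power).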
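Smaller effects (a smaller minimum detectable effect, or MDE) require dramatically larger samples, as do lower baseline rates for a given relative lift. Detecting a 1-percentage-point absolute lift on a 5% baseline needs roughly 4× the traffic of detecting a 2-point lift on the same baseline, because n scales with 1/(p₂−p₁)².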
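Both power formulas can be sketched as follows. This is an illustration under stated assumptions rather than the calculator's code: the function names are hypothetical, and `achieved_power` assumes SE is the pooled standard error from the Test Statistic section, computed from the two rates and sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

N = NormalDist()  # standard normal


def required_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect p1 -> p2 (two-sided test)."""
    z_alpha = N.inv_cdf(1 - alpha / 2)
    z_beta = N.inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p1 * (1 - p1))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def achieved_power(p1, p2, n1, n2, alpha=0.05):
    """Power of the two-sided Z-test, assuming the pooled SE (an assumption)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z_alpha = N.inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / se
    return N.cdf(shift - z_alpha) + N.cdf(-shift - z_alpha)


n_small = required_sample_size(0.05, 0.06)  # 1-point lift on a 5% baseline
n_large = required_sample_size(0.05, 0.07)  # 2-point lift on a 5% baseline
print(n_small, n_large, round(n_small / n_large, 2))
print(round(achieved_power(0.05, 0.06, n_small, n_small), 3))
```

The two formulas use slightly different variance terms, so the power computed at the "required" sample size comes out close to, but not exactly, the 80% target.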
Multiple Comparisons
Bonferroni correction
When you test multiple variants against control simultaneously, your false-positive risk compounds. Three variants at 95% confidence give roughly a 14% chance of at least one false positive (1 − 0.95³ ≈ 14.3%) - not 5%.
Adjusted alpha: α_adj = α / k
Where k is the number of variants compared to control. This calculator applies Bonferroni automatically when you select A/B/C (k=2) or A/B/C/D (k=3). The threshold is divided by k, making each individual test stricter so the overall family-wise error rate stays at or below α.
Bonferroni is conservative - it's the safest correction for production decisions. More powerful methods exist (Holm-Bonferroni, Benjamini-Hochberg) but require careful interpretation; we default to the simple, safe choice.
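The arithmetic behind the 14% figure and the correction is short enough to check directly. The sketch below treats the k comparisons as independent, which is a simplifying assumption (comparisons that share one control are correlated); the function names are illustrative.

```python
def family_wise_error(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-comparison threshold after Bonferroni correction."""
    return alpha / k


alpha, k = 0.05, 3
print(round(family_wise_error(alpha, k), 4))   # uncorrected: about 0.1426
adjusted = bonferroni_alpha(alpha, k)
print(round(adjusted, 4))                      # per-test threshold
print(family_wise_error(adjusted, k) <= alpha)  # corrected rate stays within alpha
```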
Sample Ratio Mismatch
SRM detection
If you intended a 50/50 split but observed something like 55/45 on a meaningful amount of traffic, randomization is almost certainly broken - and any p-value you compute is unreliable. The Sample Ratio Mismatch (SRM) check catches this:
Chi-square statistic: χ² = Σ (observed − expected)² / expected
If the χ² p-value is below 0.01, traffic split is significantly off - typically caused by bot filtering, redirect bugs, sticky sessions, or feature-flag leaks. SRM invalidates the experiment until fixed.
This is a non-negotiable prerequisite. A "winning" test with an SRM warning is not a winner - it's a broken experiment with an arbitrary number on it.
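For the two-arm case, the SRM check can be sketched with the standard library alone. With two groups the χ² statistic has one degree of freedom, so its p-value can be written with the complementary error function. The function name, parameters, and visitor counts below are assumptions made for the example.

```python
from math import erfc, sqrt


def srm_check(n_control, n_variant, expected_control_share=0.5, threshold=0.01):
    """Return (chi2, p_value, srm_detected) for a two-arm traffic split.

    Hypothetical helper: covers only two arms (1 degree of freedom).
    """
    total = n_control + n_variant
    observed = [n_control, n_variant]
    expected = [total * expected_control_share,
                total * (1 - expected_control_share)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# Invented counts: a small wobble vs a clearly skewed split.
print(srm_check(5_050, 4_950))  # small imbalance, not flagged
print(srm_check(5_500, 4_500))  # 55/45 on 10,000 visitors, flagged
```

Experiments with more than two arms need the χ² distribution with (arms − 1) degrees of freedom, which this sketch does not cover.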
Monte Carlo Simulation
Empirical replication
Beyond the analytical p-value, this tool runs 5,000 Monte Carlo simulations using your observed rates as the underlying truth. Each trial draws fresh binomial samples and re-runs the Z-test:
Per trial: s₁ ~ Binomial(n₁, p₁) ; s₂ ~ Binomial(n₂, p₂)
The output: win rate (% of trials where variant is significantly better), loss rate, inconclusive rate, and a lift percentile distribution (P5, P25, P50, P75, P95).
This gives a tangible answer to "if I rerun this experiment 5,000 times, how often does B actually win?" - much more intuitive than abstract p-values.
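A minimal sketch of this kind of simulation is shown below. It is an illustration, not the tool's implementation: the function names, the seed, the definition of lift as relative lift (r₂ − r₁)/r₁, and the nearest-rank percentile rule are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def draw_binomial(rng, n, p):
    """One Binomial(n, p) draw; uses the built-in sampler on Python 3.12+."""
    if hasattr(rng, "binomialvariate"):
        return rng.binomialvariate(n, p)
    return sum(rng.random() < p for _ in range(n))  # slower fallback


def simulate(n1, p1, n2, p2, trials=5000, alpha=0.05, seed=0):
    """Re-run the two-proportion Z-test on simulated data, `trials` times."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    wins = losses = 0
    lifts = []
    for _ in range(trials):
        s1, s2 = draw_binomial(rng, n1, p1), draw_binomial(rng, n2, p2)
        r1, r2 = s1 / n1, s2 / n2
        pooled = (s1 + s2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (r2 - r1) / se if se > 0 else 0.0  # degenerate trial: inconclusive
        if z > z_crit:
            wins += 1
        elif z < -z_crit:
            losses += 1
        if r1 > 0:
            lifts.append((r2 - r1) / r1)  # relative lift (an assumption)
    lifts.sort()

    def percentile(q):  # nearest-rank percentile on the sorted lifts
        if not lifts:
            return None
        return lifts[min(len(lifts) - 1, int(q / 100 * len(lifts)))]

    return {
        "win_rate": wins / trials,
        "loss_rate": losses / trials,
        "inconclusive_rate": (trials - wins - losses) / trials,
        "lift_percentiles": {f"P{q}": percentile(q) for q in (5, 25, 50, 75, 95)},
    }


# Invented inputs: 5% vs 6% observed rates, 2,000 visitors per arm.
print(simulate(n1=2_000, p1=0.05, n2=2_000, p2=0.06, trials=1_000))
```

Because the observed rates are treated as the truth, the win rate here behaves like an estimate of the test's power at those rates.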