Statistical Significance Engine

Validate experiments. Ship with confidence.

Two-proportion Z-test with Wilson confidence intervals, power analysis, Monte Carlo simulation, and multi-variant Bonferroni correction - all running locally in your browser.

100% Client-side
Wilson CIs
5,000 Trials
Excel Export

Common questions

Everything you need to know about A/B testing statistics and how this calculator works.

What is A/B testing?

A complete primer on randomized experimentation - what A/B tests measure, when to use them, and how the statistics actually work.

Definition

The core idea

A/B testing - also called split testing or randomized controlled experimentation - measures the causal impact of a change by simultaneously showing two (or more) versions to randomly-assigned users. One group sees the original (control); other groups see variants.

Because users are split randomly and concurrently, every other factor - seasonality, marketing pulses, weather, weekday, competitor activity - averages out across both groups. The only systematic difference is the variant itself, so any performance gap larger than chance variation can be attributed to the change.

This calculator answers two questions: did the variant produce a different conversion rate than the control? (a statistical question - Z-test and p-value), and is that difference big enough to act on? (a practical question - effect size, lift, and sample size adequacy).

When to use

Right tool, right job

A/B testing is the gold standard whenever you can randomize traffic between concurrent versions. Common applications:

  • Landing page changes - headlines, CTAs, hero imagery
  • Pricing experiments - when feature-flagged at user level
  • Checkout flow variants - single-step vs multi-step
  • Email subject lines - open-rate optimization
  • Onboarding flows - different first-run experiences
  • Search ranking algorithms - relevance experiments

If you cannot randomize traffic - site-wide redesigns, brand campaigns, infrastructure changes - use the Pre/Post Impact Analyzer instead. It's the right tool when randomization isn't possible, even though it provides directional rather than causal evidence.

Formulas behind the numbers

Every statistic in this calculator is computed from the formulas below - fully open, fully transparent, no black boxes.

Test Statistic

Two-proportion Z-test

The Z-test compares conversion rates between control (rate p₁) and variant (rate p₂):

Pooled rate: p̂ = (c₁ + c₂) / (n₁ + n₂)
Pooled standard error: SE = √[ p̂·(1−p̂) · (1/n₁ + 1/n₂) ]
Z-statistic: Z = (p₂ − p₁) / SE
Two-sided p-value: p = 2 · (1 − Φ(|Z|))

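Where c₁ and c₂ are the conversion counts, n₁ and n₂ are the visitor counts, and Φ is the standard normal CDF. A p-value below your chosen alpha (typically 0.05) means a difference this large would be unlikely if the two rates were truly equal, so you reject the hypothesis of no difference.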
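The four formulas above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the calculator's own source; the function name `two_proportion_z_test` is an assumption made for the example, and the inputs are the checkout-button figures used later on this page.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(c1, n1, c2, n2):
    """Return (z, two-sided p-value) for c conversions out of n visitors per group."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)                      # pooled rate
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # pooled standard error
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the Z-test is undefined")
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p-value
    return z, p_value


# Control: 450 of 10,000 convert. Variant: 520 of 10,000 convert.
z, p = two_proportion_z_test(450, 10_000, 520, 10_000)
print(f"Z = {z:.3f}, p = {p:.4f}")
```

A positive Z means the variant's observed rate is higher than the control's; the sign carries the direction, the p-value only the strength of evidence.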

Confidence Intervals

Wilson score interval

For each individual rate, this calculator uses the Wilson score interval - more accurate than the standard Wald interval, especially for small samples or extreme rates near 0% or 100%:

Wilson 95% CI: ( p̂ + z²/(2n) ± z·√[ p̂(1−p̂)/n + z²/(4n²) ] ) / (1 + z²/n)

Where p̂ is the group's observed conversion rate, n is its sample size, and z is the critical value for your confidence level (1.96 for 95%, 2.576 for 99%). Wilson intervals never produce nonsensical bounds outside [0, 1] and remain reliable even with as few as 5–10 conversions.

The Wald interval (textbook standard) can give intervals like −2% to 8% for low-traffic variants, which is mathematically meaningless. Wilson fixes this.
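To make the contrast concrete, here is a small sketch that computes both intervals side by side. The function names are illustrative assumptions, and the 3-conversions-in-100-visits input is a made-up low-traffic case chosen to show the Wald lower bound dipping below zero.

```python
from math import sqrt
from statistics import NormalDist


def critical_z(confidence):
    """Two-sided critical value, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def wilson_interval(conversions, n, confidence=0.95):
    z = critical_z(confidence)
    p_hat = conversions / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def wald_interval(conversions, n, confidence=0.95):
    z = critical_z(confidence)
    p_hat = conversions / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


# Hypothetical low-traffic variant: 3 conversions in 100 visits.
print("Wald:  ", wald_interval(3, 100))    # lower bound falls below 0
print("Wilson:", wilson_interval(3, 100))  # stays inside [0, 1]
```

With large samples the two intervals nearly coincide; the difference matters mainly for small n or rates close to 0% or 100%.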

Statistical Power

Power analysis

Power is the probability of correctly detecting a real effect when one exists. Industry standard: target 80% power. Below that, you risk false negatives (Type II errors) - concluding "no effect" when the variant actually works.

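Achieved power: 1 − β = Φ(|p₂ − p₁|/SE − z_{α/2}) + Φ(−|p₂ − p₁|/SE − z_{α/2})
Required sample size (n per variant): n = ( z_{α/2}·√[2·p₁(1−p₁)] + z_β·√[p₁(1−p₁) + p₂(1−p₂)] )² / (p₂ − p₁)²

Here z_{α/2} and z_β are the standard normal critical values for the significance level and the target power (1.96 and 0.84 for α = 0.05 and 80% power).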

Smaller effects (smaller MDE) require dramatically larger samples, as do lower baseline rates when the target lift is relative. Detecting a 1-percentage-point absolute lift on a 5% baseline needs roughly 4× the traffic of detecting a 2-point lift on the same baseline.
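The power and sample-size calculations can be sketched as follows. Two assumptions are made here and should be read as such: the SE in the power formula is taken to be the unpooled standard error of the difference, and the sample-size formula is used in its standard form with (p₂ − p₁)² in the denominator. Function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def achieved_power(p1, n1, p2, n2, alpha=0.05):
    # Assumption: unpooled standard error of the difference in rates.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / se
    return NORMAL.cdf(shift - z_crit) + NORMAL.cdf(-shift - z_crit)


def required_n_per_variant(p1, p2, alpha=0.05, power=0.80):
    z_alpha = NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = NORMAL.inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p1 * (1 - p1))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# 5% baseline: compare a 1-point lift with a 2-point lift.
n_one_point = required_n_per_variant(0.05, 0.06)
n_two_point = required_n_per_variant(0.05, 0.07)
print(n_one_point, n_two_point, round(n_one_point / n_two_point, 1))
```

Because the denominator is the squared difference, halving the detectable effect roughly quadruples the required sample, which is where the 4× rule of thumb comes from.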

Multiple Comparisons

Bonferroni correction

When you test multiple variants against control simultaneously, you multiply your false-positive risk. Three variants at 95% confidence give roughly a 14% chance of at least one false positive - not 5%.

Adjusted alpha: αadj = α / k

Where k is the number of variants compared to control. This calculator applies Bonferroni automatically when you select A/B/C (k=2) or A/B/C/D (k=3). The threshold is divided by k, making each individual test stricter so the overall family-wise error rate stays at or below α.

Bonferroni is conservative - it's the safest correction for production decisions. More powerful methods exist (Holm-Bonferroni, Benjamini-Hochberg) but require careful interpretation; we default to the simple, safe choice.
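The arithmetic behind the 14% figure and the adjusted threshold is short enough to check by hand or in code. The sketch below assumes independent comparisons for the family-wise error rate; the function names and the three p-values are illustrative, not output from the calculator.

```python
def family_wise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni(p_values, alpha=0.05):
    """Return the adjusted alpha and a per-variant significance verdict."""
    k = len(p_values)
    alpha_adj = alpha / k
    verdicts = {name: p < alpha_adj for name, p in p_values.items()}
    return alpha_adj, verdicts


print(f"Uncorrected FWER for 3 variants: {family_wise_error_rate(0.05, 3):.3f}")

# Illustrative p-values for three variants, each compared against control.
alpha_adj, verdicts = bonferroni({"B": 0.04, "C": 0.008, "D": 0.31})
print(f"Adjusted alpha: {alpha_adj:.4f}")
print(verdicts)
```

A variant with p = 0.04 would pass an uncorrected 0.05 threshold but fails the adjusted one, which is exactly the case the correction is designed to catch.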

Sample Ratio Mismatch

SRM detection

If you intended a 50/50 split but observed 55/45, randomization is broken - and any p-value you compute is unreliable. The Sample Ratio Mismatch (SRM) check catches this:

Chi-square statistic: χ² = Σ (observed − expected)² / expected

If the χ² p-value is below 0.01, traffic split is significantly off - typically caused by bot filtering, redirect bugs, sticky sessions, or feature-flag leaks. SRM invalidates the experiment until fixed.

This is a non-negotiable prerequisite. A "winning" test with an SRM warning is not a winner - it's a broken experiment with an arbitrary number on it.
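For the two-group case, the SRM check fits in a few lines. This sketch relies on the fact that a chi-square statistic with one degree of freedom has survival function erfc(√(χ²/2)), so no statistics library is needed; it does not cover three or more groups. The function name and the visitor counts are hypothetical.

```python
from math import erfc, sqrt


def srm_check(n_control, n_variant, expected_share_control=0.5, threshold=0.01):
    """Return (chi2, p_value, flagged) for a two-group traffic split."""
    total = n_control + n_variant
    expected = (total * expected_share_control,
                total * (1 - expected_share_control))
    observed = (n_control, n_variant)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two groups -> 1 degree of freedom, where P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# Hypothetical counts: a 50/50 split was intended.
print(srm_check(5_000, 4_500))  # large imbalance
print(srm_check(5_000, 4_950))  # ordinary sampling noise
```

Note that the check looks at visitor counts only, never at conversions, so it can run before any outcome data is examined.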

Monte Carlo Simulation

Empirical replication

Beyond the analytical p-value, this tool runs 5,000 Monte Carlo simulations using your observed rates as the underlying truth. Each trial draws fresh binomial samples and re-runs the Z-test:

Per trial: s₁ ~ Binomial(n₁, p₁) ; s₂ ~ Binomial(n₂, p₂)

The output: win rate (% of trials where variant is significantly better), loss rate, inconclusive rate, and a lift percentile distribution (P5, P25, P50, P75, P95).

This gives a tangible answer to "if I rerun this experiment 5,000 times, how often does B actually win?" - much more intuitive than abstract p-values.
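The simulation loop described above might look like the sketch below. It is an approximation of the idea rather than the tool's actual code: the helper names are assumptions, "win" is taken to mean a significant result in the variant's favour, and percentiles use a simple nearest-rank lookup. `random.Random.binomialvariate` only exists from Python 3.12, so the sketch falls back to summing Bernoulli draws on older versions.

```python
import random
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def draw_binomial(rng, n, p):
    # binomialvariate was added in Python 3.12; otherwise sum Bernoulli draws.
    if hasattr(rng, "binomialvariate"):
        return rng.binomialvariate(n, p)
    return sum(rng.random() < p for _ in range(n))


def z_test(c1, n1, c2, n2):
    pooled = (c1 + c2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0
    z = (c2 / n2 - c1 / n1) / se
    return z, 2 * (1 - NORMAL.cdf(abs(z)))


def simulate(c1, n1, c2, n2, trials=5000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    p1, p2 = c1 / n1, c2 / n2  # observed rates treated as the truth
    wins = losses = 0
    lifts = []
    for _ in range(trials):
        s1 = draw_binomial(rng, n1, p1)
        s2 = draw_binomial(rng, n2, p2)
        z, p = z_test(s1, n1, s2, n2)
        if p < alpha:
            if z > 0:
                wins += 1
            else:
                losses += 1
        if s1 > 0:  # relative lift is undefined when control has no conversions
            lifts.append((s2 / n2) / (s1 / n1) - 1)
    lifts.sort()

    def percentile(q):  # nearest-rank percentile of the simulated lifts
        return lifts[min(len(lifts) - 1, int(q / 100 * len(lifts)))]

    return {
        "win_rate": wins / trials,
        "loss_rate": losses / trials,
        "inconclusive_rate": (trials - wins - losses) / trials,
        "lift_percentiles": {q: percentile(q) for q in (5, 25, 50, 75, 95)},
    }


print(simulate(450, 10_000, 520, 10_000, trials=500))
```

Because the observed rates are treated as the true rates, the win rate is best read as "how often a rerun would reach significance if the observed effect were real", not as the probability that the variant is better.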

A/B testing vs pre/post analysis

Two methods, two purposes. Choose based on what you can randomize.

Dimension | A/B Test | Pre/Post Analysis
Comparison structure | Two concurrent random groups | Two sequential time periods
Causal claim | Strong (causal inference) | Directional only
Confounding risk | Low (randomization controls for it) | High (seasonality, externals)
Best for | Component changes, copy variants, feature flags | Site-wide changes, brand campaigns
Required setup | Traffic-splitting infrastructure | Historical + post data
Sample size | Plan ahead with power analysis | Whatever traffic you had
Tools at Datapad | This page | Pre/Post Analyzer

Real-world A/B test examples

How teams actually apply A/B testing to ship winners with confidence.

E-commerce

Checkout button copy

An e-commerce store tests "Buy Now" (control, 10,000 visitors, 450 sales = 4.5%) against "Add to Bag" (variant, 10,000 visitors, 520 sales = 5.2%).

Plugged into this calculator: relative lift = +15.6%, p-value = 0.021, power = 63%, Wilson CI for variant: [4.78%, 5.65%].

Verdict: Significant at 95%, but power is below 80% - the test would benefit from another week of traffic to be conclusive. Wait before shipping.

SaaS

Pricing page redesign

A SaaS team tests two pricing layouts. Control: 8,200 trial signups, 412 paid (5.0%). Variant: 8,150 signups, 530 paid (6.5%).

Relative lift = +29.4%, p < 0.001, power = 98%, Cohen's effect-size diagnostics confirm a meaningful change.

Action: Ship the variant. The high power and tight confidence interval ([5.99%, 7.06%]) make this a clean win. Document for the experimentation log.

Multi-variant

A/B/C/D landing-page test

A growth team tests 3 hero headlines (B, C, D) against control (A). Each variant gets 10,000 visitors. With 3 comparisons, this calculator applies Bonferroni: αadj = 0.05 / 3 = 0.0167.

Result: B p = 0.04 (not significant after Bonferroni), C p = 0.008 (winner), D p = 0.31 (no effect).

Lesson: Without Bonferroni, B would have looked like a winner - but the family-wise error rate would be inflated. Always correct for multiple comparisons in multi-variant tests.

Underpowered

Low-traffic test

A blog tests two newsletter signup placements. Control: 1,200 visits, 36 signups (3.0%). Variant: 1,180 visits, 47 signups (4.0%).

Relative lift = +33.3%, but p = 0.18 - not significant. Power: only 28%. Sample size needed for 80% power: ~7,300 per variant.

Action: Don't conclude "no difference." Run the test 5–6× longer, or accept that the effect (if real) is too small to detect at current traffic levels.

Common A/B testing mistakes

The traps that turn legitimate experiments into false wins and shipped regressions.

Mistake #1

Peeking and stopping early

Checking results daily and stopping the moment p < 0.05 inflates the false-positive rate dramatically - well past the 5% you think you're getting. With 10 evenly spaced checks, the true Type I error is roughly 19%, and it keeps climbing with every additional look.

Fix: Pre-commit a sample size based on power analysis, then run until you hit it. If you must peek, use sequential testing methods (mSPRT, group-sequential designs) instead of fixed-horizon Z-tests.
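The inflation from peeking is easy to see in an A/A simulation, where both groups share the same true rate and every "significant" result is by construction a false positive. The sketch below is illustrative only; the 10% conversion rate, 300 visitors per group per day, and the function names are all assumptions picked for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def draw_binomial(rng, n, p):
    # binomialvariate was added in Python 3.12; otherwise sum Bernoulli draws.
    if hasattr(rng, "binomialvariate"):
        return rng.binomialvariate(n, p)
    return sum(rng.random() < p for _ in range(n))


def p_value(c1, n1, c2, n2):
    pooled = (c1 + c2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (c2 / n2 - c1 / n1) / se
    return 2 * (1 - NORMAL.cdf(abs(z)))


def false_positive_rates(rate=0.10, daily_n=300, days=10,
                         sims=1000, alpha=0.05, seed=0):
    """Return (rate with daily peeking, rate with one final check)."""
    rng = random.Random(seed)
    hit_when_peeking = hit_at_end = 0
    for _ in range(sims):
        c1 = c2 = n = 0
        crossed = False
        for _day in range(days):
            c1 += draw_binomial(rng, daily_n, rate)  # both groups share one rate
            c2 += draw_binomial(rng, daily_n, rate)
            n += daily_n
            p = p_value(c1, n, c2, n)
            if p < alpha:
                crossed = True  # a peeker would have stopped and declared a win
        hit_when_peeking += crossed
        hit_at_end += p < alpha  # p now holds the final-day value
    return hit_when_peeking / sims, hit_at_end / sims


peeking, fixed_horizon = false_positive_rates()
print(f"daily peeking: {peeking:.1%}, single final check: {fixed_horizon:.1%}")
```

The single final check should land near the nominal 5%, while the peeking rate comes out several times higher even though nothing about the two groups differs.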

Mistake #2

Running too short

Conversion behavior varies dramatically by day-of-week and time-of-month. A 3-day test in the middle of a week misses weekend visitors entirely.

Fix: Run for at least one full business cycle (typically 7-14 days) regardless of when you hit your sample size target. Longer for B2B or high-consideration purchases where decision cycles span weeks.

Mistake #3

Ignoring SRM warnings

Sample Ratio Mismatch means your randomization is broken. Common causes: bot filtering hits one variant more, redirect logic breaks for one cohort, feature flags leak across sessions, caching layer misroutes.

Fix: Treat any SRM p-value below 0.01 as a hard stop. Investigate the pipeline, fix the bug, restart the experiment. A "win" with an SRM warning is not a win.

Mistake #4

HARKing and post-hoc segments

"Hypothesizing After Results are Known" - slicing data by browser, country, device, plan tier, and finding "significant wins" in some segment after the overall test failed. Each slice is a new test; significance becomes a coincidence game.

Fix: Pre-register your primary metric and segments. Treat post-hoc findings as hypotheses for the next experiment, never as conclusions from the current one.

Mistake #5

Underpowered tests

Running a test with 30% power means you have a 70% chance of missing a real effect. You'll often conclude "no difference" when the variant actually works - and discard a real win.

Fix: Use the Power Analysis tab to size your test before you launch. If your traffic can't support 80% power at your minimum interesting effect, the experiment isn't worth running - pick a higher-traffic surface or a more dramatic intervention.

Mistake #6

Confusing significance with importance

With millions of users, almost any tiny change becomes "statistically significant." A 0.05-percentage-point lift on a 5% baseline with 10M users per group gives p < 0.001 but little real-world impact. Significance is necessary but not sufficient.

Fix: Always evaluate three things together - p-value (is it real?), effect size (how big?), and business impact (worth shipping?). All three should support the decision.
