What p-value is statistically significant for an A/B test?

At the standard 95% confidence level, a p-value below 0.05 is considered statistically significant. At 99% confidence, you need p < 0.01. The lower the p-value, the stronger the evidence against the null hypothesis (that there is no real difference between variants).

How many visitors do I need for an A/B test?

Required sample size depends on four factors: your baseline conversion rate, the minimum detectable effect (MDE) you want to catch, your desired statistical power (typically 80%), and your significance level (typically α = 0.05, which corresponds to 95% confidence). Smaller effects and lower baseline rates require larger samples. The Power Analysis tab calculates exact requirements for your inputs.

What is statistical power in A/B testing?

Statistical power is the probability of correctly detecting a real effect when one exists. The standard practice is to target 80% power, meaning that if your variant truly differs from control by at least your minimum detectable effect, you have an 80% chance of detecting it at the planned sample size. Lower power increases the risk of false negatives (Type II errors).

What is Bonferroni correction and when is it applied?

Bonferroni correction adjusts the significance threshold to control the family-wise error rate when testing multiple variants simultaneously. In an A/B/C test (two variants compared to control), alpha is divided by 2, the number of comparisons. Our calculator applies it automatically whenever two or more variants are compared to control (A/B/C and A/B/C/D tests).

What is Sample Ratio Mismatch (SRM)?

Sample Ratio Mismatch occurs when the observed traffic split between variants differs significantly from your expected split (e.g., 50/50). It often indicates randomization or tracking problems and can invalidate your results. We run a chi-square test on every analysis to flag SRM issues.

Should I use a one-sided or two-sided test?

Two-sided tests (default and recommended) check whether your variant differs from the control in either direction. One-sided tests only check whether your variant is better. Use one-sided only when negative outcomes are impossible - otherwise you risk missing variants that hurt performance.
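A minimal Python sketch of how the two p-values relate for the same Z-statistic, assuming the normal approximation used elsewhere on this page. The function name `p_values_from_z` is illustrative and not part of the calculator.

```python
from statistics import NormalDist

def p_values_from_z(z: float) -> dict:
    """Return one-sided (variant better) and two-sided p-values for a Z-statistic."""
    phi = NormalDist().cdf
    return {
        "one_sided_better": 1.0 - phi(z),        # H1: variant rate > control rate
        "two_sided": 2.0 * (1.0 - phi(abs(z))),  # H1: rates differ in either direction
    }

print(p_values_from_z(1.8))   # variant looks better
print(p_values_from_z(-2.5))  # variant looks worse
```

For a positive Z the one-sided p-value is half the two-sided one. For a clearly negative Z the one-sided p-value is close to 1, so a variant that hurts performance is never flagged, which is the risk described above.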

What is Monte Carlo simulation in A/B testing?

Monte Carlo simulation runs thousands of synthetic experiments using your observed rates as the underlying truth. It estimates the probability that each variant beats control empirically, complementing the analytical p-value with an intuitive 'how often does B win' answer. Our tool runs 5,000 binomial trials per analysis.

What is the difference between absolute lift and relative lift?

Absolute lift is the percentage-point change between rates (e.g., 2% to 3% is +1 percentage point). Relative lift expresses the same change as a percentage of the original (e.g., 2% to 3% is +50% relative lift). The same change can sound very different depending on which one you report - always clarify.
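The arithmetic can be sketched in a few lines of Python. The helper name `lifts` is hypothetical; the 2% and 3% rates are the ones from the answer above.

```python
def lifts(control_rate: float, variant_rate: float) -> tuple[float, float]:
    """Return (absolute lift in percentage points, relative lift in percent)."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive to compute relative lift")
    absolute_pp = (variant_rate - control_rate) * 100.0
    relative_pct = (variant_rate - control_rate) / control_rate * 100.0
    return absolute_pp, relative_pct

abs_pp, rel_pct = lifts(0.02, 0.03)
print(f"absolute: {abs_pp:+.1f} pp, relative: {rel_pct:+.1f}%")
```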

How long should an A/B test run?

Run an A/B test long enough to reach the sample size required for 80% power at your minimum detectable effect (use the Power Analysis tab to compute this). Also run for at least one full business cycle (typically 7-14 days) to capture weekday/weekend variation. Stopping early - at the first p < 0.05 - leads to false positives.

A/B Test Calculator – Free Statistical Significance Tool

Methodology

What is A/B testing?

A complete primer on randomized experimentation - what A/B tests measure, when to use them, and how the statistics actually work.

Definition

The core idea

A/B testing - also called split testing or randomized controlled experimentation - measures the causal impact of a change by simultaneously showing two (or more) versions to randomly assigned users. One group sees the original (control); other groups see variants.

Because users are split randomly and concurrently, every other factor - seasonality, marketing pulses, weather, weekday, competitor activity - averages out across both groups. The only systematic difference is the variant itself, so any performance gap larger than chance variation can be attributed to the change.

This calculator answers two questions: did the variant produce a different conversion rate than the control? (a statistical question - Z-test and p-value), and is that difference big enough to act on? (a practical question - effect size, lift, and sample size adequacy).

When to use

Right tool, right job

A/B testing is the gold standard whenever you can randomize traffic between concurrent versions. Common applications:

Landing page changes - headlines, CTAs, hero imagery
Pricing experiments - when feature-flagged at user level
Checkout flow variants - single-step vs multi-step
Email subject lines - open-rate optimization
Onboarding flows - different first-run experiences
Search ranking algorithms - relevance experiments

If you cannot randomize traffic - site-wide redesigns, brand campaigns, infrastructure changes - use the Pre/Post Impact Analyzer instead. It's the right tool when randomization isn't possible, even though it provides directional rather than causal evidence.

Statistical Formulas

Formulas behind the numbers

Every statistic in this calculator is computed from the formulas below - fully open, fully transparent, no black boxes.

Test Statistic

Two-proportion Z-test

The Z-test compares conversion rates between control (rate p₁) and variant (rate p₂):

Pooled rate: p̂ = (c₁ + c₂) / (n₁ + n₂)
Pooled standard error: SE = √[ p̂·(1−p̂) · (1/n₁ + 1/n₂) ]
Z-statistic: Z = (p₂ − p₁) / SE
Two-sided p-value: p = 2 · (1 − Φ(|Z|))

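Where c₁, c₂ are the conversions and n₁, n₂ the visitors in control and variant, and Φ is the standard normal CDF. A p-value below your chosen alpha (typically 0.05) means a difference this large would be unlikely if the two versions truly performed the same, so you reject the null hypothesis of no difference.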
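The formulas above can be sketched in Python with only the standard library. This is an illustration of the pooled two-proportion Z-test, not the calculator's actual source; the function name and the example counts (200/5,000 vs 250/5,000) are made up for the sketch.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(c1: int, n1: int, c2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion Z-test. Returns (Z, two-sided p-value)."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)                        # pooled rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # pooled standard error
    if se == 0:
        raise ValueError("standard error is zero (no conversions or all conversions)")
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: control 200/5,000 (4%), variant 250/5,000 (5%)
z, p = two_proportion_z_test(c1=200, n1=5_000, c2=250, n2=5_000)
print(f"Z = {z:.2f}, p = {p:.4f}")
```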

Confidence Intervals

Wilson score interval

For each individual rate, this calculator uses the Wilson score interval - more accurate than the standard Wald interval, especially for small samples or extreme rates near 0% or 100%:

Wilson 95% CI: (p̂ + z²/(2n) ± z·√[p̂(1−p̂)/n + z²/(4n²)]) / (1 + z²/n)

Where z is the critical value for your confidence level (1.96 for 95%, 2.576 for 99%). Wilson intervals never produce nonsensical bounds outside [0, 1] and remain reliable even with as few as 5–10 conversions.

The Wald interval (textbook standard) can give intervals like −2% to 8% for low-traffic variants, which is mathematically meaningless. Wilson fixes this.
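A short Python sketch comparing the two intervals, assuming the Wilson formula given above. The function names and the low-traffic example (1 conversion out of 30 visitors) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(conversions: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a single conversion rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    p_hat = conversions / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def wald_interval(conversions: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Textbook Wald interval, shown only for comparison."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = conversions / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# Hypothetical low-traffic variant: 1 conversion out of 30 visitors
print("Wilson:", wilson_interval(1, 30))
print("Wald:  ", wald_interval(1, 30))
```

With so few conversions the Wald lower bound drops below zero, while the Wilson bounds stay inside [0, 1].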

Statistical Power

Power analysis

Power is the probability of correctly detecting a real effect when one exists. Industry standard: target 80% power. Below that, you risk false negatives (Type II errors) - concluding "no effect" when the variant actually works.

Achieved power: 1 − β = Φ(|p₂−p₁|/SE − z_{α/2}) + Φ(−|p₂−p₁|/SE − z_{α/2})
Required sample size (n per variant): n = (z_{α/2}·√[2·p₁(1−p₁)] + z_β·√[p₁(1−p₁) + p₂(1−p₂)])² / (p₂−p₁)²

Smaller effects (smaller MDE) and lower baseline rates require dramatically larger samples. Detecting a 1 percentage-point absolute lift on a 5% baseline needs roughly 4× the traffic of detecting a 2 percentage-point lift on the same baseline, because the required sample size scales with 1/(p₂−p₁)².
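The two power formulas can be sketched in Python as follows. One assumption is labeled in the code: the page does not say which standard error the achieved-power formula uses, so this sketch uses the unpooled standard error of the difference. Function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()

def achieved_power(p1: float, p2: float, n1: int, n2: int, alpha: float = 0.05) -> float:
    """Two-sided power. Assumption: SE is the unpooled standard error of p2 - p1."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = _NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / se
    return _NORMAL.cdf(shift - z_crit) + _NORMAL.cdf(-shift - z_crit)

def required_n_per_variant(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Required visitors per variant, following the sample-size formula above."""
    z_a = _NORMAL.inv_cdf(1 - alpha / 2)
    z_b = _NORMAL.inv_cdf(power)
    numerator = (z_a * sqrt(2 * p1 * (1 - p1))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

n_1pt = required_n_per_variant(0.05, 0.06)  # +1 percentage point on a 5% baseline
n_2pt = required_n_per_variant(0.05, 0.07)  # +2 percentage points on a 5% baseline
print(n_1pt, n_2pt, round(n_1pt / n_2pt, 2))
print(round(achieved_power(0.05, 0.06, n_1pt, n_1pt), 3))
```

Because the sample-size formula uses the baseline variance in its first term while this power sketch uses the unpooled variance throughout, the achieved power at the computed n lands close to, but not exactly at, the 80% target.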

Multiple Comparisons

Bonferroni correction

When you test multiple variants against control simultaneously, you multiply your false-positive risk. Three variants at 95% confidence give roughly a 14% chance of at least one false positive - not 5%.

Adjusted alpha: α_adj = α / k

Where k is the number of variants compared to control. This calculator applies Bonferroni automatically when you select A/B/C (k=2) or A/B/C/D (k=3). The threshold is divided by k, making each individual test stricter so the overall family-wise error rate stays at α.

Bonferroni is conservative - it's the safest correction for production decisions. More powerful methods exist (Holm-Bonferroni, Benjamini-Hochberg) but require careful interpretation; we default to the simple, safe choice.
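A minimal Python sketch of the correction. The function names are illustrative, and the family-wise error calculation assumes independent comparisons, which is only an approximation when all variants share one control. The example p-values are the ones from the A/B/C/D landing-page example on this page.

```python
def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-comparison threshold for k comparisons against control."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return alpha / k

def familywise_error_rate(alpha: float, k: int) -> float:
    """P(at least one false positive) over k independent tests, each run at alpha."""
    return 1 - (1 - alpha) ** k

def significant_after_bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Flag each variant whose p-value clears the Bonferroni-adjusted threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return {name: p < threshold for name, p in p_values.items()}

print(round(familywise_error_rate(0.05, 3), 3))  # uncorrected risk with three comparisons
print(significant_after_bonferroni({"B": 0.04, "C": 0.008, "D": 0.31}))
```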

Sample Ratio Mismatch

SRM detection

If you intended a 50/50 split but observed 55/45 across thousands of users, randomization is broken - and any p-value you compute is unreliable. The Sample Ratio Mismatch (SRM) check catches this:

Chi-square statistic: χ² = Σ (observed − expected)² / expected

If the χ² p-value is below 0.01, traffic split is significantly off - typically caused by bot filtering, redirect bugs, sticky sessions, or feature-flag leaks. SRM invalidates the experiment until fixed.

This is a non-negotiable prerequisite. A "winning" test with an SRM warning is not a winner - it's a broken experiment with an arbitrary number on it.
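The SRM check can be sketched in Python for the two-arm case. This is a simplified illustration, not the calculator's implementation: it handles exactly two arms (one degree of freedom) so the chi-square p-value can be computed with the standard library alone. The visitor counts are hypothetical.

```python
from math import erfc, sqrt

def srm_check(observed: list[int], expected_ratios: list[float], threshold: float = 0.01):
    """Chi-square goodness-of-fit on the traffic split of a two-arm test.

    Returns (chi2, p_value, srm_flagged).
    """
    if len(observed) != 2 or len(expected_ratios) != 2:
        raise ValueError("this sketch handles exactly two arms")
    total = sum(observed)
    ratio_sum = sum(expected_ratios)
    expected = [total * r / ratio_sum for r in expected_ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(chi2 / 2))  # chi-square survival function for 1 degree of freedom
    return chi2, p_value, p_value < threshold

# Hypothetical visitor counts for an intended 50/50 split
print(srm_check([5_000, 5_100], [0.5, 0.5]))  # small imbalance
print(srm_check([5_500, 4_500], [0.5, 0.5]))  # 55/45 split
```

Tests with more than two arms need the general chi-square distribution with (arms − 1) degrees of freedom, which a statistics library provides.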

Monte Carlo Simulation

Empirical replication

Beyond the analytical p-value, this tool runs 5,000 Monte Carlo simulations using your observed rates as the underlying truth. Each trial draws fresh binomial samples and re-runs the Z-test:

Per trial: s₁ ~ Binomial(n₁, p₁) ; s₂ ~ Binomial(n₂, p₂)

The output: win rate (% of trials where variant is significantly better), loss rate, inconclusive rate, and a lift percentile distribution (P5, P25, P50, P75, P95).

This gives a tangible answer to "if I rerun this experiment 5,000 times, how often does B actually win?" - much more intuitive than abstract p-values.
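A standard-library Python sketch of the simulation loop described above. Several details are assumptions made for the sketch: "significantly better" is taken to mean a pooled two-sided Z-test at α = 0.05 with a positive Z, percentiles use a simple nearest-rank rule, and the helper names are hypothetical. The usage example uses fewer trials and smaller samples than the tool's 5,000 so it runs quickly on any Python version.

```python
import random
from math import sqrt
from statistics import NormalDist

def draw_binomial(rng: random.Random, n: int, p: float) -> int:
    """Draw one Binomial(n, p) sample using only the standard library."""
    if hasattr(rng, "binomialvariate"):             # available from Python 3.12
        return rng.binomialvariate(n, p)
    return sum(rng.random() < p for _ in range(n))  # slower fallback: n Bernoulli draws

def percentile(sorted_values: list[float], q: float) -> float:
    """Nearest-rank style percentile of an already sorted list (q in [0, 1])."""
    return sorted_values[round(q * (len(sorted_values) - 1))]

def monte_carlo_ab(n1: int, p1: float, n2: int, p2: float,
                   trials: int = 5000, alpha: float = 0.05, seed: int = 0) -> dict:
    """Re-run the Z-test on fresh binomial samples, treating p1 and p2 as the truth."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    wins = losses = 0
    lifts = []
    for _ in range(trials):
        s1, s2 = draw_binomial(rng, n1, p1), draw_binomial(rng, n2, p2)
        r1, r2 = s1 / n1, s2 / n2
        pooled = (s1 + s2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (r2 - r1) / se if se > 0 else 0.0
        if z > z_crit:
            wins += 1        # variant significantly better
        elif z < -z_crit:
            losses += 1      # variant significantly worse
        if r1 > 0:
            lifts.append((r2 - r1) / r1)  # relative lift for this trial
    lifts.sort()
    labels = (("P5", 0.05), ("P25", 0.25), ("P50", 0.50), ("P75", 0.75), ("P95", 0.95))
    return {
        "win_rate": wins / trials,
        "loss_rate": losses / trials,
        "inconclusive_rate": (trials - wins - losses) / trials,
        "lift_percentiles": {name: percentile(lifts, q) for name, q in labels} if lifts else {},
    }

# Hypothetical inputs: 2,000 visitors per arm, 5.0% vs 6.5% observed rates
result = monte_carlo_ab(n1=2_000, p1=0.05, n2=2_000, p2=0.065, trials=1_000, seed=1)
print(result)
```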

Comparison

A/B testing vs pre/post analysis

Two methods, two purposes. Choose based on what you can randomize.

Dimension	A/B Test	Pre/Post Analysis
Comparison structure	Two concurrent random groups	Two sequential time periods
Causal claim	Strong (causal inference)	Directional only
Confounding risk	Low (randomization controls for it)	High (seasonality, externals)
Best for	Component changes, copy variants, feature flags	Site-wide changes, brand campaigns
Required setup	Traffic-splitting infrastructure	Historical + post data
Sample size	Plan ahead with power analysis	Whatever traffic you had
Tools at Datapad	This page	Pre/Post Analyzer

Examples

Real-world A/B test examples

How teams actually apply A/B testing to ship winners with confidence.

E-commerce

Checkout button copy

An e-commerce store tests "Buy Now" (control, 10,000 visitors, 450 sales = 4.5%) against "Add to Bag" (variant, 10,000 visitors, 520 sales = 5.2%).

Plugged into this calculator: relative lift = +15.6%, p-value ≈ 0.021, power ≈ 63%, Wilson CI for variant: [4.78%, 5.65%].

Verdict: Significant at 95%, but power is below 80% - the test would benefit from another week of traffic to be conclusive. Wait before shipping.

SaaS

Pricing page redesign

A SaaS team tests two pricing layouts. Control: 8,200 trial signups, 412 paid (5.0%). Variant: 8,150 signups, 530 paid (6.5%).

Relative lift = +29.4%, p < 0.001, power ≈ 98%, Cohen's effect-size diagnostics confirm a meaningful change.

Action: Ship the variant. The high power and tight confidence interval ([5.99%, 7.06%]) make this a clean win. Document for the experimentation log.

Multi-variant

A/B/C/D landing-page test

A growth team tests 3 hero headlines (B, C, D) against control (A). Each variant gets 10,000 visitors. With 3 comparisons, this calculator applies Bonferroni: α_adj = 0.05 / 3 = 0.0167.

Result: B p = 0.04 (not significant after Bonferroni), C p = 0.008 (winner), D p = 0.31 (no effect).

Lesson: Without Bonferroni, B would have looked like a winner - but the family-wise error rate would be inflated. Always correct for multiple comparisons in multi-variant tests.

Underpowered

Low-traffic test

A blog tests two newsletter signup placements. Control: 1,200 visits, 36 signups (3.0%). Variant: 1,180 visits, 47 signups (4.0%).

Relative lift = +32.8%, but p ≈ 0.19 - not significant. Power: only about 26%. Sample size needed for 80% power: ~4,900 per variant.

Action: Don't conclude "no difference." Run the test about 4× longer, or accept that the effect (if real) is too small to detect at current traffic levels.

Pitfalls

Common A/B testing mistakes

The traps that turn legitimate experiments into false wins and shipped regressions.

Mistake #1

Peeking and stopping early

Checking results daily and stopping the moment p < 0.05 inflates the false-positive rate well past the 5% you think you're getting. With 10 evenly spaced checks, the true Type I error rate is roughly 19%, and it keeps climbing with every additional look.

Fix: Pre-commit a sample size based on power analysis, then run until you hit it. If you must peek, use sequential testing methods (mSPRT, group-sequential designs) instead of fixed-horizon Z-tests.
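The effect of peeking can be illustrated with a small A/A simulation in Python. This is a sketch under a stated approximation: under the null hypothesis, with equal-sized batches of traffic between looks, the cumulative Z-statistic at look j behaves like the sum of j independent standard normal draws divided by √j. The function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(looks: int = 10, alpha: float = 0.05,
                                simulations: int = 20_000, seed: int = 0) -> float:
    """Share of A/A tests declared significant when stopping at the first look with p < alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # evidence from one more equal-sized batch
            z = running_sum / sqrt(look)        # cumulative Z-statistic at this look
            if abs(z) > z_crit:
                false_positives += 1            # stopped early on a false positive
                break
    return false_positives / simulations

print("1 look:  ", peeking_false_positive_rate(looks=1))
print("10 looks:", peeking_false_positive_rate(looks=10))
```

A single look stays near the nominal 5%, while ten looks push the rate several times higher, even though there is no real difference between the arms.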

Mistake #2

Running too short

Conversion behavior varies dramatically by day-of-week and time-of-month. A 3-day test in the middle of a week misses weekend visitors entirely.

Fix: Run for at least one full business cycle (typically 7-14 days) regardless of when you hit your sample size target. Longer for B2B or high-consideration purchases where decision cycles span weeks.
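A rough planning helper in Python that combines the sample-size target with the business-cycle rule. The function name and the `daily_visitors` input are hypothetical, and rounding up to whole weeks is an assumption of this sketch (so every weekday is represented equally), not something the calculator is stated to do.

```python
from math import ceil

def planned_duration_days(n_per_variant: int, arms: int, daily_visitors: int,
                          min_days: int = 7) -> int:
    """Days to run: enough traffic for every arm, at least min_days, in whole weeks."""
    if daily_visitors <= 0 or arms < 2:
        raise ValueError("need positive daily traffic and at least two arms")
    days_for_sample = ceil(n_per_variant * arms / daily_visitors)
    weeks = max(ceil(days_for_sample / 7), ceil(min_days / 7))
    return weeks * 7

# Hypothetical plan: 7,700 visitors per variant, 2 arms, 1,500 eligible visitors per day
print(planned_duration_days(7_700, arms=2, daily_visitors=1_500))
```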

Mistake #3

Ignoring SRM warnings

Sample Ratio Mismatch means your randomization is broken. Common causes: bot filtering hits one variant more, redirect logic breaks for one cohort, feature flags leak across sessions, caching layer misroutes.

Fix: Treat any SRM p-value below 0.01 as a hard stop. Investigate the pipeline, fix the bug, restart the experiment. A "win" with an SRM warning is not a win.

Mistake #4

HARKing and post-hoc segments

"Hypothesizing After Results are Known" - slicing data by browser, country, device, plan tier, and finding "significant wins" in some segment after the overall test failed. Each slice is a new test; significance becomes a coincidence game.

Fix: Pre-register your primary metric and segments. Treat post-hoc findings as hypotheses for the next experiment, never as conclusions from the current one.

Mistake #5

Underpowered tests

Running a test with 30% power means you have a 70% chance of missing a real effect. You'll often conclude "no difference" when the variant actually works - and discard a real win.

Fix: Use the Power Analysis tab to size your test before you launch. If your traffic can't support 80% power at your minimum interesting effect, the experiment isn't worth running - pick a higher-traffic surface or a more dramatic intervention.

Mistake #6

Confusing significance with importance

With millions of users, even a tiny change can become "statistically significant." For example, a 0.05 percentage-point lift on a 1% baseline with 1M users per arm gives p < 0.001, yet the real-world impact may be negligible. Significance is necessary but not sufficient.

Fix: Always evaluate three things together - p-value (is it real?), effect size (how big?), and business impact (worth shipping?). All three should support the decision.
