§ 01 · free tool

Statistical significance. After the test, before the call.

Enter control + variant visitors and conversions. We run a two-proportion z-test and surface the p-value, confidence, and absolute and relative lift. The verdict tells you whether the result clears your chosen significance threshold (90 / 95 / 99%). Pure browser math.
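For the curious, here is a minimal sketch of that two-proportion z-test in TypeScript. The names (`ArmData`, `zTest`) are illustrative, not this tool's actual source; the normal CDF uses the Abramowitz-Stegun 7.1.26 erf approximation.

```typescript
interface ArmData {
  visitors: number;
  conversions: number;
}

// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation
// (max error ~1.5e-7, plenty for a significance readout).
function normCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

function zTest(control: ArmData, variant: ArmData) {
  const pC = control.conversions / control.visitors;
  const pV = variant.conversions / variant.visitors;

  // Pooled rate and standard error under the null hypothesis (no difference).
  const pooled =
    (control.conversions + variant.conversions) /
    (control.visitors + variant.visitors);
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.visitors + 1 / variant.visitors)
  );

  const z = (pV - pC) / se;
  const pValue = 2 * (1 - normCdf(Math.abs(z))); // two-tailed
  return {
    pValue,
    confidence: 1 - pValue,
    absoluteLiftPp: (pV - pC) * 100, // percentage points
    relativeLift: pV / pC - 1, // vs control rate
  };
}
```

With control at 500/10,000 (5.0%) and variant at 560/10,000 (5.6%), this gives z ≈ 1.89 and p ≈ 0.058: significant at 90%, not at 95%.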


Privacy: calculation happens in your browser. Nothing is sent or logged.

§ 02 · reading the verdict

Four states. Different actions.

Significant winner. p-value below threshold AND variant beats control. Ship the variant. Document the lift, document the change, move on. The remaining decision is whether the absolute lift is worth the implementation complexity — a 50% relative lift on a tiny baseline (1% → 1.5%) is just 0.5pp absolute and may not justify the code.

Significant loser. p-value below threshold AND variant performs worse than control. Discard the variant. The data is conclusive that the change you tested makes things worse. Save the experiment notes — the negative result is information about what the audience cares about. Don't re-run the same test hoping for a better outcome.

Suggestive but not significant. p-value just above your threshold (roughly 0.05-0.20 at the default 95%), lift trending in one direction. The test is underpowered for the effect size: there's a signal, but not enough sample to call it. Either extend the test (use our Sample Size Calculator to estimate how much more traffic you need) or accept the inconclusive result and move to the next test.

Not significant. p-value above 0.20. The observed difference is well inside the noise band; treat it as a tie. The variant didn't help, and it didn't clearly hurt either. Move on without shipping. Don't fall into the trap of extending the test for weeks hoping for significance: if four weeks didn't produce a signal at this baseline, another four rarely will.
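Mechanically, the four states could fall out of the numbers like this (a sketch only; the function name is hypothetical, and the 0.20 'suggestive' boundary mirrors the prose above):

```typescript
type Verdict =
  | "significant winner"
  | "significant loser"
  | "suggestive but not significant"
  | "not significant";

// threshold is the chosen confidence level: 0.90, 0.95, or 0.99.
function classify(pValue: number, relativeLift: number, threshold: number): Verdict {
  const alpha = 1 - threshold; // e.g. 95% confidence means alpha = 0.05
  if (pValue < alpha) {
    return relativeLift > 0 ? "significant winner" : "significant loser";
  }
  // Between alpha and 0.20: a signal, but not enough sample to call it.
  return pValue <= 0.2 ? "suggestive but not significant" : "not significant";
}
```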

§ 03 · when to use this

Four jobs this tool covers.

Job 1: Read out an A/B test result. The dashboard in your testing tool gives you the same numbers, but most marketers don't know whether to trust them. Paste the visitor + conversion counts here, get the explicit verdict + p-value + confidence + lift. Useful as a sanity check on the testing tool's own significance read.

Job 2: Re-analyze a test at a different threshold. Your testing tool defaults to 95% — what would the result look like at 90% (more permissive) or 99% (stricter)? Switch the threshold dropdown and see whether the call changes. Useful when the same test result needs to defend itself in different contexts (engineering wants 99%, marketing accepts 90%).

Job 3: Quick-check pre-rollout. Before flipping a feature flag from 50/50 to 100/0, paste the cumulative test data and confirm the significance. The 5 minutes spent confirming saves the rollback cost of shipping a non-significant change as a winner.

Job 4: Educate stakeholders. "What's a p-value?" The verdict explanation written here gives stakeholders a one-line interpretation they can repeat in the next meeting. Useful as a teaching tool for non-technical PMs and execs who need to read A/B results without misinterpreting them. Pair with our Sample Size Calculator for the pre-test side of the conversation.

§ 04 · questions

Six questions users ask.

What's a p-value?

The probability of observing a difference at least as large as the one you saw, IF the true conversion rates of control and variant were identical (the null hypothesis). p = 0.03 means that if there were truly no difference, a lift at least this large would show up only 3% of the time. Lower p means more confidence the effect is real. Convention: significant at p < 0.05 (95% confidence), strongly significant at p < 0.01 (99%). The p-value is NOT the probability the variant is better than control; that's a common misreading. The actual interpretation is the probability of the result given no effect.
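In symbols, this is the textbook definition (nothing specific to this tool), with p̂_c and p̂_v the observed control and variant rates:

```latex
p\text{-value} = \Pr\left( |\hat{p}_v - \hat{p}_c| \ge \text{observed difference} \,\middle|\, H_0\colon p_v = p_c \right)
```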

What's the difference between absolute and relative lift?

Absolute lift is variant rate minus control rate (e.g. 5.2% - 5.0% = 0.2 percentage points or 0.2pp). Relative lift is variant divided by control (5.2% / 5.0% - 1 = 4% relative). For low baselines, relative lift looks dramatic — a baseline of 1% lifting to 1.5% is a 50% relative lift but only 0.5pp absolute. Both numbers tell a story; we surface both. For decision-making, use whichever frames the trade-off best — relative lift sounds bigger, absolute lift converts directly to incremental revenue.

Two-tailed or one-tailed test?

We use two-tailed by default: the test answers 'is the variant different from control,' allowing the difference to go either direction. One-tailed tests answer 'is the variant better than control' specifically. One-tailed gives slightly more power (smaller required samples for the same significance), but only if you committed to the direction up front. If you'd accept a winner in either direction, two-tailed is honest. If you have a clear prior that the variant should improve, one-tailed is justifiable. Most A/B testing tools default to two-tailed because it's the more conservative choice.
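The relationship is simple: when the effect lands in the predicted direction, the two-tailed p-value is exactly double the one-tailed one. A sketch using the z ≈ 1.89 example from the intro, with the CDF value hardcoded so the snippet stands alone:

```typescript
// z = 1.89 in the predicted direction; Φ(1.89) ≈ 0.9706.
const phi = 0.9706;               // standard normal CDF at z = 1.89
const pOneTailed = 1 - phi;       // ≈ 0.029: significant at 95% one-tailed
const pTwoTailed = 2 * (1 - phi); // ≈ 0.059: not significant at 95% two-tailed
```

Same data, two different calls, which is exactly why the direction has to be committed before the test, not after.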

Why is significance still 'no' even though variant looks better?

Three common reasons. (1) Small sample: even visible-looking lifts can fall inside the noise band on small samples. Run our Sample Size Calculator to check whether your test had enough volume. (2) Small effect size: the lift you observed is just barely above zero. (3) Both: small sample AND small effect. Significance at 95% requires the observed difference to be unlikely under the null hypothesis; tiny differences on small samples are exactly what the null hypothesis predicts. Either run longer, or accept that the test was underpowered and no conclusion can be drawn.
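For a sense of the volume involved, here is the textbook normal-approximation sample size formula. This is an assumption about the method, not necessarily what our Sample Size Calculator runs:

```typescript
// n per arm ≈ (z_alpha/2 + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
function sampleSizePerArm(baseline: number, target: number): number {
  const zAlpha = 1.96;  // two-tailed, 95% confidence
  const zBeta = 0.8416; // 80% power
  const variance = baseline * (1 - baseline) + target * (1 - target);
  const effect = target - baseline;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (effect * effect));
}

// Detecting a 5.0% -> 6.0% lift needs about 8,156 visitors per arm.
sampleSizePerArm(0.05, 0.06); // 8156
```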

What if I tested multiple variations at once?

This calculator is for single control-vs-variant comparisons. Multi-variation tests with one control require multiple-comparison corrections (Bonferroni, Benjamini-Hochberg) to control the family-wise error rate — running 5 comparisons at p < 0.05 each gives you a ~23% chance of a false positive on the family. For multi-arm tests, use a tool like Optimizely or VWO that handles the correction natively, or apply Bonferroni manually (divide your alpha by the number of comparisons). For 2-variation tests, this calculator is correct.
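The arithmetic behind both numbers, as a sketch:

```typescript
// Five comparisons at p < 0.05 each: the chance of at least one false
// positive across the family is 1 - 0.95^5, about 22.6%.
const alpha = 0.05;
const comparisons = 5;
const familyWiseErrorRate = 1 - (1 - alpha) ** comparisons; // ≈ 0.226
// Bonferroni: divide alpha by the number of comparisons.
const correctedAlpha = alpha / comparisons; // 0.01 per comparison
```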

Is the data I enter sent anywhere?

No. Calculation happens entirely in your browser. The page is static HTML; the only network request is the initial page load. Safe for sensitive conversion data, internal A/B results, or any number you wouldn't want shared with a third party.