§ 01 · free tool

A/B test sample size. Per variation, with traffic projection.

Set your baseline conversion rate, the minimum relative effect you want to detect, and the statistical power and confidence levels. We compute the required sample size per variation and project how long the test will take given your daily traffic. Pure two-proportion z-test math.

The control group's current conversion rate.

Relative lift you want to be able to detect (e.g. 10% means baseline 3% → variant 3.3%+).

Leave blank to skip the time-to-significance projection.

required sample size · per variation
time to significance
scenario comparison · per-variation sample at this baseline + power

    Privacy: calculation happens in your browser. Nothing is sent or logged.
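
    For the curious, the calculation behind the tool is a standard two-proportion z-test power formula. A minimal sketch in Python, assuming an unpooled-variance approximation (the tool's exact formula may differ slightly, e.g. pooled variance or a continuity correction); the function name and example values are illustrative:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variation sample for a two-sided two-proportion z-test.

    baseline      control conversion rate, e.g. 0.03 for 3%
    relative_mde  smallest relative lift worth detecting, e.g. 0.10 for 10%
    alpha         1 - confidence (0.05 for 95% confidence)
    power         1 - beta (0.80 is the usual default)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)                  # two-sided critical value
    z_power = z(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)    # unpooled variance terms
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, 10% relative MDE, 95% confidence, 80% power
print(sample_size_per_variation(0.03, 0.10))    # roughly 53,000 per variation
```

    Divide the result by your expected daily visitors per variation and you have a rough time-to-significance projection.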

    § 02 · choosing the inputs

    Five settings. Each has trade-offs.

    Baseline conversion rate. The control's current conversion rate. Pull it from your analytics over a representative recent window: last 4 weeks for stable funnels, last quarter for seasonal ones. The baseline drives sample size more than any other input; at a 1% baseline you need on the order of 100× the sample of a 50%-baseline test at the same relative MDE.

    Minimum detectable effect. The smallest improvement that would be worth shipping. If a 5% relative lift wouldn't change your roadmap, set MDE higher. Setting MDE smaller than your shipping threshold makes the test slower and answers a question you don't care about. Common starting MDE: 10-20% for ecommerce purchase tests, 5-10% for click-rate tests, 20-50% for low-baseline conversions.

    Confidence (1 - alpha). Alpha is the rate of false positives you accept — calling a winner when there isn't one. Industry standard is 95% confidence (1 in 20 false positives). Use 99% for high-stakes irreversible changes (pricing, account deletion flows). Use 90% for low-stakes UI tests where rolling out a wrong winner is easily reversed.

    Power (1 - beta). The rate of true positives — correctly identifying real winners. Industry standard is 80% (you'll miss 1 in 5 real winners). Higher power (90%) is appropriate when missing a winner costs more than rolling out a loser — for example, abandoning a treatment that would have grown revenue 30%. Lower power (70%) is appropriate when you can re-test cheaply.

    Variations. Most tests are 2-variation (control vs treatment). Multi-variation (3+) tests split your traffic, so each variation gets less of it. The required per-variation sample doesn't change with variation count, but the total traffic needed scales linearly. For 4-variation tests against a single control, consider running them in series instead of in parallel — faster cycle, better signal.
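
    To make these trade-offs concrete, here is a rough sweep using the same unpooled z-test approximation as the earlier sketch; every number below is illustrative, not a benchmark:

```python
from math import ceil
from statistics import NormalDist

def per_variation(baseline, rel_mde, alpha=0.05, power=0.80):
    # Unpooled two-proportion z-test approximation, as in the earlier sketch.
    p2 = baseline * (1 + rel_mde)
    z = NormalDist().inv_cdf
    var = baseline * (1 - baseline) + p2 * (1 - p2)
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * var / (p2 - baseline) ** 2)

# Baseline dominates: same 10% relative MDE, default confidence and power.
for b in (0.50, 0.10, 0.03, 0.01):
    print(f"baseline {b:>3.0%}: {per_variation(b, 0.10):>9,} per variation")

# Halving the MDE roughly quadruples the sample (3% baseline throughout).
for mde in (0.20, 0.10, 0.05):
    print(f"MDE {mde:>3.0%}: {per_variation(0.03, mde):>9,} per variation")

# More variations multiply total traffic, not the per-variation sample.
n = per_variation(0.03, 0.10)
for k in (2, 3, 4):
    print(f"{k} variations: {k * n:>9,} total visitors")
```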

    § 03 · when to use this

    Four jobs this tool covers.

    Job 1: Pre-test sizing. Before launching an A/B test, plug your baseline + the smallest lift worth shipping into the calculator. The output tells you whether your traffic supports the test. If the answer is "10 weeks at current traffic" and the test was supposed to inform a Q3 decision, you need to either increase the MDE, find more traffic, or pick a different test.

    Job 2: Decide whether to test at all. Some tests are too expensive to run. A test on the final purchase step at 1% baseline with a 5% MDE needs ~600,000 visitors per variation; if you can send 10K visitors per month to each variation, the test would take 60 months — pointless. Either pick a higher-baseline test (header CTA, not final purchase), pick a higher MDE, or skip A/B testing and ship the change with rollback ready.
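
    A sketch of that decision, using the same approximation as above and an assumed 10K visitors per month per variation:

```python
from math import ceil
from statistics import NormalDist

def per_variation(baseline, rel_mde, alpha=0.05, power=0.80):
    # Same unpooled two-proportion z-test approximation as the earlier sketches.
    p2 = baseline * (1 + rel_mde)
    z = NormalDist().inv_cdf
    var = baseline * (1 - baseline) + p2 * (1 - p2)
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * var / (p2 - baseline) ** 2)

monthly_per_variation = 10_000             # assumed traffic each variation gets
for mde in (0.05, 0.20, 0.50):             # 5%, 20%, 50% relative MDE
    n = per_variation(0.01, mde)           # 1% purchase-rate baseline
    print(f"MDE {mde:.0%}: {n:,} per variation, "
          f"~{n / monthly_per_variation:.1f} months")
```

    At a 50% relative MDE the same test fits in roughly a month, which is exactly the re-scoping trade described above.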

    Job 3: Defend a sample-size choice. Stakeholder asks "why are we running this for three weeks?" — the calculator gives you the explicit input → output trace. "We need 12,000 per variation at 80% power and 95% confidence to detect the 10% lift we're targeting; at 1,500 visitors per day per variation, that's 8 days minimum." Defensible math beats stakeholder hand-waving.

    Job 4: Plan a testing roadmap. Run the calculator across your candidate test list. The ones with reasonable sample sizes go on the active roadmap. The ones with impractical sample sizes get re-scoped, killed, or moved to "ship and monitor" instead. Pair with our Statistical Significance Calculator for the post-test analysis side.

    § 04 · questions

    Six questions users ask.

    What's minimum detectable effect?

    MDE is the smallest improvement you want the test to be able to reliably detect. Set it as a relative percentage from baseline — a baseline of 5% conversion with an MDE of 10% means you want to be able to reliably distinguish a treatment that shifts conversion to 5.5% or above. Smaller MDEs require dramatically larger samples — halving the MDE quadruples the required sample. Pick the smallest MDE that would actually be worth shipping; setting it lower than that wastes test time.

    What do power and confidence mean?

    Confidence (1 - alpha, default 95%) controls the rate of false positives you tolerate — how often you'd call a winner when there isn't one. 95% confidence means 1-in-20 false positives. Power (1 - beta, default 80%) is the rate of true positives — how often you'd correctly identify a real winner. 80% power means you'll miss 1 in 5 real winners. Both defaults are industry standard. Higher confidence (99%) is appropriate for high-stakes changes; higher power (90%) is appropriate when missing a winner is more costly than rolling out a loser.
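
    Confidence and power enter the sample-size formula only through their z-scores. A small sketch of how the common settings scale the required sample relative to the 95% / 80% default:

```python
from statistics import NormalDist

def z_factor(alpha, power):
    # (z_{1-alpha/2} + z_{power})^2 is the only place confidence and power
    # enter the two-proportion sample-size formula.
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2

default = z_factor(0.05, 0.80)              # 95% confidence, 80% power
for alpha, power in [(0.10, 0.80), (0.05, 0.80), (0.05, 0.90), (0.01, 0.90)]:
    factor = z_factor(alpha, power) / default
    print(f"confidence {1 - alpha:.0%}, power {power:.0%}: {factor:.2f}x sample")
```

    Moving from the defaults to 99% confidence and 90% power roughly doubles the required sample; dropping to 90% confidence trims it by about a fifth.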

    Why is the required sample so large?

    Baseline conversion rate is the dominant factor. At 50% baseline (a click-rate scenario), the sample is small. At 1% baseline (a typical purchase rate), the sample is roughly 100× larger at the same relative MDE, because the variance is dominated by the rare event. Most ecommerce purchase tests need tens of thousands of visitors per variation; high-funnel button-click tests need only hundreds. With low baselines it also helps to think in absolute as well as relative terms — a 1% baseline shifting to 1.5% is a 50% relative lift but only a 0.5pp absolute change.

    Should I peek at results during the test?

    No, not for the fixed-horizon tests this calculator assumes. Peeking inflates the false-positive rate above the nominal alpha — the more often you peek, the more likely you are to call a winner that's actually noise. If you must peek, use sequential or Bayesian methods (Optimizely, VWO, and AB Tasty all support them); they're built to handle continuous monitoring. For a plain fixed-horizon frequentist test, set the sample size up front, run until you hit it, then look at the result.
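
    To see the inflation yourself, here is a small simulation sketch: it runs A/A tests (no real difference between groups) and compares one fixed final look against checking after every batch. All parameters below are arbitrary assumptions, not recommendations:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
Z_CRIT = NormalDist().inv_cdf(0.975)      # two-sided, alpha = 0.05
BASELINE = 0.05                           # true rate in BOTH groups (A/A test)
PEEKS, BATCH, RUNS = 10, 500, 1_000       # 10 looks, 500 visitors/variation/look

def z_stat(conv_a, conv_b, n):
    # Two-proportion z statistic, pooled variance, equal group sizes n.
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * p_pool * (1 - p_pool) / n)
    return (conv_b - conv_a) / (n * se) if se else 0.0

final_hits = peek_hits = 0
for _ in range(RUNS):
    ca = cb = 0
    stopped_early = False
    for peek in range(1, PEEKS + 1):
        ca += sum(random.random() < BASELINE for _ in range(BATCH))
        cb += sum(random.random() < BASELINE for _ in range(BATCH))
        if abs(z_stat(ca, cb, peek * BATCH)) > Z_CRIT:
            stopped_early = True          # a peeker would call a winner here
    peek_hits += stopped_early
    final_hits += abs(z_stat(ca, cb, PEEKS * BATCH)) > Z_CRIT

print(f"false positives, one final look:   {final_hits / RUNS:.1%}")  # about 5%
print(f"false positives, peeking 10 times: {peek_hits / RUNS:.1%}")   # well above 5%
```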

    What if my traffic is small?

    Three options. (1) Test higher in the funnel where the conversion rate is larger — a header CTA click usually converts at 10-20× the rate of the final purchase step, so the required sample is much smaller. (2) Test bigger changes (a larger MDE) that you can detect with less traffic. (3) Run the test longer; most tools let a test run for several weeks. If you can't reach statistical significance within 4-6 weeks, the test is probably not worth running — the cost of the decision delay outweighs the value of the certainty.
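
    Option (1) in numbers, with the same approximation as the earlier sketches and purely illustrative rates:

```python
from math import ceil
from statistics import NormalDist

def per_variation(baseline, rel_mde, alpha=0.05, power=0.80):
    # Same unpooled two-proportion z-test approximation as the earlier sketches.
    p2 = baseline * (1 + rel_mde)
    z = NormalDist().inv_cdf
    var = baseline * (1 - baseline) + p2 * (1 - p2)
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * var / (p2 - baseline) ** 2)

print(per_variation(0.01, 0.10))   # final purchase step, ~1% baseline
print(per_variation(0.20, 0.10))   # header CTA click, ~20% baseline
```

    At the same 10% relative MDE, the click test needs roughly 25× fewer visitors per variation than the purchase test.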

    Is the data I enter sent anywhere?

    No. Calculation happens entirely in your browser. The page is static HTML; the only network request is the initial page load. Safe for sensitive conversion-rate data, internal projections, or any number you wouldn't want shared with a third party.