§ · journal

A/B testing · the engineer's lens on conversion optimization.

Sample-size math, sequential testing, GA4 plus server-side eventing, statistical versus practical significance. The methodology behind real CRO at $1M-$10M ecom scale, not the colour-of-the-button version.

§ 01 · TL;DR

A/B testing is engineering, not opinion.

A/B testing is the conversion-optimization vehicle that decides which proposed change actually earns rollout. The engineering view: a power calculation says how big a sample you need before you can detect the effect size you care about, a fixed test plan says when you stop and what you stop on, GA4 plus server-side eventing carries the measurement, segmentation needs pre-registration to avoid p-hacking, and statistical significance is necessary but not sufficient. Practical significance (is the lift big enough to ship) is the second gate. Most ecom A/B tests on $1M-$10M stores fail the engineering test before the marketing test, which is why most reported wins do not replicate. This piece is the methodology, covering sample-size math, sequential testing, GA4 wiring, segmentation discipline, and three real DH client patterns. The companion stage-based piece on conversion optimization strategies covers the broader CRO programme; this one covers the testing infrastructure underneath it.

§ 02 · the engineer's definition

A controlled experiment with two or more arms, random assignment, and a primary metric.

The marketer's version of A/B testing is "we changed the headline and conversions went up." The engineer's version is more careful: a controlled experiment in which traffic is randomly assigned to two or more variants, a primary metric is declared in advance, the variants are observed for a fixed duration or sample size, and the difference between variants is evaluated against a probability model that accounts for sampling noise. The pieces matter individually; remove any one and the test stops being a test. The Wikipedia article on A/B testing covers the canonical definition; the underlying statistical theory is covered in the article on Student's t-test for continuous metrics or the chi-squared test for proportions like conversion rate.

Most ecom A/B tests fail at one of four points. Random assignment fails when one variant gets disproportionately more mobile traffic than the other (a sample-ratio mismatch, often caused by client-side flicker that disproportionately turns away slower devices). Primary metric fails when the team agrees post hoc that revenue per visitor matters more than conversion rate, after seeing the conversion-rate result. Fixed duration fails when the team peeks daily and stops the test the day the variant pulls ahead. Probability evaluation fails when nobody runs the calculation and "B beat A by 6 percent" gets read as a win when the test's noise floor was 8 percent.

The reason this matters at $1M-$10M ecom scale specifically is that traffic volumes there are exactly the wrong size: large enough to feel like A/B testing should work, small enough that most tests are statistically underpowered. Brands at $50M+ have the volume to detect 1-2 percent relative lifts; brands at $200K-$1M can usually skip A/B testing and ship on judgement because the traffic doesn't support it. The middle band is the testing-trap zone, and most CRO retainers operate inside it.

The framing for this piece: A/B testing is one tool inside a broader growth strategy programme. It belongs alongside qualitative research, session replay, post-purchase surveys, and analytical work that doesn't require an experiment. The mistake is to treat A/B testing as the whole CRO programme; it's the gate at the end of the pipeline, not the pipeline itself. The companion piece on conversion-rate optimization strategies covers the wider stage-based decision matrix; this one covers what good testing operations look like.

§ 03 · the sample-size problem

Most ecom A/B tests are statistically meaningless before they start.

A power calculation is not optional. It's the first thing you run before you write the variant.

A power calculation takes four inputs and returns one number: how many visitors per variant you need before the test can detect the effect size you've decided is worth detecting. The four inputs are baseline conversion rate (the current, control performance), minimum detectable effect or MDE (the smallest relative or absolute lift you care about), alpha (the false-positive rate you'll tolerate, typically 0.05), and statistical power (1-beta, the chance you'll catch a real effect, typically 0.8). The Wikipedia article on statistical significance covers the underlying frequentist theory.

Worked example. Baseline conversion rate 2.5 percent (typical for a $5M Shopify store on warm traffic). MDE 10 percent relative, meaning you want to detect lifts where 2.5 percent becomes 2.75 percent or higher. Alpha 0.05, power 0.8, one-tailed test. The required sample size lands at roughly 24,500 visitors per variant. Total traffic for a two-arm test: 49,000 visitors. If your store does 80,000 visitors a month, you need three weeks of full-traffic exposure for that test. If your store does 20,000 a month, the test takes 12 weeks, past the four-week cap on test validity, so the change is functionally untestable at that traffic level.

Now drop the baseline. At 1.5 percent (typical for cold-traffic landing pages), same 10 percent relative MDE, the requirement climbs to about 41,000 per variant. At 5 percent baseline, it drops to about 12,000. The absolute effect enters the calculation squared, so for a fixed relative MDE the requirement scales roughly inversely with the baseline; small baselines blow up sample requirements. The implication: do not run conversion-rate-as-primary-metric tests on cold paid-traffic landing pages unless you have serious volume. Use add-to-cart, lead form completion, or subscription opt-in as the primary metric instead, which lifts the baseline rate and shrinks the sample requirement.

The sample-size tools that matter. The DH A/B test sample size calculator on this site runs the calculation in 10 seconds. Evan Miller's open-source calculator is the long-running reference (evanmiller.org). Optimizely's stats engine documentation covers their always-valid inference frame and is worth reading even if you don't use Optimizely. The DH ecommerce profit calculator ties test results back to incremental contribution margin so you can size MDE against your actual operating economics.

The discipline that follows from the math: most ecom A/B tests at $1M-$10M scale are powered to detect 25-40 percent relative lifts and nothing smaller. Those are big lifts; they happen only on substantive structural changes (a new PDP layout, an unbundled-vs-bundled offer, a discount-stack rewrite, a checkout simplification). Micro-changes (button colour, headline tweaks, product-image swap) need to either be aggregated into a multivariate test or shipped on judgement and measured retrospectively, because the testing math doesn't support them.

Baseline CR | 10% rel MDE | 20% rel MDE | 30% rel MDE
1.0% | ~62,000 / variant | ~15,500 / variant | ~6,900 / variant
1.5% | ~41,000 / variant | ~10,300 / variant | ~4,600 / variant
2.5% | ~24,500 / variant | ~6,100 / variant | ~2,700 / variant
5.0% | ~12,000 / variant | ~3,000 / variant | ~1,300 / variant
10.0% | ~5,700 / variant | ~1,400 / variant | ~620 / variant

One-tailed z approximation on the baseline variance, alpha 0.05, power 0.8; rounded for readability. Exact two-tailed two-proportion tests (Evan Miller's calculator included) return materially larger requirements; treat these figures as the floor.
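The same arithmetic in code. A minimal sketch (function name and rounding are ours) of the one-tailed approximation behind the table above; exact two-tailed two-proportion calculators return larger numbers:

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variant(baseline, rel_mde, alpha=0.05, power=0.80):
    # One-tailed z approximation on the baseline variance, matching
    # the rounded table above. Exact two-tailed two-proportion tests
    # (e.g. Evan Miller's calculator) require materially more traffic.
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # 1.645 at alpha 0.05
    z_power = NormalDist().inv_cdf(power)      # 0.842 at power 0.80
    delta = baseline * rel_mde                 # absolute lift to detect
    variance = baseline * (1 - baseline)
    return ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

print(visitors_per_variant(0.025, 0.10))  # ~24,100 -> the ~24,500 row
print(visitors_per_variant(0.015, 0.10))  # ~40,600 -> the ~41,000 row

# Weeks of full-traffic exposure for a two-arm test at a given volume:
monthly_visitors = 80_000
weeks = 2 * visitors_per_variant(0.025, 0.10) / (monthly_visitors / 4.33)
print(round(weeks, 1))  # ~2.6 weeks at 80k visitors/month
```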

§ 04 · the hypothesis hierarchy

Four questions before any test.

Most CRO retainers skip the hypothesis stage and go straight to variants. The result is tests that win or lose without telling you anything about the next test.

Question one. What behaviour are we trying to change? Specific user behaviour, not a metric. "Visitors who add a product to cart but abandon at the cart page before reaching checkout." Not "conversion rate." A test that doesn't name the behaviour ends up testing the wrong page or the wrong moment in the journey.

Question two. What evidence makes us think this behaviour is changeable? Qualitative or quantitative grounding. Session replay of cart-abandoners showing a specific friction (e.g., a shipping cost surprise at checkout). Post-purchase survey data. Heatmap evidence. Cohort analysis. A test built on "I think this would be better" without grounding fails harder than a test built on weak evidence, because the team has nothing to learn from when the test loses.

Question three. What's the smallest change that could plausibly move the behaviour? Smaller change is faster to implement, easier to roll back, and isolates the cause if the test wins. The Big Game Sports cart-abandonment ribbon test (covered in our companion CRO piece and again in § 08 below) was deliberately scoped narrow: a single banner near the cart total, surfacing the free-shipping threshold, with no other change to the cart page. That isolation is what made the result interpretable.

Question four. What's the practical-significance threshold for shipping? Pre-registered. Typically 5 percent relative on a primary KPI for $1M-$10M brands; 2-3 percent for $50M+; 8-10 percent for sub-$1M where you want a clean signal before committing engineering work. If the test wins statistically but fails to clear the practical-significance threshold, the change does not ship even though "B beat A." The reverse case (losses inside the noise floor on tests of qualitatively superior variants) is harder to call; in practice we usually retest with a tighter sample after additional qualitative evidence.

Pre-register the four answers in a written test brief that the engineering and growth teams both sign before the variant is built. The brief lives in the same Linear ticket the engineering work attaches to. Tests without a written brief skip questions one and four most often, which is why they generate inconclusive results.
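The brief itself can be as light as a structured record. A minimal sketch, with field names of our own invention; the discipline is that every field is filled in and signed off before the build starts:

```python
# A minimal test-brief record covering the four pre-registered answers.
# Field names are illustrative, not a template any tool mandates.
test_brief = {
    "behaviour": "cart abandoners who never reach checkout",       # Q1
    "evidence": "session replays showing shipping-cost surprise",  # Q2
    "smallest_change": "free-shipping-threshold ribbon, cart page",# Q3
    "ship_threshold_rel": 0.05,  # Q4: practical-significance gate
    "primary_metric": "cart_to_checkout_rate",
    "signed_off_by": ["engineering_lead", "growth_lead"],
}
```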

§ 05 · picking the right metric

Conversion rate alone misleads. Use a metric tree.

Most A/B tests are evaluated on conversion rate as the primary metric and revenue per visitor as a secondary. That works for some tests and breaks for many. The breakage happens when a variant pulls in more low-AOV buyers (the conversion rate goes up but RPV stays flat or drops), or when a variant filters out low-AOV buyers (conversion rate drops but RPV rises). Reading conversion rate alone in either case ships the wrong decision.

The fix is a metric tree, declared in advance. Primary metric: the one that drives the ship decision. Secondary metrics: the ones that protect against unintended consequences. Guardrail metrics: the ones that block ship if they move the wrong way regardless of primary movement (typical guardrails: page load time, error rate, return rate, refund rate, customer support contact rate). The tree shapes the decision rule before the data arrives, which is how you avoid post-hoc rationalisation; a sketch of that decision rule follows the list below.

For ecom at $1M-$10M, the metric trees that work look like this:

  • Tests on the funnel above checkout (PDP changes, collection changes, hero changes): primary is add-to-cart rate. Secondary is conversion rate and revenue per visitor. Guardrail is bounce rate and PDP-to-cart latency.
  • Tests on the cart and checkout (cart-page changes, checkout-step changes, payment-method visibility): primary is conversion rate or checkout completion rate. Secondary is AOV and revenue per visitor. Guardrail is checkout error rate, payment authorisation failure rate, and customer-support contact rate.
  • Tests on pricing and offers (discount stack, free-shipping threshold, bundle offers): primary is revenue per visitor or contribution margin per visitor. Secondary is conversion rate and AOV. Guardrail is return rate and refund rate at 30 days.
  • Tests on retention surfaces (subscription opt-in, account creation, post-purchase): primary is the long-term metric (90-day repeat rate, subscription cohort retention). Secondary is the short-term metric (immediate opt-in rate). Guardrail is unsubscribe rate, customer-support contact rate.
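The decision rule the tree implies, sketched in code; thresholds are illustrative, not a platform API:

```python
def ship_decision(primary_p, primary_rel_lift, guardrails_ok,
                  alpha=0.05, practical_threshold=0.05):
    # Ship only when all three gates pass: statistical significance
    # on the primary metric, the pre-registered practical-significance
    # threshold, and no guardrail metric moving the wrong way.
    if not guardrails_ok:
        return "hold: guardrail breach"
    if primary_p >= alpha:
        return "hold: not statistically significant"
    if primary_rel_lift < practical_threshold:
        return "hold: wins statistically, below practical threshold"
    return "ship"

print(ship_decision(primary_p=0.02, primary_rel_lift=0.03,
                    guardrails_ok=True))
# -> "hold: wins statistically, below practical threshold"
```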

The Emani subscription cadence test, covered in § 08, ran on subscription-cohort retention at 90 days, not on day-0 opt-in rate, because the immediate metric had been gamed in a prior version of the test. Picking the long-horizon metric meant a longer test (the team waited 14 weeks for the 90-day cohort to mature), but the answer that came back was actionable in a way the day-0 answer wasn't.

One more wrinkle: segment-level metrics versus overall. The right primary metric is almost always the overall metric, with mobile vs desktop as the only pre-registered segment. Other segmentation (paid vs organic, new vs returning, US vs RoW) is exploratory unless the test was explicitly powered for the segment, which most tests aren't.

§ 06 · segmentation without p-hacking

Slice with discipline or don't slice.

Segmentation is where p-hacking happens most often in ecom A/B testing, usually accidentally. The pattern: a test wins flat overall, the team slices it by mobile vs desktop, finds mobile wins and desktop loses or vice versa, ships on the segment that won. The hazard: at p < 0.05 with 10 segments tested, you'd expect roughly 0.5 false-positive segments per test by chance alone. Across a year of testing that compounds into several false segment wins shipped to production, which is the source of most "we A/B tested everything and our conversion rate didn't go up" complaints.

Two disciplines hold against this. First, pre-register the segments you'll evaluate. For ecom that's typically just two: mobile vs desktop. The two cuts diverge enough (mobile typically converts at half to two-thirds the rate of desktop, mobile users are more impatient with friction, desktop users browse longer) that running them as separate primary analyses is justified. Other segments (traffic source, region, new vs returning, hour of day, day of week) are exploratory unless the test was specifically powered for the segment, which most aren't.

Second, apply a multiple-comparisons correction. The Bonferroni correction is the simplest: divide alpha by the number of comparisons. If you test 10 segments at alpha 0.05, the corrected threshold is 0.005 per segment. Bonferroni is conservative; the Benjamini-Hochberg false-discovery-rate procedure is more powerful when you have many segments. The Wikipedia article on the multiple comparisons problem covers both. Most A/B testing platforms ignore this entirely, which is why segment results from those platforms should be treated as hypothesis generation, not decision support.
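Both corrections fit in a few lines. A minimal sketch over a list of segment p-values; the example values are invented for illustration:

```python
def bonferroni(p_values, alpha=0.05):
    # Reject only segments whose p-value clears alpha / m.
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    # FDR control: sort p-values ascending, find the largest rank k
    # with p_(k) <= (k/m) * alpha, reject everything up to that rank.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

segment_p = [0.003, 0.04, 0.21, 0.38, 0.049]  # five exploratory slices
print(bonferroni(segment_p))          # only 0.003 clears alpha/5 = 0.01
print(benjamini_hochberg(segment_p))  # here, also only 0.003 survives
```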

The honest framing: segmentation is for the next test, not this test. If a test wins flat overall but the data hints mobile drove most of the lift, the next test is a mobile-only test with a tighter MDE, sized for mobile traffic alone. That's the testing operations cadence; trying to slice your way to a ship decision from one underpowered overall test is the failure mode.

Two harder cases worth flagging. Heterogeneous treatment effects. Some users respond positively to a variant and others respond negatively, and the overall flat result hides both. Modern uplift modelling (covered briefly in the Wikipedia entry on uplift modelling) handles this case but requires far more data than a typical $1M-$10M brand can muster. Multi-armed bandit testing. Trades off statistical purity against revenue protection by gradually shifting traffic to the leading variant during the test, which is better for teams running many low-stakes tests and worse for teams that need clean post-test learnings. The Wikipedia article on the multi-armed bandit problem is the canonical reference.
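For intuition on that traffic-shifting behaviour, a minimal Thompson-sampling sketch for a two-arm Bernoulli bandit; production bandit engines layer assignment stickiness, decay, and guardrails on top:

```python
import random

conversions = [0, 0]  # per arm
visitors = [0, 0]

def pick_arm():
    # Sample each arm's conversion rate from its Beta posterior and
    # send the visitor to the arm with the higher draw. Arms that look
    # better gradually and automatically receive more traffic.
    draws = [
        random.betavariate(1 + conversions[i], 1 + visitors[i] - conversions[i])
        for i in (0, 1)
    ]
    return draws.index(max(draws))

def record(arm, converted):
    visitors[arm] += 1
    conversions[arm] += int(converted)
```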

§ 07 · testing infrastructure

Three layers. Right tool per layer.

Client-side, server-side, measurement. Most ecom teams pick one layer, run everything through it, and lose to flicker, ad-blockers, or velocity.

layer 01 · client-side

Visual variant orchestration

Front-end visual changes on PDP, collection, cart, hero, modals. Tools: Optimizely Web, VWO, Convert. (Google Optimize was sunset in September 2023; GA4-native testing now runs through third-party platform integrations rather than a first-party Google tool.)

Strengths: fast to deploy, easy to roll back, no engineering work needed for many tests. Weaknesses: client-side flicker on slow devices, ad-blockers stripping the testing script from roughly 15-25 percent of traffic depending on category, sample-ratio mismatches when the assignment script lags.

layer 02 · server-side

Stack-level commerce changes

Pricing, discount stacks, shipping logic, checkout extensibility, eligibility rules. Tools: Shopify Functions for Shopify Plus stores, Next.js middleware for headless setups, edge-function platforms like Cloudflare Workers for non-Shopify rendering.

Strengths: zero flicker, ad-blocker-immune, runs on every request including API/PWA traffic. Weaknesses: longer deploy cycle, harder to roll back, requires engineering work per test (which makes it expensive on velocity).

layer 03 · measurement

GA4 + server-side eventing

GA4 is the measurement plane regardless of testing tool. Variant assignment ships as a custom dimension on every event. Server-side eventing via the GA4 Measurement Protocol recovers the 15-25 percent of events the client misses.

Strengths: free, integrates with most testing tools, attributes revenue back to variant. Weaknesses: GA4 sampling on long-window queries can erode test data; cross-device attribution gaps; reporting latency 24-48 hours on standard properties.

The pattern that works for $5M-$50M ecom: visual changes orchestrated client-side via Optimizely or VWO, structural changes orchestrated server-side via Shopify Functions or a Next.js middleware layer, GA4 carrying the test ID and variant ID as custom dimensions on every event with the events fired both client-side (immediate) and server-side (recovers ad-blocked traffic). Both event streams write to the same GA4 property; deduplication via event_id hashed from order ID + timestamp.
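Both mechanics named there, sticky assignment and the dedup key, sketched under our own naming; no platform mandates these exact hashes:

```python
import hashlib

def assign_variant(user_id: str, test_id: str, n_arms: int = 2) -> int:
    # Deterministic bucketing: the same user always lands in the same
    # arm for a given test, with no assignment state to store, and
    # different tests hash independently.
    digest = hashlib.sha256(f"{test_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_arms

def event_id(order_id: str, timestamp: str) -> str:
    # Shared dedup key: the client- and server-side purchase events
    # for the same order hash to the same id, so duplicates can be
    # dropped downstream.
    return hashlib.sha256(f"{order_id}:{timestamp}".encode()).hexdigest()[:16]
```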

The pattern that doesn't work: running pricing tests on a client-side platform. The price flicker is visible to the user, the test exposure is therefore biased, and any regulatory exposure (different prices to different users without the policy disclosure required by US state pricing-discrimination laws) becomes harder to manage. The inverse mismatch, running visual tests on Shopify Functions, fails too: deploy cycles measured in days kill iteration velocity, and a CRO retainer on a server-side-only stack will ship 5-10x fewer tests than one on a hybrid stack.

For the GA4 wiring specifically. Declare the test ID as a user-scoped custom dimension (variant assignment is sticky to the user, not the session). Declare the variant ID as event-scoped (so each event carries the variant in case the user crosses sessions or devices). Configure Enhanced Ecommerce events (purchase, add_to_cart, begin_checkout, view_item) so the revenue side of the metric tree is wired natively. The GA4 documentation at developers.google.com is the source of truth for the event taxonomy.
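A minimal sketch of the server-side purchase event via the Measurement Protocol. The endpoint and payload shape follow Google's documented `mp/collect` API; the `ab_test_id` and `ab_variant` names are our assumptions and must be registered as custom definitions (user- and event-scoped) in GA4:

```python
import requests  # assumes the requests package is installed

MEASUREMENT_ID = "G-XXXXXXXXXX"    # your GA4 data stream
API_SECRET = "your_mp_api_secret"  # created under Admin > Data Streams

def send_purchase(client_id, order_id, revenue, test_id, variant_id):
    # Server-side purchase event carrying the variant assignment.
    # client_id must match the _ga client id the browser used, so the
    # server-fired event stitches to the same GA4 user.
    payload = {
        "client_id": client_id,
        "user_properties": {"ab_test_id": {"value": test_id}},
        "events": [{
            "name": "purchase",
            "params": {
                "transaction_id": order_id,
                "value": revenue,
                "currency": "USD",
                "ab_variant": variant_id,  # event-scoped dimension
            },
        }],
    }
    requests.post(
        "https://www.google-analytics.com/mp/collect",
        params={"measurement_id": MEASUREMENT_ID, "api_secret": API_SECRET},
        json=payload,
        timeout=5,
    )
```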

§ 08 · three tests that moved real DTC numbers

Three patterns. Three real DH clients.

Anonymised where required, named where the brand consented. The point is the methodology, not the brand stories.

01

Emani Cosmetics · subscription cadence test

Hypothesis: moving the default subscription cadence from 30 days to 45 days will reduce day-30 unsubscribes (cosmetics customers were over-supplied at 30 days and cancelling rather than skipping) without dropping LTV. Primary metric: 90-day cohort retention. Secondary: day-0 opt-in rate, AOV per cycle. Sample plan: 14-week test on 6,000+ subscribers per arm, calculated to detect a 10 percent relative shift on 90-day retention. Outcome: 45-day cadence won on retention by a meaningful margin, day-0 opt-in held flat (no friction added), AOV per cycle held. The test became a structural change, not a one-off lift.

Why it worked: long-horizon primary metric. Day-0 opt-in had been the team's instinctive metric and it would have called the test a tie.

02

Big Game Sports · cart-abandonment ribbon

Hypothesis: cart-page abandonment was driven by free-shipping-threshold uncertainty (visitors couldn't see how close they were to free shipping). A single ribbon banner at the top of the cart, surfacing the threshold and remaining amount, will lift checkout conversion. Primary metric: cart-to-checkout conversion. Secondary: overall conversion rate, AOV. Sample plan: 4-week test, 14,000 cart-page visits per variant, MDE 8 percent relative. Outcome: ribbon won on cart-to-checkout, AOV held, overall conversion lifted as a downstream consequence. Shipped permanently.

Why it worked: narrow scope (one banner, no other cart-page changes), grounding in session-replay evidence of free-shipping confusion, primary metric matched the hypothesis.

03

Noble Paris · product-page social-proof stack

Hypothesis: trust signals on the PDP (review count, recent-purchase activity, shipping policy summary) were dispersed across the page; consolidating them into a single stack near the buy box will increase add-to-cart rate. Primary metric: add-to-cart. Secondary: conversion rate, time on PDP. Sample plan: 3-week test, 22,000 PDP sessions per variant, MDE 7 percent relative on ATC. Outcome: ATC lifted modestly (within the MDE), conversion rate lifted more (suggesting downstream confidence carry-through), time on PDP held. Shipped with a follow-up test isolating which elements of the stack drove the effect.

Why it worked: ATC chosen as primary because the volume supported it; conversion rate as secondary caught a downstream effect that wouldn't have powered as a primary. Follow-up test was already scoped before the first one shipped.

Three honest examples. None of these were 50-percent-conversion-lift wins; the headline lift on the Emani test was around 12 percent on retention, the Big Game ribbon was around 9 percent on cart-to-checkout, the Noble Paris stack was around 6 percent on conversion rate. Real ecom A/B test wins compound at this scale; the wins that headline at 50 percent are usually segment-only or under-powered.

§ 09 · five mistakes that kill validity

Five validity killers. Most teams hit two of them per test.

  1. Peeking and stopping early. Frequentist tests assume a fixed sample size set in advance. Looking at the data daily and stopping when the variant is ahead inflates false-positive rates from 5 percent to 20-30 percent. Either pre-commit to the sample-size duration, or use a sequential-testing framework (mSPRT, always-valid inference) designed for continuous monitoring.
  2. Sample-ratio mismatch (SRM). The two arms should receive roughly equal traffic; if one arm has materially more or less traffic than the other, the random-assignment assumption is broken and the result is invalid regardless of effect size. Run a chi-squared test on the variant traffic counts before reading the result (a worked check follows this list). SRM usually points at a flicker bug, an ad-blocker interaction, or a bot-traffic problem.
  3. No pre-registered metric tree. When the team picks the metric after seeing the data, they pick the metric that won. This is post-hoc rationalisation dressed as analysis. Pre-register primary, secondary, and guardrail metrics in the test brief, signed before launch.
  4. Underpowered tests. Most ecom A/B tests at $1M-$10M scale run with 5,000-10,000 sessions per variant and are powered to detect 30+ percent relative lifts. They generate ties on smaller real effects and flag noise as wins on the few tests where the noise lined up. Run the power calculation; if the math doesn't pencil, don't run the test.
  5. Running too short or too long. Two-week minimum to cover a full weekday-weekend cycle. Four-week maximum to avoid seasonal drift, paid-campaign rotation, and product-launch contamination. Tests that don't fit inside that window aren't really single tests; they're observational studies of a slowly drifting traffic mix.
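The SRM check from point 2, as code; scipy assumed, and the 0.001 alpha is the conventional strictness for SRM checks rather than a universal rule:

```python
from scipy.stats import chisquare

def srm_check(visitors_a, visitors_b, alpha=0.001):
    # Compare the observed split against the planned 50/50 allocation.
    # A strict alpha avoids flagging ordinary day-to-day wobble.
    total = visitors_a + visitors_b
    stat, p = chisquare([visitors_a, visitors_b], [total / 2, total / 2])
    return p < alpha  # True -> mismatch: stop and debug before reading results

print(srm_check(14_312, 13_109))  # True: a split this lopsided is a bug
```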
§ 10 · the 60-day operations cadence

A working testing operation, not a one-off audit.

A testing operations cadence at $1M-$10M scale runs roughly two tests in flight at any time, with a third in build and a fourth in research. Tests are roughly 2-4 weeks each, so the throughput is 8-16 tests a year on a tight programme. The cadence below is what we run with clients on the growth strategy retainer.

Days 1-7. Research. Session replay review, post-purchase survey reads, GA4 funnel-drop analysis, support-ticket review for friction signals. Output is a written hypothesis document with the four hypothesis-hierarchy answers from § 04, plus a candidate variant sketch.

Days 8-21. Build. Variant implementation. Client-side via the testing platform if visual; server-side via Shopify Functions or middleware if structural. GA4 wiring (custom dimensions for test ID and variant ID, server-side eventing for ad-blocked traffic recovery) verified before launch. Pre-launch checklist: SRM dry-run on QA traffic, power calculation re-confirmed, primary/secondary/guardrail metrics signed off.

Days 22-49. Observation. Test runs untouched for 2-4 weeks. Daily SRM check (just on traffic split, not on result). Weekly check-in but no decisions until the test reaches its planned sample size or the four-week cap. If a guardrail metric breaks (page errors, refund rate spikes), the test is paused and reviewed.

Days 50-60. Analysis and ship. Final read against pre-registered metric tree. Statistical significance evaluated on primary metric. Practical significance evaluated against the pre-registered threshold. Decision: ship, hold, or retest. Written post-test note added to the testing log. Next test moves from research to build; new test enters research.

The discipline that holds the cadence: every test has one written brief, one written post-test note, and one row in the testing log. The log is the growth-strategy team's source of truth for what's been tested, what won, and what's been learned. Without the log, repeated tests of the same hypothesis under different framing become inevitable, and the testing programme drifts into theatre.

For teams running this internally without an agency, the smallest viable stack: one client-side tool (Optimizely Web, VWO, or Convert); GA4 with Enhanced Ecommerce events; server-side eventing via Google Tag Manager Server-Side, plus a tool like Klaviyo's server-side connector for the email-side measurement; and a session-replay tool for qualitative grounding. Plus a written test brief template that covers the § 04 four questions and the § 05 metric tree.

For agencies considering whether to bring CRO in-house or hire it, the broader companion piece on benefits of hiring an ecommerce development agency covers the trade-off; the technical-SEO piece on Shopify SEO services covers the upstream traffic question that conversion-optimization sits on top of. Design-side considerations are covered in our web design service and the related UI/UX design service.

§ 11 · questions teams ask

Six honest answers.

What sample size do I need for an A/B test on an ecom site doing $1M-$10M a year?

Use a power calculation, not a heuristic. The inputs are baseline conversion rate, the minimum detectable effect (MDE) you care about, statistical significance (alpha, usually 0.05), and statistical power (1-beta, usually 0.8). At a 2.5 percent baseline conversion rate and a 10 percent relative MDE (meaning you want to detect 2.5 going to 2.75), the calculation lands at roughly 24,500 visitors per variant for a one-tailed test. At a 1.5 percent baseline (typical for cold-traffic landing pages) and the same 10 percent relative MDE the requirement climbs to about 41,000 per variant. Most ecom A/B tests run on 5,000-10,000 sessions per variant and are powered to detect a 30 percent relative lift at best; they call ties wins because they're underpowered. Practical implication for $1M-$10M brands: do not bother A/B testing micro-changes (headline tweaks, button colour) on the funnel below the cart. Reserve testing budget for changes large enough to clear a 15 or 20 percent relative MDE, or pool variants in a multivariate frame. The ab-test-sample-size tool on this site runs the calculation in 10 seconds.

When should I peek at A/B test results, and how do I avoid p-hacking?

Two honest answers depending on your statistics framework. Frequentist: never peek. The standard t-test assumes a fixed sample size set in advance, and looking at results mid-test then deciding to stop because the result looks promising inflates your false-positive rate from 5 percent to roughly 20-30 percent depending on how often you peek. If you must peek for operational reasons, use a sequential testing method like the Always Valid Inference family or mSPRT, which adjust significance thresholds for continuous monitoring. Bayesian: peek freely, but only act when the posterior probability of the variant being better crosses a pre-set threshold (typically 95 percent or 97.5 percent). The Bayesian frame is more peeking-tolerant by construction, but the prior selection matters and most platforms hide that decision. Practical guideline for ecom teams running tests in GA4 + a third-party platform: pre-register the test plan (hypothesis, primary metric, MDE, sample size, decision rule), set the test duration before launch, and ignore mid-test results unless you have a sequential test designed in.
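The Bayesian read, sketched with uniform Beta priors and Monte Carlo draws; counts and threshold are illustrative:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    # P(variant B's true rate > A's) under Beta(1, 1) priors: draw
    # from each arm's posterior and count how often B comes out ahead.
    wins = 0
    for _ in range(draws):
        a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
    return wins / draws

p = prob_b_beats_a(conv_a=250, n_a=10_000, conv_b=290, n_b=10_000)
print(p)         # ~0.96 for these counts
print(p > 0.95)  # act only once the pre-set threshold is crossed
```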

What's the difference between statistical significance and practical significance in A/B testing?

Statistical significance is the answer to a narrow question: given the data, is the difference between A and B unlikely to be random noise. Practical significance is the answer to the business question: is the difference large enough that we should ship the change. The two diverge constantly. A test with 200,000 sessions per variant can declare a 0.3 percent relative lift statistically significant at p < 0.05, but a 0.3 percent lift on a $5M store is roughly $15K incremental annual revenue, which is below the cost of the engineering work to ship the change permanently and below the noise floor on weekly revenue variation. The right CRO discipline pre-registers a minimum effect size that's worth shipping (typically 5 percent relative on a primary KPI for $1M-$10M brands, lower for $50M+). If the test wins statistically but doesn't clear that bar, you keep the control. The other failure mode (a test loses statistically but the variant is qualitatively better and the loss is within the noise) is harder; that's where qualitative evidence (session replay, user interviews, post-purchase surveys) earns its keep.

How do I avoid p-hacking when segmenting A/B test results by traffic source or device?

Pre-register your segmentation plan or apply a multiple-comparisons correction. The hazard is straightforward: if you slice an underpowered overall result into 10 segments (mobile vs desktop, paid vs organic, US vs RoW, new vs returning), at p < 0.05 you'd expect roughly 0.5 false positives per test purely by chance. Run that across enough tests and you produce a ranking of fake winners. Two disciplines hold against this. First, declare the primary segments in your test plan before launching: typically device (mobile vs desktop) for ecom because mobile and desktop user behaviour diverges materially. Treat any other segment as exploratory and do not ship decisions on it. Second, apply a multiple-comparisons correction like Bonferroni (divide alpha by the number of segments) or the more nuanced Benjamini-Hochberg false-discovery-rate procedure. The honest framing: segmentation is for hypothesis generation, not decision-making, unless the test is large enough and the segmentation is pre-registered.

Should I run A/B tests in GA4, in a third-party tool like Optimizely, or in Shopify Functions on the server side?

Use the layer that matches the change you're testing. Front-end visual changes (button copy, hero swap, image reorder) work well in client-side tools like Optimizely Web, VWO, or Convert: fast to deploy, easy to roll back, but flicker risk on slow-loading pages and ad-blocker blind spots that erode sample integrity. Back-end commerce changes (shipping rate logic, discount stack rules, cart eligibility, checkout extensibility) belong in Shopify Functions or Next.js middleware on the server side: no flicker, no client-side blocking, but slower to deploy and harder to roll back. GA4 is your measurement plane in either case; it's not a test orchestrator. The pattern that works for $5M-$50M ecom: visual tests in a client-side tool with GA4 + server-side events as the measurement, structural and pricing tests in Shopify Functions with server-side events as the measurement. Underneath both, a single source of truth for variant assignment that ties to the customer's eventual revenue (typically a custom dimension in GA4 carrying the test ID and variant). The pattern that doesn't work: running visual tests on a server-side platform (slow iteration) or running pricing tests on a client-side platform (flicker exposes the test to the user).

How long should an A/B test run on a typical $1M-$10M Shopify store?

At least one full business cycle, usually two weeks minimum, capped at four weeks. Two weeks because ecom traffic and conversion behaviour are weekday-vs-weekend asymmetric; one full week of each smooths out the cycle. Four weeks as a cap because traffic mix drifts (paid campaigns rotate, organic seasonality kicks in, returning customers behave differently from new customers). If your power calculation says you need eight weeks to hit the sample size, your MDE is too small for the traffic volume: either widen the MDE, choose a higher-funnel metric (add-to-cart instead of purchase), or accept that the change is not testable on this site at this revenue tier and ship it on engineering judgement. The hidden problem with tests that run too long is they collide with seasonal effects, holiday traffic, and product launches that aren't part of the test design; that's why a four-week cap protects test validity more than a longer test would help with sample size.

§ 12 · the next step

Bring your last six tests. We'll review them in 30 minutes.

A 30-minute conversion-optimization audit. We read your current test log, run a power-calculation pass on what's in flight, and flag the validity issues. Named lead engineer plus growth lead on the call, not a sales rep. Written audit returned within two business days. Prasun Anand on the call.