An A/B test is a randomized controlled experiment. Traffic is split between a control experience and a variant by a deterministic hash, and the difference in a pre-declared primary metric between the two arms is read as the causal effect of the change. Randomization is what licenses the causal claim: without it, the gap between arms is confounded with selection.
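A minimal sketch of deterministic hash assignment, assuming a string identifier for the randomization unit and a per-experiment salt. The function name `assign_arm`, the SHA-256 choice, and the 10,000-bucket granularity are illustrative assumptions, not how any particular platform does it.

```python
import hashlib

BUCKETS = 10_000  # assumed granularity: 0.01% of traffic per bucket


def assign_arm(unit_id: str, experiment: str, control_share: float = 0.5) -> str:
    """Map a randomization unit to an arm; the same inputs always give the same arm."""
    # Salting with the experiment name keeps assignments uncorrelated across tests.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and fold it into a fixed bucket range.
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return "control" if bucket < control_share * BUCKETS else "variant"


if __name__ == "__main__":
    print(assign_arm("visitor-123", "pdp-headline-test"))
```

Because the arm depends only on the identifier and the experiment name, a returning unit sees the same experience without any stored assignment table.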
Before launch, an operator names six things: the control, one or more variants, the primary metric, the unit of randomization, the traffic split, and the runtime. The unit is usually visitor, user, or session. Visitor-level assignment holds across sessions, session-level assignment resets with each new session, and the two answer different causal questions. Pre-declaring the primary metric prevents HARKing (hypothesizing after results are known) and p-hacking by metric shopping. Statistical significance governs whether the answer is trustworthy, and the minimum detectable effect governs how small a lift the test can resolve.
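The link between runtime, traffic, and minimum detectable effect can be sketched with the standard normal-approximation sample size for comparing two conversion rates. This is an illustration under stated assumptions (two-sided test, equal split, a conversion-rate primary metric); the function name and defaults are hypothetical, and experimentation platforms may use different methods.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, relative_mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate units needed per arm to detect a relative lift in a conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate under the smallest lift worth detecting
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # summed Bernoulli variances of both arms
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    print(sample_size_per_arm(baseline_rate=0.03, relative_mde=0.10))
```

With a 3% baseline conversion rate and a 10% relative lift, this approximation asks for roughly 53,000 units per arm. Halving the detectable lift roughly quadruples the requirement, which is why the runtime has to be fixed against realistic traffic before launch.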
On a Shopify-shaped storefront, A/B tests typically run inside a third-party experimentation platform: Convert, VWO, or Optimizely Web for UX and copy, Intelligems for pricing and promo. Operators read the result in the analytics tool against full revenue, especially when the primary metric is checkout conversion that feeds customer acquisition cost (CAC) math.
Failure modes on small direct-to-consumer (DTC) traffic are predictable. Under-powered tests, where the realistic lift is smaller than the minimum detectable effect, often return inconclusive reads. Sample ratio mismatch (SRM), where the observed split deviates from the intended one, signals an assignment bug and invalidates the run. Peeking, meaning stopping the test as soon as an interim read crosses significance, inflates false positives. Novelty effects flatter the variant early, then decay. A sale weekend inside the window mixes seasonality into the treatment.
A/B testing is the wrong tool for paid-media incrementality, where platform-level contamination breaks assignment; use a geo-lift test or a holdout read against marketing efficiency ratio (MER) instead. It also fails when one arm’s experience leaks into the other (network effects, shared inventory); a switchback design fits better.