Statistical Significance

The threshold at which the observed difference between an A/B test's variant and its control is unlikely enough under the no-effect assumption to be treated as a real effect rather than random variation — conventionally a p-value below 0.05, equivalent to a 95% confidence level.

Also known as: Stat Sig, p-value, Significance Level, Confidence Level

A variant on an A/B test is ahead after a week, and someone wants to call it. Statistical significance is the threshold that says whether the lead is large enough to treat as a real effect rather than random variation between two equivalent buckets. The convention in DTC testing is a p-value below 0.05: a gap at least as large as the one observed would occur less than 5% of the time if the variant and the control were truly identical. Below the threshold, the result is read as signal; above it, the test is inconclusive, which is not the same as showing there is no effect.

What the p-value and confidence level mean

The p-value answers a specific question: if the variant and control had no real difference, how often would an experiment produce a result at least this extreme? A p-value of 0.03 means a result at least this extreme would appear in 3% of reruns of a truly flat experiment. The 95% confidence level is the complement of the 0.05 threshold: 1 − 0.05 = 0.95.
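As a rough illustration of where a p-value comes from, the sketch below runs a pooled two-proportion z-test on conversion counts. The function name, the two-sided choice, and the visitor and conversion counts are assumptions made for the example, not figures from a real test; testing tools may use a different test or a one-sided read.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test (normal approximation)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the no-effect assumption both arms share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a gap at least this large in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: control converts 500 of 10,000, variant 570 of 10,000.
p = two_proportion_p_value(500, 10_000, 570, 10_000)
print(f"p-value: {p:.4f}, significant at 0.05: {p < 0.05}")
```

With these made-up counts the p-value lands below 0.05, so the gap would clear the conventional threshold.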

A common misreading: “95% confidence” is not “95% probability the variant is better than the control.” The frequentist machinery does not assign a probability to the variant being better; it tells you how compatible the observed data is with the no-effect assumption. A 95% confidence interval contains the true effect in 95% of repeated experiments — it is not a 95% chance that this particular interval contains it.
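The "95% of repeated experiments" reading can be checked by simulation. The sketch below assumes a known true difference, reruns a hypothetical experiment many times, and counts how often a Wald-style 95% interval captures that difference. The rates, sample sizes, and function names are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald interval for the difference in conversion rates (variant minus control)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

def simulate_coverage(rate_a, rate_b, n, reps, seed=0):
    """Share of simulated experiments whose interval contains the true difference."""
    rng = random.Random(seed)
    true_diff = rate_b - rate_a
    hits = 0
    for _ in range(reps):
        conv_a = sum(rng.random() < rate_a for _ in range(n))
        conv_b = sum(rng.random() < rate_b for _ in range(n))
        low, high = diff_confidence_interval(conv_a, n, conv_b, n)
        hits += low <= true_diff <= high
    return hits / reps

# Hypothetical true rates: 5% control, 6% variant, 1,000 sessions per arm.
coverage = simulate_coverage(0.05, 0.06, n=1000, reps=1000)
print(f"intervals containing the true difference: {coverage:.1%}")
```

The coverage figure describes the procedure across many reruns; any single interval either contains the true difference or does not.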

Significance depends on sample size and expected lift

A test on low-traffic pages, or one chasing a small expected lift, can run for weeks without resolving. The threshold is fixed; the experiment’s ability to clear it is not. Whether a test can detect the lift the operator hopes for is set by traffic, baseline conversion rate, and how small a difference the test is asked to see. That smallest difference is the minimum detectable effect (MDE), and it drives the required sample size. A “not significant” read on a small storefront is often an under-powered test, not a flat variant.
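A back-of-envelope sample-size calculation makes the dependence on baseline rate and minimum detectable effect concrete. The sketch below uses the standard normal-approximation formula for comparing two proportions; the 80% power target, the 3% baseline, and the function name are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate sessions needed per arm to detect a relative lift (two-sided test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical storefront: 3% baseline conversion rate.
n_10 = sample_size_per_arm(0.03, 0.10)  # detect a 10% relative lift
n_05 = sample_size_per_arm(0.03, 0.05)  # detect a 5% relative lift
print(f"10% lift: {n_10:,} sessions per arm")
print(f" 5% lift: {n_05:,} sessions per arm")
```

Under these assumptions, halving the lift the test must detect roughly quadruples the traffic required, which is why small storefronts so often end up with inconclusive reads.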

The operator pitfalls

Three failure modes recur on DTC test programs.

  1. Peeking

    Checking a running test and stopping the moment a variant looks ahead inflates false positives substantially. The 5% threshold assumes one decision at the planned end of the test; once a team can stop early on a favorable read, the effective error rate climbs well above 5%. Sequential-testing methods address this; the standard p-value does not.
  2. Parallel tests without correction

    Run 20 simultaneous tests where nothing is really happening and, by chance, about one will read significant at p < 0.05. The threshold is per-test, not per-program. Multiple-comparisons corrections (Bonferroni and similar) tighten the per-test threshold to hold the program-wide rate.
  3. "Directionally significant."

    Not a real category. The test cleared the threshold or it did not. Calling a p = 0.12 result “directionally significant” is how teams ship variants that have not earned the call.
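The first two pitfalls can be put in numbers. The sketch below first works the arithmetic for 20 parallel tests, assuming the tests are independent, and then simulates A/A tests (no real difference) to compare a team that checks after every batch of traffic against one that decides only at the planned end. The batch size, number of looks, conversion rate, and function names are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

# Parallel tests: 20 independent tests with no real effect, each at alpha = 0.05.
tests, alpha = 20, 0.05
expected_false_positives = tests * alpha            # about one, on average
chance_of_at_least_one = 1 - (1 - alpha) ** tests   # program-wide error rate
bonferroni_alpha = alpha / tests                    # tightened per-test threshold

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions in either arm: nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(rate=0.05, batch=200, looks=10, reps=500, seed=1):
    """Return (peeking rate, fixed-horizon rate) across simulated A/A tests."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(reps):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                ever_significant = True  # a peeking team would stop and ship here
        peeking_hits += ever_significant
        fixed_hits += p < alpha          # single decision at the planned end
    return peeking_hits / reps, fixed_hits / reps

peeking_rate, fixed_rate = false_positive_rates()
print(f"chance of at least one false positive in {tests} tests: {chance_of_at_least_one:.0%}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha}")
print(f"false positives with peeking: {peeking_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

The simulation uses the plain p-value at every look, which is exactly the practice the peeking item warns against; it is not a sequential-testing method.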

Significance is necessary, not sufficient

A stat-sig win on CVR can still be a revenue loss if AOV regressed, or a margin loss if the variant shifted the mix toward discounted SKUs. The threshold says the effect on the primary metric is real; it does not say the business outcome is positive. The operator move is to pre-register the primary metric — usually revenue per session, not CVR — and treat significance as the entry ticket for a decision, not the decision itself.
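A small worked example shows how a CVR win can coexist with a revenue loss. Revenue per session is conversion rate times average order value; the rates and order values below are made up for illustration.

```python
def revenue_per_session(cvr, aov):
    """Revenue per session = conversion rate x average order value."""
    return cvr * aov

# Hypothetical numbers: the variant converts better but at a lower order value.
control_rps = revenue_per_session(cvr=0.030, aov=80.0)
variant_rps = revenue_per_session(cvr=0.032, aov=72.0)

cvr_change = 0.032 / 0.030 - 1
rps_change = variant_rps / control_rps - 1
print(f"CVR change: {cvr_change:+.1%}, revenue per session change: {rps_change:+.1%}")
```

In this sketch the variant lifts CVR by several percent while revenue per session falls, which is the case the primary-metric choice is meant to catch.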
