A variant on an A/B test is ahead after a week, and someone wants to call it. Statistical significance is the threshold that says whether the lead is large enough to treat as a real effect rather than random variation between two equivalent buckets. The convention in DTC testing is a p-value below 0.05: a gap at least as large as the one observed would occur fewer than 5% of the time if the variant and the control were truly identical. Below the threshold, the result is read as signal; above it, the test has not yet decided.
What the p-value and confidence level mean
The p-value answers a specific question: if the variant and control had no real difference, how often would an experiment produce a result at least this extreme? A p-value of 0.03 means a result at least this favorable to the variant would appear in 3% of reruns of a truly flat experiment. The 95% confidence level is the complement of the 0.05 threshold: 1 − 0.05 = 0.95.
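As a rough illustration of where such a number comes from, the sketch below computes a p-value for two conversion rates with a pooled two-proportion z-test. The test choice, the two-sided reading, the function name, and the traffic figures are all assumptions made for the example; a testing tool may use a different method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rate (pooled z-test).

    conv_a / n_a: conversions and sessions in the control.
    conv_b / n_b: conversions and sessions in the variant.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the no-effect assumption both buckets share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    # Probability of a gap at least this large, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical week of traffic: 3.00% vs 3.45% on 10,000 sessions each.
p_value = two_proportion_p_value(300, 10_000, 345, 10_000)
print(f"p = {p_value:.3f}, significant at 0.05: {p_value < 0.05}")
```

With these made-up numbers the variant is ahead by 15% in relative terms, yet the p-value lands above 0.05, so the test has not yet decided.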
A common misreading: “95% confidence” is not “95% probability the variant is better than the control.” The frequentist machinery does not assign a probability to the variant being better; it tells you how compatible the observed data is with the no-effect assumption. A 95% confidence interval contains the true effect in 95% of repeated experiments — it is not a 95% chance that this particular interval contains it.
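The "95% of repeated experiments" reading can be checked with a small simulation. This is a sketch under simplifying assumptions: the true effect and standard error are hypothetical and treated as known, and each experiment's estimate is drawn from a normal distribution around the true effect.

```python
import random
from statistics import NormalDist


def coverage_rate(true_effect=0.002, std_error=0.001, runs=10_000, seed=0):
    """Share of simulated experiments whose 95% interval contains the true effect."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
    hits = 0
    for _ in range(runs):
        # One experiment: a noisy estimate of the (normally unknown) true effect.
        estimate = rng.gauss(true_effect, std_error)
        lower = estimate - z * std_error
        upper = estimate + z * std_error
        if lower <= true_effect <= upper:
            hits += 1
    return hits / runs


print(f"coverage across reruns: {coverage_rate():.3f}")
```

The coverage is a property of the procedure across many reruns. Any single interval either contains the true effect or does not.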
Significance depends on sample size and expected lift
A test on a low-traffic page, or one chasing a small expected lift, can run for weeks without resolving. The threshold is fixed; the experiment’s ability to clear it is not. Whether a test can detect the lift the operator hopes for is set by traffic, baseline conversion rate, and how small a difference the test is asked to see. That smallest difference is the minimum detectable effect (MDE), and it drives the required sample size. A “not significant” read on a small storefront is often an under-powered test, not a flat variant.
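To make the dependence on traffic and lift concrete, here is a minimal sketch of a common sample-size approximation for comparing two conversion rates. The 80% power target, the two-sided 0.05 threshold, the function name, and the 3% baseline are assumptions for illustration, not figures from a specific tool.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate sessions needed in each bucket to detect a relative lift."""
    variant_rate = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the two buckets.
    variance = (baseline_rate * (1 - baseline_rate)
                + variant_rate * (1 - variant_rate))
    gap = variant_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / gap ** 2)


# Hypothetical storefront with a 3% baseline conversion rate.
for lift in (0.20, 0.10, 0.05):
    needed = sessions_per_arm(0.03, lift)
    print(f"{lift:.0%} relative lift: about {needed:,} sessions per arm")
```

Under these assumptions a 10% relative lift on a 3% baseline needs roughly 53,000 sessions per arm, and halving the lift the test is asked to see roughly quadruples the traffic required.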
The operator pitfalls
Three failure modes recur on DTC test programs.
- Peeking. Checking a running test and stopping the moment a variant looks ahead inflates false positives substantially. The 5% threshold assumes one decision at the planned end of the test; once a team can stop early on a favorable read, the effective error rate climbs well above 5%. Sequential-testing methods address this; the standard p-value does not.
- Parallel tests without correction. Run 20 simultaneous tests where nothing is really happening and, by chance, about one will read significant at p < 0.05. The threshold is per-test, not per-program. Multiple-comparisons corrections (Bonferroni and similar) tighten the per-test threshold to hold the program-wide rate.
- "Directionally significant." Not a real category. The test cleared the threshold or it did not. Calling a p = 0.12 result “directionally significant” is how teams ship variants that have not earned the call.
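The first two pitfalls can be put in numbers. The sketch below is illustrative only: the function names are hypothetical, the tests are assumed independent, and peeking is modeled in a simplified way, with each look adding an equal batch of traffic and the test stopped the first time the running z-score crosses the two-sided 0.05 threshold.

```python
import random
from statistics import NormalDist


def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance that at least one of several flat, independent tests reads significant."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that holds the program-wide rate at alpha."""
    return alpha / num_tests


def peeking_false_positive_rate(looks=10, runs=5_000, alpha=0.05, seed=0):
    """Share of truly flat tests called 'significant' when stopping at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(runs):
        running_sum = 0.0
        for look in range(1, looks + 1):
            # Each batch adds pure noise: there is no real effect to find.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team stops and ships on the favorable read
    return false_positives / runs


print(f"20 flat tests, at least one 'win': {family_wise_error_rate(20):.0%}")
print(f"Bonferroni per-test threshold for 20 tests: {bonferroni_threshold(20):.4f}")
print(f"one planned look: {peeking_false_positive_rate(looks=1):.1%}")
print(f"ten peeks:        {peeking_false_positive_rate(looks=10):.1%}")
```

With 20 independent flat tests, the chance of at least one false win is roughly 64%, and Bonferroni answers that by moving the per-test threshold from 0.05 to 0.0025. The peeking simulation shows the same inflation from a different source: more chances to stop on noise.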
Significance is necessary, not sufficient
A stat-sig win on CVR can still be a revenue loss if AOV regressed, or a margin loss if the variant shifted the mix toward discounted SKUs. The threshold says the effect on the primary metric is real; it does not say the business outcome is positive. The operator move is to pre-register the primary metric — usually revenue per session, not CVR — and treat significance as the entry ticket for a decision, not the decision itself.
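A quick arithmetic check shows how a CVR win can hide a revenue loss. The figures below are hypothetical, and revenue per session is taken as conversion rate times average order value.

```python
def revenue_per_session(conversion_rate, average_order_value):
    """Revenue per session as conversion rate times average order value."""
    return conversion_rate * average_order_value


# Hypothetical results: the variant converts better but at a lower AOV.
control_cvr, control_aov = 0.030, 80.00
variant_cvr, variant_aov = 0.033, 70.00

control_rps = revenue_per_session(control_cvr, control_aov)
variant_rps = revenue_per_session(variant_cvr, variant_aov)

cvr_lift = variant_cvr / control_cvr - 1
rps_lift = variant_rps / control_rps - 1

print(f"CVR lift: {cvr_lift:+.1%}")
print(f"Revenue per session lift: {rps_lift:+.2%}")
```

In this example CVR rises 10% while revenue per session falls by about 3.75%, which is why the primary metric is worth fixing before the test starts.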