How analysis works
From raw events to lift, p-value, and CI — exactly what the platform computes, why, and how to interpret it without fooling yourself.
Analysis is the part of an experiment platform that is most often hand-waved. This page is a precise description of what ShipEasy computes for you, in language that doesn't require a stats degree, and an honest description of the assumptions behind it.
The pipeline
Daily aggregation fires
Once per day (default 03:00 UTC, configurable per project), ShipEasy starts an analysis pass for every active experiment. Failed runs are retried automatically.
Per-user aggregation
For each (experiment, user) pair, ShipEasy joins the user's events to their exposure timestamp and computes the metric value. Events that happened before exposure don't count — that's the rule that makes results causal.
Per-group statistics
For each (metric, group), ShipEasy computes mean, variance, and N. Outlier handling (winsorisation or capping) is applied symmetrically across all groups in the comparison.
Welch's t-test
For each variant against the control group (and against the universe holdout, when one exists), ShipEasy runs Welch's t-test and computes lift, two-sided p-value, and 95% confidence interval.
Persist results
Results are persisted to your project, one row per (experiment, metric, group, segment, day). The dashboard reads from these rows; the CLI does too. Export anytime via the API.
The whole pipeline is idempotent on the (experiment, metric, group, segment, day) key: if an analysis pass retries, it overwrites the row and never duplicates it.
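The aggregation and persistence steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not ShipEasy's implementation: the function names, the pooled-percentile cap, the local SQLite table layout, and the "all" sentinel for the all-users segment are all hypothetical.

```python
import sqlite3
from statistics import mean, variance


def shared_cap(values_by_group, pct=0.99):
    """Upper cap taken from the pooled values, so every group is capped alike."""
    pooled = sorted(v for vals in values_by_group.values() for v in vals)
    return pooled[min(len(pooled) - 1, int(pct * len(pooled)))]


def group_stats(values_by_group, pct=0.99):
    """Return {group: (n, mean, sample variance)} after capping outliers."""
    cap = shared_cap(values_by_group, pct)
    stats = {}
    for group, vals in values_by_group.items():
        capped = [min(v, cap) for v in vals]
        stats[group] = (len(capped), mean(capped), variance(capped))
    return stats


def persist(conn, experiment, metric, segment, day, stats):
    """Write one row per group; a retry overwrites rather than duplicates."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS experiment_results (
               experiment TEXT, metric TEXT, grp TEXT, segment TEXT, day TEXT,
               n INTEGER, mean REAL, variance REAL,
               PRIMARY KEY (experiment, metric, grp, segment, day))"""
    )
    for grp, (n, m, var) in stats.items():
        # INSERT OR REPLACE keys on the primary key, which makes reruns idempotent.
        # Assumption: a sentinel such as "all" stands in for the all-users row,
        # because SQLite treats NULLs inside a key as distinct from each other.
        conn.execute(
            "INSERT OR REPLACE INTO experiment_results VALUES (?,?,?,?,?,?,?,?)",
            (experiment, metric, grp, segment, day, n, m, var),
        )
    conn.commit()
```

Calling persist twice with the same arguments leaves one row per group. The cap is computed once from the pooled values, so no group is trimmed more aggressively than another.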
What a result row contains
For each (experiment, metric, group, segment, day):
| Column | Meaning |
|---|---|
| n | Number of users in the group with at least one exposure event in the window. |
| mean | The metric's average value across those users. |
| variance | Sample variance, used for the t-test. |
| lift_rel | Relative lift vs control: (mean - control_mean) / control_mean. |
| lift_abs | Absolute lift: mean - control_mean. |
| p_value | Two-sided p-value from Welch's t-test. |
| ci_low, ci_high | 95% confidence interval bounds for absolute lift. |
| df | Degrees of freedom (Satterthwaite approximation, used for the t-distribution). |
| status | ok, insufficient_data, insufficient_power, or srm_warning. |
The dashboard's pretty headline ("+8.3% (p=0.018)") is a render of these columns. Everything is queryable raw:
shipeasy experiments status checkout-cta --json > result.json
A typical result block:
purchase_conversion · last 7 days
control N=12,418 rate=4.8% var=0.0457
v1 N=12,503 rate=5.2% var=0.0492 lift +8.3% p=0.018 CI [+0.07%, +0.73%]
holdout N= 658 rate=4.7% var=0.0448 lift +10.6% p=0.041 CI [+0.04%, +0.91%]
Why Welch's t-test
Welch's is the default because:
- It does not assume equal variances between groups (Student's t-test does, and gets it wrong when traffic is unbalanced or when one variant materially shifts variance — both common in real experiments).
- It does not assume equal N between groups (which matters when allocation isn't 50/50, or when SRM happens, or when the holdout is 5% vs 95% non-holdout).
- It works for the metric types we ship: conversion (treated as Bernoulli; the central limit theorem covers normality after a few thousand samples) and count/sum/mean (CLT applies after ~1,000 users for non-pathological distributions, after ~10,000 for heavy-tailed ones).
- It's well-understood, easy to explain, and matches the math behind the headline numbers from most major experiment platforms (Optimizely, Statsig, Eppo, GrowthBook). When a stakeholder asks "why doesn't your number match the one in our analytics tool?" the answer is almost always different windows, different exposure semantics, or different outlier handling, not different statistics.
For very small samples (N < 30 per arm), the Satterthwaite degrees of freedom are low, so the t-distribution has heavier tails and the CI widens accordingly. Below N = 10 per arm, results are marked insufficient_data and not surfaced as decisions.
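To make the columns concrete, here is a minimal sketch of how lift, p-value, CI, and df can be derived from two groups' n, mean, and variance. The GroupStats class and welch function are hypothetical names. As a loudly stated simplification, the p-value and CI use the standard normal in place of the t-distribution, which is only reasonable when df is large (thousands of users per arm); a real implementation would use the t-distribution with the computed df.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class GroupStats:
    n: int
    mean: float
    variance: float  # sample variance


def welch(control: GroupStats, variant: GroupStats, alpha: float = 0.05) -> dict:
    # Per-group contribution to the variance of the difference in means.
    vc = control.variance / control.n
    vv = variant.variance / variant.n
    se = sqrt(vc + vv)
    lift_abs = variant.mean - control.mean
    t = lift_abs / se
    # Welch-Satterthwaite degrees of freedom.
    df = (vc + vv) ** 2 / (vc ** 2 / (control.n - 1) + vv ** 2 / (variant.n - 1))
    # Assumption: df is large, so the normal approximates the t-distribution.
    z = NormalDist()
    p_value = 2 * (1 - z.cdf(abs(t)))
    half_width = z.inv_cdf(1 - alpha / 2) * se
    return {
        "lift_abs": lift_abs,
        "lift_rel": lift_abs / control.mean,
        "p_value": p_value,
        "ci_low": lift_abs - half_width,
        "ci_high": lift_abs + half_width,
        "df": df,
    }
```

Note that no pooled variance appears anywhere: each group contributes its own variance divided by its own n, which is why unequal variances and unequal group sizes need no special handling.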
Why not Bayesian?
A reasonable alternative is a Bayesian approach: posterior distributions, "probability variant beats control = 96%". It has nice properties — peeking is less of a sin, the output is more intuitive. We don't ship it today for two reasons:
- Calibration depends on prior choice. A weakly-informative prior on a high-traffic experiment converges to roughly the same answer as the t-test, but a wrong prior on a low-traffic experiment can mislead more confidently than a frequentist test does.
- Most teams know how to read p-values. The cost of educating a team on a new framework, when the underlying decisions are the same, isn't worth it. We may add a Bayesian view later for advanced users; we won't make it the default.
Reading a p-value
p_value is "the probability you'd see a difference at least this large by chance, assuming the variants are actually identical." Conventional thresholds:
| p-value | Interpretation |
|---|---|
| < 0.01 | Strong evidence of a real effect. |
| < 0.05 | Evidence of a real effect. The standard threshold. |
| < 0.10 | Suggestive. Useful for exploratory work, not for shipping. |
| >= 0.10 | No statistically significant difference. |
Critical caveats:
- p-values are valid for fixed-horizon designs. Decide your duration up front. Stopping the moment something turns significant inflates false-positive rates dramatically: under continuous peeking with a stop-when-significant rule, the true false-positive rate at nominal α=0.05 can rise above 30%.
- p-values measure the strength of evidence against "no difference", not effect size. A +0.1%, p=0.001 result is real but probably not worth shipping. Read the CI before you ship.
- Multiple comparisons inflate the false-positive rate. If you run 20 experiments with no real effect at p<0.05, expect about 1 false win by chance. If you check 20 segments inside one experiment, same problem. Pre-register hypotheses; treat post-hoc segments as exploratory.
- The null is "no difference", not "the variant is bad". A non-significant result is "we don't have evidence", not "we have evidence of no effect". Use the CI to bound the effect.
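The peeking and multiple-comparison caveats are easy to check with a small simulation. The sketch below is illustrative only: it runs A/A tests (no real effect) on unit-variance Gaussian noise with a known-variance z-test, and compares stopping at any significant look against reading only the final look. The function names, the 20 looks, and the batch sizes are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across k independent null tests."""
    return 1 - (1 - alpha) ** k


def simulate_aa(looks=20, users_per_look=20, sims=500, alpha=0.05, seed=0):
    """A/A tests with unit-variance noise; returns (fixed_fpr, peeking_fpr)."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_hits = peeking_hits = 0
    for _ in range(sims):
        diff_sum, n, crossed = 0.0, 0, False
        for _ in range(looks):
            for _ in range(users_per_look):
                # One new user per arm; both arms share the same distribution.
                diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
            n += users_per_look
            z = (diff_sum / n) / sqrt(2 / n)  # known variance of 1 per arm
            crossed = crossed or abs(z) > crit
        fixed_hits += abs(z) > crit  # only the final look counts
        peeking_hits += crossed      # any look that crossed counts
    return fixed_hits / sims, peeking_hits / sims
```

With these settings the fixed-horizon rate stays near the nominal 5% while the stop-when-significant rate lands well above it, and adding more looks pushes it higher. familywise_error(20) shows the same inflation for 20 independent comparisons.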
Reading a confidence interval
The 95% CI is "a range of values consistent with the data at the 95% confidence level". If the CI is [+1.4%, +15.2%], you can say "the true lift is plausibly anywhere in this range; the data don't pin it down further".
CIs answer the question p-values don't: how big is the effect, plausibly?
- A CI of [+0.05%, +20%] is technically significant (excludes 0) but wide: you don't actually know if the lift is trivial or large. More N narrows the CI.
- A CI of [+4%, +6%] is narrow and excludes 0: strong, well-bounded evidence. Ship it.
- A CI of [-2%, +12%] is wide and includes 0: the experiment is inconclusive, not "no effect". Run it longer or ship if the downside risk is acceptable.
Sample size & MDE intuition
Power calculation is the inverse of the t-test: given a baseline rate, a desired lift, and a desired confidence, how many users per arm? The arithmetic is n ∝ σ² / δ² — variance over effect size squared. Practical implications:
- Halving the lift you want to detect quadruples the required N. Conversely, doubling traffic doesn't halve the lift you can detect; it shrinks it by a factor of √2 ≈ 1.4.
- Conversion baseline matters. Detecting +10% relative on a 50% baseline is far easier than on a 1% baseline. The variance p(1-p) peaks at 0.5, but the same relative lift is a much larger absolute difference there (5 points vs 0.1 points), and that dominates: for a fixed relative lift, required N scales with (1-p)/p.
- For revenue metrics, variance is the killer. A long-tailed revenue_per_user distribution can need 5–10× more N than the equivalent conversion metric. Winsorisation helps; switching to a less variable metric (e.g. purchase_conversion instead of revenue_per_user) helps more.
The dashboard surfaces an MDE (minimum detectable effect) alongside every running experiment, recomputed daily as data arrives. If your MDE after 7 days is +8% and the lift you're hoping to see is +2%, you'll need roughly (8/2)² = 16× more data — or you should accept that the experiment can't answer the question you asked.
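The n ∝ σ² / δ² relationship can be turned into a rough calculator. This is a sketch of the standard two-sample normal-approximation formula, not necessarily what the dashboard computes; the function names and the 80% power default are assumptions.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline: float, rel_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate N per arm for a conversion metric, two-sided test."""
    delta = baseline * rel_lift          # absolute effect to detect
    sigma2 = baseline * (1 - baseline)   # Bernoulli variance p(1-p)
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * sigma2 * z_total ** 2 / delta ** 2)


def extra_data_factor(current_mde: float, target_lift: float) -> float:
    """How many times more data is needed to bring the MDE down to target_lift."""
    return (current_mde / target_lift) ** 2
```

For example, extra_data_factor(8, 2) returns 16, matching the arithmetic above, and users_per_arm(0.5, 0.1) is far smaller than users_per_arm(0.01, 0.1) because the same relative lift is a much larger absolute difference at a 50% baseline.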
Exposure stitching
A user is exposed the first time experiments.assign() runs for them. The $exposure event captures (experiment, group, user_id, timestamp, attributes_at_exposure).
Conversion events after that timestamp count. Events before it don't. This is what makes the analysis causal — we're measuring "behaviour after the user saw the variant", not "all behaviour for users in this bucket".
Concretely:
user u_42 timeline:
10:00 page_view <- before exposure, doesn't count
10:05 page_view
10:07 $exposure (v1) <- bucketed into v1
10:09 click <- counts toward v1
10:14 purchase value=49 <- counts toward v1
If a user's anonymous_id later gets aliased to a user_id (client.alias()), exposures are stitched: pre-login behaviour is attributed to the right variant for the right user. The aliasing is applied retroactively at analysis time, so you don't need to log everything as user_id to get correct results.
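The exposure rule and alias stitching can be sketched in a few lines. The Event class, the attributable function, and the alias dictionary are hypothetical illustrations rather than the platform's data model; timestamps are zero-padded "HH:MM" strings so plain string comparison orders them correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user: str
    ts: str  # zero-padded "HH:MM"
    name: str
    value: float = 0.0


def attributable(events, aliases=None):
    """Keep only events that happened after the user's first exposure."""
    aliases = aliases or {}

    def resolve(user):
        # Map an anonymous_id to its user_id when an alias is known.
        return aliases.get(user, user)

    first_exposure = {}
    for e in sorted(events, key=lambda e: e.ts):
        if e.name == "$exposure":
            first_exposure.setdefault(resolve(e.user), e.ts)

    return [
        e for e in events
        if e.name != "$exposure"
        and resolve(e.user) in first_exposure
        # Strictly after exposure; ties are excluded in this sketch.
        and e.ts > first_exposure[resolve(e.user)]
    ]
```

Run on the u_42 timeline above, this keeps the 10:09 click and the 10:14 purchase and drops both page views. An event logged under an aliased anonymous_id after 10:07 would also be kept.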
Sample Ratio Mismatch (SRM)
If you allocated 50/50 but the actual exposure counts come back 51,200 / 48,800, that's almost certainly a bucketing bug, a redirect that drops one variant, or a tracking gap. The dashboard runs an SRM chi-squared test on every result and flips status = srm_warning when the imbalance is significant beyond chance (p < 0.001 on the ratio).
Don't trust an experiment with an SRM warning. It almost always means one variant's users are being filtered out somewhere upstream, which biases the comparison in ways the t-test can't fix.
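For a two-group split, the SRM check is a chi-squared goodness-of-fit test with one degree of freedom, which needs nothing beyond the standard library. The sketch below uses hypothetical function names and handles only the two-group case; the 0.001 threshold is the one stated above.

```python
from math import erfc, sqrt


def srm_p_value(n_a: int, n_b: int, share_a: float = 0.5) -> float:
    """Chi-squared goodness-of-fit p-value for a two-group split (1 df)."""
    total = n_a + n_b
    exp_a, exp_b = total * share_a, total * (1 - share_a)
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


def has_srm(n_a: int, n_b: int, share_a: float = 0.5,
            threshold: float = 0.001) -> bool:
    return srm_p_value(n_a, n_b, share_a) < threshold
```

The 51,200 / 48,800 split from the example gives a chi-squared statistic of 57.6, far past the threshold, while 50,100 / 49,900 does not trip the warning.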
Where stats stop and judgement starts
ShipEasy gives you the numbers. It does not tell you whether to ship. You should:
- Trust p < 0.05 + CI excludes 0 + direction matches your hypothesis + no guardrail regressed + no SRM warning as a green light.
- Trust multiple weeks of data, not "we just hit significance in hour 3".
- Be skeptical of segments you didn't pre-register.
- Sanity-check effect size against priors. A 30% lift on a button label is suspicious — check the wiring before you celebrate. A 2% lift on a layout change is plausible. If a result looks too good to be true, it usually is.
- Look at multiple metrics. A primary win that comes with a guardrail regression is a draw, not a win.
- Consider novelty effects. New things often look better in the first week and revert. For visible UI changes, run for at least 2 weeks before reading.
Exporting raw aggregates
Everything analysis produces is queryable. The CLI:
shipeasy experiments status checkout-cta --json > result.json
shipeasy experiments status checkout-cta --csv > result.csv
shipeasy experiments status checkout-cta --segment country=US
Or pull straight from the API:
curl -H "Authorization: Bearer $SHIPEASY_API_KEY" \
https://shipeasy.ai/api/experiments/checkout-cta/results
Or run SQL against the results table, if you've enabled read access on your project:
CLI d1 execute YOUR_DB --command \
"SELECT * FROM experiment_results WHERE experiment='checkout-cta' AND day >= date('now','-7 days')"
Pipe it into your warehouse if you want; we won't make you live in our dashboard.
API · result row
The group is control, a variant name, or holdout. The segment is a key=value pair such as country=US, or null for the all-users row.
The p-values are valid for fixed-horizon tests. If you sneak a look every hour and stop the moment something turns significant, you'll get false wins: the realised false-positive rate under continuous peeking can exceed 30% even when the nominal threshold is 5%. Pre-decide the experiment's duration based on traffic.
Changing params on a running experiment invalidates the analysis — the users exposed before the
change saw a different thing than those exposed after, but they're pooled into the same group.
Stop the experiment, create a new one with v2 in the name.
Stop reading. Start shipping.
You know what the platform computes, what to trust, and what to look at before you ship. The next experiment is the only thing left.