Metrics
Define what you measure — primary metrics, guardrails, outliers, segmentation, and how to keep results honest.
A metric is a number computed per user from a set of events. Every experiment has at least one primary metric (the thing you're trying to move) and ideally a couple of guardrails (things you don't want to break). Get these two choices right and most of your experimentation problems go away on their own.
Aggregation types
ShipEasy ships four single-event aggregation functions (ratio metrics, covered below, combine two events). Each one collapses a user's event stream into a single number per analysis window:
| Type | What it computes per user | Use it for | Variance |
|---|---|---|---|
| conversion | 1 if the event happened at least once, else 0 | Did they buy? Did they retain? | Bounded: p(1-p), at most 0.25. The friendliest. |
| count | Number of events | Sessions, page views, clicks. | Long-tailed; outliers possible. |
| sum | Sum of a numeric event property | Revenue, time spent, items added. | Heavy-tailed; outliers a real problem. |
| mean | Average of a numeric event property | Order value, session length. | Same as sum, plus zero-handling. |
Conversion is the simplest and statistically the friendliest: the per-user variance is p(1-p), which never exceeds 0.25, so power calculations are cheap and the t-test behaves. Means and sums need more samples and benefit from outlier handling (see below).
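To make the four aggregations concrete, here is a minimal Python sketch of how an event stream might collapse into one number per exposed user. The `aggregate` function, the tuple layout of an event, and the `exposed_users` argument are all assumptions made for illustration; this is not ShipEasy's implementation.

```python
def aggregate(events, exposed_users, kind, event_name, prop=None):
    """Collapse (user_id, event_name, properties) tuples into one number per exposed user.

    Hypothetical helper: the event layout and argument names are assumptions.
    """
    per_user = {user_id: [] for user_id in exposed_users}
    for user_id, name, props in events:
        if name == event_name and user_id in per_user:
            # sum/mean read a numeric property; conversion/count only need the occurrence
            per_user[user_id].append(float(props.get(prop, 0.0)) if prop else 1.0)

    result = {}
    for user_id, vals in per_user.items():
        if kind == "conversion":
            result[user_id] = 1 if vals else 0
        elif kind == "count":
            result[user_id] = len(vals)
        elif kind == "sum":
            result[user_id] = sum(vals)
        elif kind == "mean":
            # zero-event users are included at 0 here, mirroring the documented default
            result[user_id] = sum(vals) / len(vals) if vals else 0.0
        else:
            raise ValueError(f"unknown aggregation type: {kind}")
    return result


events = [
    ("u1", "purchase", {"value": 20.0}),
    ("u1", "purchase", {"value": 40.0}),
    ("u2", "session_start", {}),
    ("u3", "purchase", {"value": 10.0}),
]
exposed = ["u1", "u2", "u3"]

conversion = aggregate(events, exposed, "conversion", "purchase")
purchases = aggregate(events, exposed, "count", "purchase")
revenue = aggregate(events, exposed, "sum", "purchase", prop="value")
order_value = aggregate(events, exposed, "mean", "purchase", prop="value")
```

Note that `u2` was exposed but never purchased, so it shows up as 0 in every purchase metric rather than disappearing from the sample.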
When the metric is "no event"
For conversion, a user with zero matching events contributes a 0. For mean, you have a choice: include zero-event users at 0, or drop them entirely. The default is include — "average revenue across all exposed users, including non-buyers" — because that's the question business stakeholders almost always actually ask. You can flip it with --zero-handling drop when you want "average order value among buyers".
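The difference between the two zero-handling modes is easiest to see on a tiny sample. The sketch below assumes per-user values where `None` marks a user with no matching event; the function name `metric_mean` is hypothetical, and only the `include` and `drop` mode names come from the text above.

```python
def metric_mean(per_user_values, zero_handling="include"):
    """Average a per-user metric, treating None as 'no matching event'."""
    if zero_handling == "include":
        # every exposed user counts; users with no event contribute 0
        vals = [v if v is not None else 0.0 for v in per_user_values.values()]
    elif zero_handling == "drop":
        # only users with at least one matching event count
        vals = [v for v in per_user_values.values() if v is not None]
    else:
        raise ValueError(f"unknown zero handling: {zero_handling}")
    return sum(vals) / len(vals) if vals else float("nan")


order_values = {"u1": 30.0, "u2": None, "u3": 10.0}

across_all_users = metric_mean(order_values, "include")  # (30 + 0 + 10) / 3
among_buyers = metric_mean(order_values, "drop")         # (30 + 10) / 2
```

The same data yields two different answers because the two modes answer two different questions, so the choice should follow the question being asked rather than whichever number looks better.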
Creating a metric
# binary conversion on `purchase`
shipeasy metrics create purchase_conversion --type conversion --event purchase
# revenue per user (includes non-buyers as $0)
shipeasy metrics create revenue_per_user --type sum --event purchase --property value
# average order value (across buyers only)
shipeasy metrics create avg_order_value --type mean --event purchase --property value --zero-handling drop
# sessions per user
shipeasy metrics create sessions --type count --event session_start
Or in the dashboard: Experiments → Metrics → New metric.
Filtering events into a metric
You can tighten what counts toward a metric with filters — same shape as gate targeting rules. Filters run against the event's properties payload before the aggregation:
shipeasy metrics create paid_purchase \
--type conversion --event purchase \
--filter '[{"attr":"channel","op":"eq","value":"organic"}]'
Now paid_purchase only counts purchase events whose payload has channel set to "organic". Multiple filters are combined with an implicit AND. Common shapes:
# Web purchases only (exclude mobile app)
--filter '[{"attr":"platform","op":"eq","value":"web"}]'
# Orders above $10 (exclude $0 promo redemptions)
--filter '[{"attr":"value","op":"gte","value":10}]'
# Multiple conditions
--filter '[
{"attr":"platform","op":"eq","value":"web"},
{"attr":"country","op":"in","value":["US","CA","GB"]}
]'
Primary vs guardrail
When you attach metrics to an experiment, you mark each one as primary or guardrail.
- Primary: the metric you're trying to move. The experiment's "did it win?" decision is read off these. Pick one or two; if you have five primaries you have none.
- Guardrail: metrics that must not regress. A new checkout flow had better not tank page load time. ShipEasy flags any guardrail that moves significantly in the wrong direction, even if your primary won. By default, the dashboard treats a guardrail as failed when p < 0.05 and the lift is in the bad direction.
A common pattern for a checkout experiment:
| Role | Metric | Desired direction |
|---|---|---|
| Primary | purchase_conversion | up |
| Guardrail | error_rate | down |
| Guardrail | p95_page_load_ms | down |
| Guardrail | support_ticket_rate (computed offline) | down |
The primary tells you "did the change work?", the guardrails tell you "did it work without breaking something else?". A win that ships is one where the primary moves and no guardrail regresses.
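The decision rule described above can be sketched in a few lines of Python. The `MetricResult` type and both function names are hypothetical, and requiring every primary to win is an assumption of this sketch; only the guardrail rule (p < 0.05 and lift in the bad direction) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    role: str        # "primary" or "guardrail"
    direction: str   # "up" or "down": the direction that counts as good
    lift: float      # relative lift of variant over control, e.g. 0.083
    p_value: float


def moved_significantly(result, good, alpha=0.05):
    """True if the metric moved significantly in the good (or bad) direction."""
    is_good = result.lift > 0 if result.direction == "up" else result.lift < 0
    is_bad = result.lift < 0 if result.direction == "up" else result.lift > 0
    return (is_good if good else is_bad) and result.p_value < alpha


def ship_decision(results, alpha=0.05):
    failed = [r.name for r in results
              if r.role == "guardrail" and moved_significantly(r, good=False, alpha=alpha)]
    primaries = [r for r in results if r.role == "primary"]
    # Assumption: every primary must win; a team could reasonably choose "any" instead.
    won = bool(primaries) and all(moved_significantly(r, good=True, alpha=alpha)
                                  for r in primaries)
    return {"ship": won and not failed, "failed_guardrails": failed}


decision = ship_decision([
    MetricResult("purchase_conversion", "primary", "up", 0.083, 0.018),
    MetricResult("error_rate", "guardrail", "down", 0.020, 0.400),
    MetricResult("p95_page_load_ms", "guardrail", "down", 0.060, 0.010),
])
```

In this made-up example the primary wins, the error-rate change is not significant, and the page-load regression is, so the experiment would be held back on `p95_page_load_ms`.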
Outliers
For sum and mean, a single $50,000 enterprise purchase can swing the mean for thousands of users. ShipEasy supports two outlier handlers per metric:
- Winsorise at a configurable percentile (default p99). Anything above is clamped to the p99 value of the combined sample. Default for sum and mean.
- Cap at an absolute value. Use this when there's a domain-specific ceiling (e.g. a maximum plausible session length).
# Winsorise at p99 (default)
shipeasy metrics create revenue_per_user --type sum --event purchase --property value \
--winsorise p99
# Hard cap at $1000
shipeasy metrics create revenue_per_user --type sum --event purchase --property value \
--cap 1000
Winsorising is the default and is rarely wrong. The trade-off: clamping reduces variance (good: narrower CI, easier to detect lift) at the cost of slightly understating the true effect when the variant genuinely moves the tail (rare).
The p99 cutoff is computed from the combined control + variant sample, then applied to both. This prevents the bug where one variant accidentally has its outliers preserved and looks artificially better.
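As an illustration of why the cutoff comes from the combined sample, here is a small Python sketch. The function names are hypothetical, and the linear-interpolation percentile is an assumption: the text does not say which percentile definition the platform uses.

```python
import math


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def winsorise_arms(control, variant, q=99.0):
    """Clamp both arms to one cutoff computed from the combined sample."""
    cutoff = percentile(sorted(control + variant), q)
    clamped_control = [min(x, cutoff) for x in control]
    clamped_variant = [min(x, cutoff) for x in variant]
    return clamped_control, clamped_variant, cutoff


# One enterprise-sized purchase lands in control purely by chance.
control = [10.0] * 99 + [50000.0]
variant = [12.0] * 100

clamped_control, clamped_variant, cutoff = winsorise_arms(control, variant)
raw_control_mean = sum(control) / len(control)
clamped_control_mean = sum(clamped_control) / len(clamped_control)
```

Before clamping, the single large order pushes the control mean above 500 and hides the variant's improvement; after clamping, both arms are measured against the same ceiling and the comparison is between typical users again.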
Ratio metrics
Some questions are inherently ratios — "clicks per impression", "conversion per visit", "revenue per session". Define them with two events: a numerator and a denominator.
shipeasy metrics create click_through_rate \
--type ratio \
--numerator-event click \
--denominator-event impression
Ratio metrics use the delta method to compute variance correctly. The naive calculation, which treats every denominator event as an independent observation, understates variance and inflates significance because events from the same user are correlated. The dashboard shows the per-user numerator and denominator alongside the ratio so the math is auditable.
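For readers who want to audit the math, the standard first-order delta-method approximation for a ratio of per-user means can be sketched as follows. The function name is hypothetical and this is the textbook formula, not necessarily the exact computation ShipEasy runs.

```python
from statistics import mean, variance


def ratio_with_delta_se(numerators, denominators):
    """Ratio of means and its delta-method standard error.

    Inputs are per-user totals, e.g. clicks and impressions for each user.
    Var(R) ~= (var_x - 2*R*cov_xy + R^2 * var_y) / (n * mean_y^2)
    """
    n = len(numerators)
    if n != len(denominators) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 users")
    mean_x, mean_y = mean(numerators), mean(denominators)
    if mean_y == 0:
        raise ValueError("denominator mean is zero")
    ratio = mean_x / mean_y
    var_x = variance(numerators, mean_x)
    var_y = variance(denominators, mean_y)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(numerators, denominators)) / (n - 1)
    var_ratio = (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * mean_y ** 2)
    # guard against tiny negative values caused by floating-point rounding
    return ratio, max(var_ratio, 0.0) ** 0.5


clicks = [1.0, 0.0, 3.0, 2.0]
impressions = [10.0, 5.0, 20.0, 15.0]
ctr, ctr_se = ratio_with_delta_se(clicks, impressions)
```

The covariance term is what the naive calculation misses: users with more impressions tend to have more clicks, and ignoring that link misstates how much the ratio can wobble from sample to sample.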
Segmentation
Once results are in, slice them by user attribute:
| purchase_conversion | control | v1 | lift | p |
|---|---|---|---|---|
| all | 4.8% | 5.2% | +8.3% | 0.018 |
| country = US | 5.1% | 5.4% | +5.9% | 0.041 |
| country = EU | 4.5% | 4.9% | +8.9% | 0.062 |
| plan = pro | 7.8% | 9.4% | +20.5% | 0.001 |
| plan = free | 3.9% | 4.0% | +2.6% | 0.401 |
| device = mobile | 4.2% | 4.9% | +16.7% | 0.006 |
| device = desktop | 5.6% | 5.6% | +0.0% | 0.974 |

Segmentation works on any attribute you register on the project. The platform stores attributes_at_exposure on the $exposure event, so segmentation is on the user's state at the moment they entered the experiment: a user who upgraded from free to pro mid-experiment stays in the free segment. This is the right behaviour: it's the only way segmentation is causal.
You don't need to ask for a segment up front — the platform recomputes them on demand from the existing exposure + event data, no re-bucketing required.
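A rough Python sketch of on-demand segmentation from stored exposures is shown below. Only `attributes_at_exposure` is taken from the text; the `user_id` and `group` field names, the function name, and the in-memory data shapes are assumptions made for illustration.

```python
from collections import defaultdict


def conversion_by_segment(exposures, converted_users, attribute):
    """Conversion rate per (segment, group), using attributes frozen at exposure."""
    totals = defaultdict(lambda: [0, 0])  # (segment, group) -> [users, converters]
    for exposure in exposures:
        # read the snapshot taken at exposure, never the user's current attributes
        segment = exposure["attributes_at_exposure"].get(attribute, "unknown")
        key = (segment, exposure["group"])
        totals[key][0] += 1
        totals[key][1] += int(exposure["user_id"] in converted_users)
    return {key: converters / users for key, (users, converters) in totals.items()}


exposures = [
    {"user_id": "u1", "group": "control", "attributes_at_exposure": {"plan": "free"}},
    {"user_id": "u2", "group": "v1", "attributes_at_exposure": {"plan": "free"}},
    {"user_id": "u3", "group": "v1", "attributes_at_exposure": {"plan": "pro"}},
    {"user_id": "u4", "group": "v1", "attributes_at_exposure": {"plan": "pro"}},
]
converted = {"u2", "u3"}

rates = conversion_by_segment(exposures, converted, "plan")
```

Because the segment is read from the exposure record, a user such as `u2` who later upgrades still counts toward the free segment, and any new attribute can be sliced without touching the bucketing.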
Hypothesising After Results are Known (HARKing), that is, combing through segments until you find one that's significant, gives false wins. With 20 segments and a p < 0.05 threshold, you expect about one false positive by pure chance even when the change has no effect at all. Pre-register the one or two segments you actually care about. Treat anything else as exploratory and label it as such on the dashboard.
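The arithmetic behind that warning is short enough to check by hand. The sketch below assumes the segment tests are independent, which real segments rarely are, and includes the Bonferroni threshold purely as a familiar reference point; the text does not say ShipEasy applies any such correction.

```python
def false_positive_risk(num_segments, alpha=0.05):
    """Risk of spurious 'wins' when every segment truly has zero effect."""
    expected_false_positives = num_segments * alpha
    # probability that at least one segment crosses the threshold by chance
    chance_of_at_least_one = 1 - (1 - alpha) ** num_segments
    # Bonferroni: per-segment threshold that keeps the overall risk near alpha
    bonferroni_threshold = alpha / num_segments
    return expected_false_positives, chance_of_at_least_one, bonferroni_threshold


expected, at_least_one, threshold = false_positive_risk(20)
```

With 20 segments the expected count is one false positive, and the chance of seeing at least one is roughly 64%, which is why an unplanned significant segment deserves a follow-up experiment rather than a launch.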
Sample size and power
You can't detect a 1% lift with 100 users. ShipEasy shows you a power estimate before you start the experiment, given your traffic and the metric's historical variance — so you know roughly how long the test will need to run.
Quick rules of thumb (80% power, α = 0.05, two-sided, normal approximation for a two-proportion test):
| Baseline conversion | Detectable lift @ 80% power | Users per arm |
|---|---|---|
| 1% | +10% relative | ~163,000 |
| 5% | +5% relative | ~122,000 |
| 5% | +10% relative | ~31,000 |
| 20% | +5% relative | ~25,600 |
| 20% | +10% relative | ~6,500 |
| 50% | +5% relative | ~6,300 |
| 50% | +10% relative | ~1,600 |
If your traffic is low, increase your effect-size requirement (don't bother shipping for 0.5% lift) or run the experiment longer. The platform will not stop you from running an underpowered test, but the dashboard will mark it insufficient_power and warn that the absence of a significant result doesn't mean the absence of an effect.
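For a back-of-the-envelope check before launching, the usual normal-approximation formula for comparing two proportions can be sketched like this. The function name is hypothetical, and the platform's own power estimate may use a different variance estimate or test, so treat the output as a rough guide.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in each arm to detect a relative lift.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


needed_low_baseline = users_per_arm(baseline=0.05, relative_lift=0.10)
needed_high_baseline = users_per_arm(baseline=0.20, relative_lift=0.10)
```

Two patterns are worth noticing: halving the lift you want to detect roughly quadruples the users required, and a low baseline rate is far more expensive than a high one for the same relative lift.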
MDE (minimum detectable effect)
The flip side of "how many users do I need" is "given the users I have, what's the smallest lift I can reliably detect?" The dashboard surfaces an MDE alongside every running experiment, recomputed daily. If your MDE is +8% and the lift you're hoping to see is +2%, you're going to need to run the experiment for substantially longer — or accept that you won't be able to tell.
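Inverting the sample-size formula gives a rough MDE for a conversion metric. This sketch assumes both arms share the baseline variance p(1-p), which is a common simplification; the function name is hypothetical and the dashboard's figure may be computed differently.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(baseline, users_per_arm, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the given users per arm."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # standard error of the difference between two arms of equal size
    standard_error = sqrt(2 * baseline * (1 - baseline) / users_per_arm)
    absolute_mde = z * standard_error
    return absolute_mde / baseline


mde_now = relative_mde(baseline=0.05, users_per_arm=31_000)
mde_with_4x_traffic = relative_mde(baseline=0.05, users_per_arm=124_000)
```

Because the MDE shrinks with the square root of the sample, quadrupling the traffic only halves it, which is why closing the gap between an +8% MDE and a hoped-for +2% lift takes roughly sixteen times the users.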
API · metrics.create
- … ratio, use numerator_event + denominator_event instead.
- … sum and mean.
- … sum/mean. Default p99.
- … include (treats them as 0).
- … up.
Where to next
Events→
What goes into a metric, and how to log it.
How analysis works→
From raw events to lift, p-value, CI.
User attributes→
Pass enough about the user that segmentation is rich.
Wire real events.
A metric is just an aggregation rule. The data that feeds it is your tracking calls — and there are exactly three things you need to get right.