ShipEasy

Metrics

Define what you measure — primary metrics, guardrails, outliers, segmentation, and how to keep results honest.

Production ready · 9 min read · Updated May 3, 2026 · Works with Server SDK

A metric is a number computed per user from a set of events. Every experiment has at least one primary metric (the thing you're trying to move) and ideally a couple of guardrails (things you don't want to break). Get these two choices right and most of your experimentation problems go away on their own.

Aggregation types

ShipEasy ships four single-event aggregation functions (ratio metrics, which combine two events, are covered further down). Each one collapses a user's event stream into a single number per analysis window:

| Type | What it computes per user | Use it for | Variance |
| --- | --- | --- | --- |
| conversion | 1 if the event happened at least once, else 0 | Did they buy? Did they retain? | Bounded: p(1-p), the friendliest. |
| count | Number of events | Sessions, page views, clicks. | Long-tailed; outliers possible. |
| sum | Sum of a numeric event property | Revenue, time spent, items added. | Heavy-tailed; outliers a real problem. |
| mean | Average of a numeric event property | Order value, session length. | Same as sum, plus zero-handling. |

Conversion is the simplest and statistically the friendliest: the per-user variance is p(1-p), which never exceeds 0.25, so power calculations are cheap and the t-test behaves. Means and sums need more samples and benefit from outlier handling (see below).
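To make the four aggregations concrete, here is a minimal Python sketch of how one user's events could collapse into a single number. The event shape (a dict with `name` and `properties`) and the `aggregate` helper are assumptions made for illustration, not the platform's implementation.

```python
from typing import Optional

# Hypothetical event shape: {"name": str, "properties": dict}.
def aggregate(events: list[dict], kind: str, event: str,
              prop: Optional[str] = None) -> Optional[float]:
    """Collapse one user's events into a single number for the analysis window."""
    matching = [e for e in events if e["name"] == event]
    if kind == "conversion":
        return 1.0 if matching else 0.0
    if kind == "count":
        return float(len(matching))
    values = [float(e["properties"][prop]) for e in matching]
    if kind == "sum":
        return sum(values)
    if kind == "mean":
        # None marks "no matching event"; zero-handling decides what happens next.
        return sum(values) / len(values) if values else None
    raise ValueError(f"unknown aggregation type: {kind}")

user_events = [
    {"name": "purchase", "properties": {"value": 20.0}},
    {"name": "purchase", "properties": {"value": 40.0}},
    {"name": "session_start", "properties": {}},
]
print(aggregate(user_events, "conversion", "purchase"))     # 1.0
print(aggregate(user_events, "count", "purchase"))          # 2.0
print(aggregate(user_events, "sum", "purchase", "value"))   # 60.0
print(aggregate(user_events, "mean", "purchase", "value"))  # 30.0
```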

When the metric is "no event"

For conversion, a user with zero matching events contributes a 0. For mean, you have a choice: include zero-event users at 0, or drop them entirely. The default is include ("average revenue across all exposed users, including non-buyers"), because that's the question business stakeholders almost always actually ask. You can flip it with --zero-handling drop when you want "average order value among buyers".
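The difference between the two settings is easiest to see on a tiny example. The sketch below assumes per-user values where `None` stands for "exposed, but no matching event"; `group_mean` is a hypothetical helper, not a ShipEasy function.

```python
from statistics import fmean
from typing import Optional

def group_mean(per_user: list[Optional[float]], zero_handling: str = "include") -> float:
    """Average per-user values; None means the user had no matching event."""
    if zero_handling == "include":
        values = [0.0 if v is None else v for v in per_user]
    elif zero_handling == "drop":
        values = [v for v in per_user if v is not None]
    else:
        raise ValueError(f"unknown zero handling: {zero_handling}")
    return fmean(values)

# Four exposed users, two of whom bought.
per_user_revenue = [50.0, None, 30.0, None]
print(group_mean(per_user_revenue, "include"))  # 20.0 (all exposed users)
print(group_mean(per_user_revenue, "drop"))     # 40.0 (buyers only)
```

The same data answers two different questions, so the two numbers should never be compared with each other.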

Creating a metric

# binary conversion on `purchase`
shipeasy metrics create purchase_conversion --type conversion --event purchase

# revenue per user (includes non-buyers as $0)
shipeasy metrics create revenue_per_user --type sum --event purchase --property value

# average order value (across buyers only)
shipeasy metrics create avg_order_value --type mean --event purchase --property value --zero-handling drop

# sessions per user
shipeasy metrics create sessions --type count --event session_start

Or in the dashboard: Experiments → Metrics → New metric.

Filtering events into a metric

You can tighten what counts toward a metric with filters — same shape as gate targeting rules. Filters run against the event's properties payload before the aggregation:

shipeasy metrics create paid_purchase \
  --type conversion --event purchase \
  --filter '[{"attr":"channel","op":"eq","value":"organic"}]'

Now paid_purchase only counts purchase events whose channel property on the event payload equals "organic". Multiple filters combine with an implicit AND. Common shapes:

# Web purchases only (exclude mobile app)
--filter '[{"attr":"platform","op":"eq","value":"web"}]'

# Orders of at least $10 (exclude $0 promo redemptions)
--filter '[{"attr":"value","op":"gte","value":10}]'

# Multiple conditions
--filter '[
  {"attr":"platform","op":"eq","value":"web"},
  {"attr":"country","op":"in","value":["US","CA","GB"]}
]'
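As a rough mental model of the implicit AND, the following Python sketch evaluates a rule list against one event's properties. It covers only the three operators used in the examples above (eq, gte, in), and treating a missing attribute as a non-match is an assumption; the platform's actual evaluator may behave differently.

```python
import json

# Only the operators that appear in the examples above are handled here.
OPS = {
    "eq": lambda actual, expected: actual == expected,
    "gte": lambda actual, expected: actual is not None and actual >= expected,
    "in": lambda actual, expected: actual in expected,
}

def event_matches(properties: dict, rules: list[dict]) -> bool:
    """True only if every rule passes (implicit AND)."""
    return all(
        OPS[rule["op"]](properties.get(rule["attr"]), rule["value"])
        for rule in rules
    )

rules = json.loads("""[
  {"attr": "platform", "op": "eq", "value": "web"},
  {"attr": "country", "op": "in", "value": ["US", "CA", "GB"]}
]""")

print(event_matches({"platform": "web", "country": "CA"}, rules))  # True
print(event_matches({"platform": "ios", "country": "CA"}, rules))  # False
print(event_matches({"platform": "web"}, rules))                   # False (country missing)
```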

Primary vs guardrail

When you attach metrics to an experiment, you mark each one as primary or guardrail.

  • Primary: the metric you're trying to move. The experiment's "did it win?" decision is read off these. Pick one or two; if you have five primaries you have none.
  • Guardrail: metrics that must not regress. A new checkout flow had better not tank page load time. ShipEasy flags any guardrail that moves significantly in the wrong direction, even if your primary won. By default, the dashboard treats a guardrail as failed when p < 0.05 and the lift is in the bad direction.

A common pattern for a checkout experiment:

| Role | Metric | Direction |
| --- | --- | --- |
| Primary | purchase_conversion | up |
| Guardrail | error_rate | down |
| Guardrail | p95_page_load_ms | down |
| Guardrail | support_ticket_rate (computed offline) | down |

The primary tells you "did the change work?"; the guardrails tell you "did it work without breaking something else?". A win that ships is one where the primary moves and no guardrail regresses.
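The decision rule can be sketched in a few lines of Python. The `MetricResult` type and `ship_decision` helper are hypothetical, the numbers are illustrative rather than real results, and requiring every primary to win is an assumption (the text does not say whether one winning primary is enough).

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    role: str        # "primary" or "guardrail"
    direction: str   # which way is good: "up" or "down"
    lift: float      # relative lift of variant over control, e.g. 0.083 for +8.3%
    p_value: float

def moved(result: MetricResult, toward_good: bool, alpha: float = 0.05) -> bool:
    """True if the metric moved significantly in the good (or bad) direction."""
    if result.lift == 0:
        return False
    went_up = result.lift > 0
    went_good = went_up == (result.direction == "up")
    return result.p_value < alpha and went_good == toward_good

def ship_decision(results: list[MetricResult]) -> tuple[bool, list[str]]:
    """Return (ship?, names of failed guardrails)."""
    primaries = [r for r in results if r.role == "primary"]
    failed = [r.name for r in results
              if r.role == "guardrail" and moved(r, toward_good=False)]
    # Assumption: every primary must win for the experiment to count as a win.
    primary_won = bool(primaries) and all(moved(r, toward_good=True) for r in primaries)
    return primary_won and not failed, failed

# Illustrative numbers only.
results = [
    MetricResult("purchase_conversion", "primary", "up", 0.083, 0.018),
    MetricResult("error_rate", "guardrail", "down", 0.010, 0.400),
    MetricResult("p95_page_load_ms", "guardrail", "down", 0.060, 0.010),
]
print(ship_decision(results))  # (False, ['p95_page_load_ms'])
```

Here the primary wins, but page load time rises significantly, so the guardrail blocks the ship.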

Outliers

For sum and mean, a single $50,000 enterprise purchase can swing the mean for thousands of users. ShipEasy supports two outlier handlers per metric:

  • Winsorise at a configurable percentile (default p99). Anything above is clamped to the p99 value of the combined sample. Default for sum and mean.
  • Cap at an absolute value. Use this when there's a domain-specific ceiling (e.g. a maximum plausible session length).

# Winsorise at p99 (default)
shipeasy metrics create revenue_per_user --type sum --event purchase --property value \
  --winsorise p99

# Hard cap at $1000
shipeasy metrics create revenue_per_user --type sum --event purchase --property value \
  --cap 1000

Winsorising is the default and is rarely wrong. The trade-off: clamping reduces variance (good — narrower CI, easier to detect lift) at the cost of slightly understating the true effect when the variant genuinely moves the tail (rare).

Outlier handling is symmetric across groups

The p99 cutoff is computed from the combined control + variant sample, then applied to both. This prevents the bug where one variant accidentally has its outliers preserved and looks artificially better.
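A minimal Python sketch of pooled winsorising, assuming a nearest-rank percentile (the platform's exact percentile definition is not documented here) and upper-tail clamping only. Both helper names are hypothetical.

```python
from math import ceil

def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def winsorise_pooled(control: list[float], variant: list[float], pct: float = 99.0):
    """Compute one cutoff from the combined sample, then clamp both groups to it."""
    cutoff = nearest_rank_percentile(control + variant, pct)
    return ([min(x, cutoff) for x in control],
            [min(x, cutoff) for x in variant],
            cutoff)

# One $50,000 purchase sits in control; everything else is small.
control = [10.0] * 99 + [50_000.0]
variant = [12.0] * 100
control_w, variant_w, cutoff = winsorise_pooled(control, variant)

print(cutoff)                           # 12.0
print(sum(control) / len(control))      # 509.9 before clamping
print(sum(control_w) / len(control_w))  # 10.02 after clamping
```

Because the cutoff comes from the pooled sample, neither group gets a more generous ceiling than the other.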

Ratio metrics

Some questions are inherently ratios — "clicks per impression", "conversion per visit", "revenue per session". Define them with two events: a numerator and a denominator.

shipeasy metrics create click_through_rate \
  --type ratio \
  --numerator-event click \
  --denominator-event impression

Ratio metrics use the delta method to compute variance correctly. The naive approach, which treats every denominator event as an independent observation, understates variance and inflates significance because events from the same user are correlated. The dashboard shows the per-user numerator and denominator alongside the ratio so the math is auditable.
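For readers who want to audit the math, here is a sketch of the standard first-order delta-method standard error for a ratio of per-user means, Var(R) ≈ [Var(N) - 2R·Cov(N, D) + R²·Var(D)] / (n · mean(D)²). This is the textbook formula under the assumption that users are independent; it is not taken from ShipEasy's source, and the function name is hypothetical.

```python
from statistics import fmean

def ratio_with_delta_se(numerators: list[float], denominators: list[float]):
    """Ratio of means and its delta-method standard error, one entry per user."""
    n = len(numerators)
    mean_n, mean_d = fmean(numerators), fmean(denominators)
    ratio = mean_n / mean_d
    # Sample variances and covariance across users (n - 1 in the denominator).
    var_n = sum((x - mean_n) ** 2 for x in numerators) / (n - 1)
    var_d = sum((y - mean_d) ** 2 for y in denominators) / (n - 1)
    cov = sum((x - mean_n) * (y - mean_d)
              for x, y in zip(numerators, denominators)) / (n - 1)
    var_ratio = (var_n - 2 * ratio * cov + ratio ** 2 * var_d) / (n * mean_d ** 2)
    return ratio, max(var_ratio, 0.0) ** 0.5

# Per-user clicks and impressions (made-up data).
clicks = [1.0, 0.0, 3.0, 2.0]
impressions = [10.0, 5.0, 30.0, 20.0]
ctr, se = ratio_with_delta_se(clicks, impressions)
print(f"CTR = {ctr:.4f}, SE = {se:.4f}")
```

The covariance term is what the naive calculation drops: users with more impressions tend to have more clicks, and ignoring that correlation misstates the uncertainty.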

Segmentation

Once results are in, slice them by user attribute:

purchase_conversion          control  v1     lift     p
  all                          4.8%   5.2%   +8.3%   0.018
  country = US                 5.1%   5.4%   +5.9%   0.041
  country = EU                 4.5%   4.9%   +8.9%   0.062
  plan = pro                   7.8%   9.4%   +20.5%  0.001
  plan = free                  3.9%   4.0%   +2.6%   0.401
  device = mobile              4.2%   4.9%  +16.7%   0.006
  device = desktop             5.6%   5.6%   +0.0%   0.974

Segmentation works on any attribute you register on the project. The platform stores attributes_at_exposure on the $exposure event, so segmentation is on the user's state at the moment they entered the experiment — a user who upgraded from free to pro mid-experiment stays in the free segment. This is the right behaviour: it's the only way segmentation is causal.

You don't need to ask for a segment up front — the platform recomputes them on demand from the existing exposure + event data, no re-bucketing required.

Beware HARKing

Hypothesising After Results are Known — combing through segments until you find one that's significant — gives false wins. With 20 segments and p < 0.05, you expect 1 false positive by pure chance. Pre-register the one or two segments you actually care about. Treat anything else as exploratory and label it as such on the dashboard.
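The arithmetic behind that warning, as a short Python sketch. It assumes no segment has a true effect and that segment tests are independent, which overlapping segments are not, so treat the second figure as a rough guide.

```python
def expected_false_positives(num_segments: int, alpha: float = 0.05) -> float:
    """Expected number of segments that look significant by chance alone."""
    return num_segments * alpha

def chance_of_any_false_positive(num_segments: int, alpha: float = 0.05) -> float:
    """Probability that at least one segment looks significant by chance."""
    return 1 - (1 - alpha) ** num_segments

print(expected_false_positives(20))                # 1.0
print(round(chance_of_any_false_positive(20), 2))  # 0.64
print(round(chance_of_any_false_positive(2), 2))   # 0.1
```

With 20 segments there is roughly a two-in-three chance of at least one spurious "win"; with two pre-registered segments it drops to about one in ten.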

Sample size and power

You can't detect a 1% lift with 100 users. ShipEasy shows you a power estimate before you start the experiment, given your traffic and the metric's historical variance — so you know roughly how long the test will need to run.

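Quick rules of thumb (80% power, α = 0.05, two-sided; normal approximation with the variance taken at the baseline rate):

| Baseline conversion | Detectable lift @ 80% power | Users per arm |
| --- | --- | --- |
| 1% | +10% relative | ~155,000 |
| 5% | +5% relative | ~119,000 |
| 5% | +10% relative | ~30,000 |
| 20% | +5% relative | ~25,000 |
| 20% | +10% relative | ~6,300 |
| 50% | +5% relative | ~6,300 |
| 50% | +10% relative | ~1,600 |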

If your traffic is low, increase your effect-size requirement (don't bother shipping for 0.5% lift) or run the experiment longer. The platform will not stop you from running an underpowered test, but the dashboard will mark it insufficient_power and warn that the absence of a significant result doesn't mean the absence of an effect.
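For a back-of-the-envelope check before you open the dashboard, the textbook normal-approximation formula for a two-arm conversion test is n ≈ 2 · (z_{α/2} + z_β)² · p(1-p) / δ², where δ is the absolute difference to detect. The Python sketch below implements it with the variance taken at the baseline rate; the platform's own power estimate uses your metric's historical variance and may differ. `users_per_arm` is a hypothetical helper.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in each arm to detect a relative lift in a conversion rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    delta = baseline * relative_lift               # absolute difference to detect
    variance = baseline * (1 - baseline)           # per-user variance at the baseline rate
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)

for baseline, lift in [(0.01, 0.10), (0.05, 0.10), (0.20, 0.10), (0.50, 0.10)]:
    n = users_per_arm(baseline, lift)
    print(f"{baseline:.0%} baseline, +{lift:.0%} relative: {n:,} users per arm")
```

Two things fall out of the formula: halving the lift you want to detect roughly quadruples the sample, and low baseline rates are far more expensive than high ones.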

MDE (minimum detectable effect)

The flip side of "how many users do I need" is "given the users I have, what's the smallest lift I can reliably detect?" The dashboard surfaces an MDE alongside every running experiment, recomputed daily. If your MDE is +8% and the lift you're hoping to see is +2%, you're going to need to run the experiment for substantially longer — or accept that you won't be able to tell.
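Inverting the usual sample-size approximation gives a rough MDE for a conversion metric: δ ≈ (z_{α/2} + z_β) · sqrt(2 · p(1-p) / n). The sketch below is that textbook formula in Python, assuming equal arm sizes and a two-sided test; it is not the dashboard's exact computation, and `relative_mde` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def relative_mde(baseline: float, users_per_arm: int,
                 alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest relative lift detectable with the given users per arm."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    absolute = z * sqrt(2 * baseline * (1 - baseline) / users_per_arm)
    return absolute / baseline

for n in (10_000, 40_000, 160_000):
    print(f"{n:>7,} users per arm: MDE = +{relative_mde(0.05, n):.1%} relative")
```

Each fourfold increase in users only halves the MDE, which is why going from +8% to +2% takes roughly sixteen times the traffic.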

API · metrics.create

Field · Type
    Description

name (required) · string
    Stable identifier. Used in result rows and the CLI.

type (required) · "conversion" | "count" | "sum" | "mean" | "ratio"
    Aggregation function applied per user.

event (required) · string
    Event name to aggregate. For ratio, use numerator_event + denominator_event instead.

property · string ?
    Numeric property on the event to aggregate. Required for sum and mean.

filter · Rule[] ?
    Same shape as a gate rule. Only events matching all rules are aggregated.

winsorise · "p95" | "p99" | "p99.9" ?
    Percentile clip for sum/mean. Default p99.

cap · number ?
    Absolute clip. Mutually exclusive with winsorise.

zero_handling · "include" | "drop" ?
    How to count exposed users with no matching event. Default include (treats them as 0).

direction · "up" | "down" ?
    Which direction is "good". Used to colour result rows and trigger guardrail alerts. Defaults to up.
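The cross-field rules in this reference (property required for sum and mean, cap and winsorise mutually exclusive, ratio using two events) are easy to get wrong. The Python sketch below checks a metric definition against them before you send it. It is a local, hypothetical `validate_metric` helper built only from the fields listed above, not the platform's own validation.

```python
TYPES = {"conversion", "count", "sum", "mean", "ratio"}
WINSORISE = {"p95", "p99", "p99.9"}

def validate_metric(defn: dict) -> list[str]:
    """Return a list of problems; an empty list means the definition looks consistent."""
    errors = []
    if not defn.get("name"):
        errors.append("name is required")
    kind = defn.get("type")
    if kind not in TYPES:
        errors.append(f"type must be one of {sorted(TYPES)}")
    if kind == "ratio":
        # Ratio metrics name two events instead of one.
        for field in ("numerator_event", "denominator_event"):
            if not defn.get(field):
                errors.append(f"{field} is required for ratio")
    elif not defn.get("event"):
        errors.append("event is required")
    if kind in ("sum", "mean") and not defn.get("property"):
        errors.append("property is required for sum and mean")
    if "winsorise" in defn and "cap" in defn:
        errors.append("winsorise and cap are mutually exclusive")
    if "winsorise" in defn and defn["winsorise"] not in WINSORISE:
        errors.append(f"winsorise must be one of {sorted(WINSORISE)}")
    if defn.get("zero_handling", "include") not in ("include", "drop"):
        errors.append("zero_handling must be include or drop")
    if defn.get("direction", "up") not in ("up", "down"):
        errors.append("direction must be up or down")
    return errors

good = {"name": "avg_order_value", "type": "mean", "event": "purchase",
        "property": "value", "zero_handling": "drop"}
bad = {"name": "revenue_per_user", "type": "sum", "event": "purchase",
       "winsorise": "p99", "cap": 1000}
print(validate_metric(good))  # []
print(validate_metric(bad))
```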

Where to next

Wire real events.

A metric is just an aggregation rule. The data that feeds it is your tracking calls — and there are exactly three things you need to get right.

Create a primary metric
$ shipeasy metrics create purchase_conversion --type conversion --event purchase

Add a guardrail
$ shipeasy metrics create error_rate --type conversion --event client_error --direction down