Experiments — A/B tests with stats built in
A/B and multivariate experiments with managed stats. Statistical analysis, holdouts, and segmentation come out of the box.
Was the new thing actually better?
Define a metric, ship a variant, get a defensible answer the next morning. Statistical analysis, lift, p-value, and 95% CI are written back to your project.
ShipEasy Experiments lets you ask "is the new thing actually better?" and get a defensible answer.
You define what you want to measure (a metric), set up variant payloads (the groups), let users land in one bucket or another, log conversions, and look at the dashboard the next morning. The platform handles assignment, exposure stitching, deduplication, daily aggregation, and the statistics.
The five building blocks
Universe
A logical namespace. Carries the holdout (a global control unaffected by every experiment in the universe) and an optional mutual-exclusion rule.
Experiment
A name, an allocation %, a list of groups with weights and params. Optionally a targeting gate.
Metrics
count, sum, mean, or conversion over an event type. Pick a primary and a couple of guardrails.
Events
Anything you track(). Stored in the events store and aggregated into metric values daily.
Analysis
Statistical analysis, lift, p-value, 95% CI. Computed once per day, persisted to your project, surfaced in the dashboard.
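To make the four metric kinds concrete, here is a minimal sketch of how count, sum, mean, and conversion could be computed for one group from raw events. The `TrackedEvent` shape, the `aggregate` function, and the choice of exposed users as the denominator for mean and conversion are assumptions made for illustration; they are not ShipEasy's documented definitions.

```typescript
// Hypothetical shapes for illustration; not ShipEasy's actual schema.
type MetricKind = "count" | "sum" | "mean" | "conversion";

interface TrackedEvent {
  userId: string;
  type: string;   // e.g. "purchase"
  value?: number; // e.g. order total
}

// Aggregate one metric for one group, given the users exposed to that group.
function aggregate(
  kind: MetricKind,
  eventType: string,
  exposedUsers: Set<string>,
  events: TrackedEvent[],
): number {
  // Keep only events of the right type from users who were actually exposed.
  const relevant = events.filter(
    (e) => e.type === eventType && exposedUsers.has(e.userId),
  );
  const total = relevant.reduce((acc, e) => acc + (e.value ?? 0), 0);

  switch (kind) {
    case "count":
      return relevant.length;
    case "sum":
      return total;
    case "mean":
      // Assumed definition: average value per exposed user.
      return exposedUsers.size === 0 ? 0 : total / exposedUsers.size;
    case "conversion": {
      // Share of exposed users with at least one matching event.
      const converters = new Set(relevant.map((e) => e.userId));
      return exposedUsers.size === 0 ? 0 : converters.size / exposedUsers.size;
    }
  }
}

const exposedSample = new Set(["u1", "u2", "u3", "u4"]);
const sampleEvents: TrackedEvent[] = [
  { userId: "u1", type: "purchase", value: 30 },
  { userId: "u1", type: "purchase", value: 10 },
  { userId: "u2", type: "purchase", value: 20 },
  { userId: "u9", type: "purchase", value: 99 }, // never exposed, ignored
  { userId: "u3", type: "page_view" },
];

const purchaseConversion = aggregate("conversion", "purchase", exposedSample, sampleEvents);
```

Note that a user who purchases twice adds two to count but only one to conversion, which is why conversion is usually the safer primary metric for "did more people buy?" questions.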
The 5-minute mental model
You ship feature X behind a flag. You believe it'll lift purchase_conversion. So instead of just gating it, you make it an experiment: half the eligible users see X, half don't. Both groups are tracked. The next day, you look at the dashboard and read either "X lifts conversion by 3.1% (p=0.002)" — or "no significant difference, scrap it."
The honest version requires more care than that summary, and the docs walk through it.
When to gate vs. when to experiment
| Situation | Gate | Experiment |
|---|---|---|
| Is the change reversible? | ✓ | ✓ |
| Does it have a known clear win (security fix, bug)? | ✓ | – |
| Do you need an answer to "did it work"? | – | ✓ |
| Cheap to ship, expensive to undo (data migration)? | ✓ | – |
| Affects a metric you're paid to move? | – | ✓ |
A practical workflow is to gate first (so the rollout is safe), then promote the gate to an experiment once the feature is stable.
The 5-minute path
From metric to result
~5 minutes setup · 24h to first stats
Define what success looks like
One number. A primary, plus a couple of guardrails (latency, error rate).
Two groups, equal weight
100% allocation, 50/50 split, two label variants. Targeting via a gate is optional.
assign + track
The first assign() call logs an exposure. Call flags.track() on conversion. Daily aggregation does the math.
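A 50/50 split at 100% allocation is commonly implemented with deterministic hashing, so the same user always lands in the same group. The sketch below assumes that approach. The hash, the `bucket` function, and the salt format are illustrative and are not ShipEasy's actual assignment algorithm.

```typescript
// Illustrative only: one common way to do sticky, weighted assignment.
interface Group {
  name: string;
  weight: number; // relative weight, e.g. 50 and 50
}

// FNV-1a style 32-bit hash over UTF-16 code units, mapped to [0, 1).
function hashToUnit(input: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 4294967296;
}

// Returns the group name, or null when the user falls outside the allocation.
function bucket(
  experiment: string,
  userId: string,
  allocationPct: number,
  groups: Group[],
): string | null {
  // Separate salts keep "is the user in?" independent of "which group?".
  if (hashToUnit(`${experiment}:alloc:${userId}`) * 100 >= allocationPct) {
    return null;
  }
  const totalWeight = groups.reduce((acc, g) => acc + g.weight, 0);
  let point = hashToUnit(`${experiment}:group:${userId}`) * totalWeight;
  for (const g of groups) {
    if (point < g.weight) return g.name;
    point -= g.weight;
  }
  // Guard against floating-point rounding at the upper edge.
  return groups[groups.length - 1]?.name ?? null;
}

const sampleGroups: Group[] = [
  { name: "control", weight: 50 },
  { name: "v1", weight: 50 },
];

// Same inputs always give the same group, with no stored state.
const assignedGroup = bucket("checkout-cta", "user-1", 100, sampleGroups);
```

Because the result depends only on the experiment name and the user id, any server can compute it without coordination.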
Wire the SDK
import { configureShipeasy, flags, experiments } from "@shipeasy/sdk/server";

configureShipeasy({ apiKey: process.env.SHIPEASY_SERVER_KEY! });
await flags.init();

const result = experiments.assign<{ label: string }>("checkout-cta", {
  user_id,
  plan,
  country,
});
const label = result.params.label; // "Pay" or "Buy now"

// Wherever the purchase succeeds:
flags.track(user_id, "purchase", { value: orderTotal });

The first time you call assign() for a user, an exposure is logged automatically. Subsequent calls in the same process don't re-log, because exposures are deduplicated.
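The in-process deduplication described above can be pictured as a set of already-seen (experiment, user) pairs. This is a minimal sketch of the idea, not the SDK's implementation; `ExposureLogger` and its `sink` callback are hypothetical names.

```typescript
// Hypothetical sketch: log an exposure at most once per (experiment, user)
// within the lifetime of this process.
class ExposureLogger {
  private seen = new Set<string>();
  private sink: (experiment: string, userId: string) => void;

  constructor(sink: (experiment: string, userId: string) => void) {
    this.sink = sink;
  }

  // Returns true if an exposure was logged, false if it was a duplicate.
  log(experiment: string, userId: string): boolean {
    const key = `${experiment}\u0000${userId}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.sink(experiment, userId);
    return true;
  }
}

const loggedExposures: string[] = [];
const exposureLogger = new ExposureLogger((exp, user) => {
  loggedExposures.push(`${exp}/${user}`);
});

exposureLogger.log("checkout-cta", "u1");
exposureLogger.log("checkout-cta", "u1"); // duplicate, not re-logged
exposureLogger.log("checkout-cta", "u2");
```

A set like this lives in memory, so it only covers a single process; repeats across processes would have to be handled by the platform-side deduplication mentioned earlier.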
API · experiments.assign
The second argument carries user_id. Add attributes used by the targeting gate (plan, country, …).
The return shape:
{
  inExperiment: boolean,    // false if stopped, holdout, or excluded
  group: string,            // "control" | "v1" | …
  params: T,                // typed payload from the assigned group
  reason: AssignmentReason  // "assigned" | "holdout" | "stopped" | "excluded"
}
What the daily analysis does
Cron enqueues a job per project
A scheduled ShipEasy trigger fans out one queue message per project that has running experiments.
Consumer scans events store
For each project, the consumer pulls yesterday's exposures and events from Analytics Engine, joins by user_id, and aggregates per metric × group.
Statistical analysis
Lift, two-sided p-value, 95% CI. Per metric, per group, vs. the control group.
Persist results
Results are persisted to your project and scoped to it. The dashboard reads them; you can export them.
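For a conversion metric, the lift, two-sided p-value, and 95% CI from the analysis step can be illustrated with a two-proportion z-test. The source does not say which test ShipEasy runs, so treat this as one standard option rather than the platform's method; `GroupStats`, `normalCdf`, and `compare` are hypothetical names.

```typescript
// Illustrative two-proportion z-test for a conversion metric.
interface GroupStats {
  users: number;       // exposed users
  conversions: number; // users who converted
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation (7.1.26).
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

function compare(control: GroupStats, variant: GroupStats) {
  const pc = control.conversions / control.users;
  const pv = variant.conversions / variant.users;
  const diff = pv - pc;

  // Pooled standard error for the hypothesis test (H0: no difference).
  const pooled =
    (control.conversions + variant.conversions) / (control.users + variant.users);
  const sePooled = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.users + 1 / variant.users),
  );
  const z = diff / sePooled;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));

  // Unpooled standard error for the 95% CI on the absolute difference.
  const se = Math.sqrt(
    (pc * (1 - pc)) / control.users + (pv * (1 - pv)) / variant.users,
  );
  const ci95: [number, number] = [diff - 1.96 * se, diff + 1.96 * se];

  return { relativeLift: diff / pc, diff, pValue, ci95 };
}

// Made-up numbers: 10.0% vs 11.0% conversion with 10,000 users per group.
const comparison = compare(
  { users: 10000, conversions: 1000 },
  { users: 10000, conversions: 1100 },
);
```

With these made-up inputs the relative lift is 10% and the interval on the absolute difference excludes zero, which is the shape of result the dashboard summarizes per metric and per group against control.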
The p-values are valid for fixed-horizon tests. If you sneak a look every hour and stop the moment something turns significant, you'll get false wins. Pre-decide the experiment's duration based on traffic.
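Pre-deciding the duration usually starts from a sample-size estimate. The sketch below uses the standard normal-approximation formula for comparing two proportions. The 5% significance level matches the 95% CI mentioned in the text; the 80% power target, the function names, and the traffic figures are assumptions for illustration.

```typescript
// Illustrative fixed-horizon planning for a conversion metric.
const Z_ALPHA = 1.96;  // two-sided alpha = 0.05
const Z_BETA = 0.8416; // power = 0.80 (assumed target)

// Users needed per group to detect a relative lift over a baseline rate.
function usersPerGroup(baselineRate: number, relativeLift: number): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const n = (Math.pow(Z_ALPHA + Z_BETA, 2) * variance) / Math.pow(p2 - p1, 2);
  return Math.ceil(n);
}

// Whole days needed, given how many eligible users enter the experiment daily.
function daysToRun(
  baselineRate: number,
  relativeLift: number,
  dailyEligibleUsers: number,
  groupCount: number = 2,
): number {
  const totalUsers = usersPerGroup(baselineRate, relativeLift) * groupCount;
  return Math.ceil(totalUsers / dailyEligibleUsers);
}

// Made-up inputs: 10% baseline, 10% relative lift, 2,000 eligible users a day.
const plannedUsersPerGroup = usersPerGroup(0.1, 0.1);
const plannedDays = daysToRun(0.1, 0.1, 2000);
```

With these inputs the estimate is a little under 15,000 users per group, or about 15 days at 2,000 eligible users a day. Fix that number before launch and read the result only once it has elapsed.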
Changing params on a running experiment invalidates the analysis. Stop the experiment, create a new one with v2 in the name.
Run your first experiment.
A complete walk-through: define the metric, create the experiment, wire two SDK calls, read the result the next morning.