A/B and multivariate experiments with managed stats. Statistical analysis, holdouts, and segmentation, out of the box.

ShipEasy · Experiments

Was the new thing actually better?

Define a metric, ship a variant, get a defensible answer the next morning. Statistical analysis with lift, p-value, and 95% CI, written back to your project.


Production ready · 6 min read · Updated May 3, 2026 · Works with Server SDK

ShipEasy Experiments lets you ask "is the new thing actually better?" and get a defensible answer.

You define what you want to measure (a metric), set up variant payloads (the groups), let users land in one bucket or another, log conversions, and look at the dashboard the next morning. The platform handles assignment, exposure stitching, deduplication, daily aggregation, and the statistics.

The five building blocks

◇

Universe→

A logical namespace. Carries the holdout (a global control unaffected by every experiment in the universe) and an optional mutual-exclusion rule.

holdout·mutual exclusion

◈

Experiment→

A name, an allocation %, a list of groups with weights and params. Optionally a targeting gate.

weighted groups·typed params

∑

Metrics→

Count, sum, mean, or conversion over an event type. Pick a primary metric and a couple of guardrails.

conversion·count · sum · mean

✎

Events→

Anything you track(). Stored in the events store and aggregated into metric values daily.

events store·fire-and-forget

∫

Analysis→

Statistical analysis: lift, p-value, 95% CI. Computed once per day, persisted to your project, surfaced in the dashboard.

Welch t-test·p · CI · lift
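To make the four metric types concrete, here is a minimal sketch of how each could be computed for one group from tracked events. The type names, the `computeMetric` helper, and the exact definitions (in particular, "mean" taken as sum divided by number of matching events, and "conversion" as the share of exposed users with at least one matching event) are assumptions for illustration, not ShipEasy's actual implementation.

```typescript
// Hypothetical shapes; field names are illustrative only.
type MetricType = "count" | "sum" | "mean" | "conversion";
type TrackedEvent = { user_id: string; type: string; value?: number };

// Compute one metric value for one group.
// Only events from users exposed in that group are considered.
function computeMetric(
  metric: MetricType,
  eventType: string,
  events: TrackedEvent[],
  exposedUsers: string[],
): number {
  const exposed = new Set(exposedUsers);
  const matching = events.filter(
    (e) => e.type === eventType && exposed.has(e.user_id),
  );
  const total = matching.reduce((sum, e) => sum + (e.value ?? 0), 0);

  if (metric === "count") return matching.length;
  if (metric === "sum") return total;
  if (metric === "mean") {
    // Assumption: mean is per matching event, not per user.
    return matching.length === 0 ? 0 : total / matching.length;
  }
  // conversion: share of exposed users with at least one matching event
  const converters = new Set(matching.map((e) => e.user_id));
  return exposed.size === 0 ? 0 : converters.size / exposed.size;
}
```

Events from users who were never exposed are ignored here, which mirrors the idea that results are computed per group of exposed users.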

The 5-minute mental model

You ship feature X behind a flag. You believe it'll lift purchase_conversion. So instead of just gating it, you make it an experiment: half the eligible users see X, half don't. Both groups are tracked. The next day, you look at the dashboard and read either "X lifts conversion by 3.1% (p=0.002)" — or "no significant difference, scrap it."

The honest version requires more care than that summary, and the docs walk through it.

When to gate vs. when to experiment

Situation	Gate	Experiment
Is the change reversible?	✓	✓
Does it have a known clear win (security fix, bug)?	✓	–
Do you need an answer to "did it work"?	–	✓
Cheap to ship, expensive to undo (data migration)?	✓	–
Affects a metric you're paid to move?	–	✓

A practical workflow is to gate first (so the rollout is safe), then promote the gate to an experiment once the feature is stable.

The 5-minute path

▶

From metric to result

~5 minutes setup · 24h to first stats

01 · METRIC

Define what success looks like

One number. A primary, plus a couple of guardrails (latency, error rate).

$ shipeasy metrics create purchase_conversion --type conversion --event purchase

02 · EXPERIMENT

Two groups, equal weight

100% allocation, 50/50 split, two label variants. Targeting via a gate is optional.

$ shipeasy experiments create checkout-cta

03 · WIRE

assign + track

The first assign() call logs an exposure. Call flags.track() on conversion. Daily aggregation does the math.

experiments.assign('checkout-cta', user)
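The experiment from step 02 (100% allocation, 50/50 split, two label variants) can be pictured as a plain config object. This is a sketch only: the interface names, field names, and the choice of "Pay" as the control label are assumptions for illustration and do not reflect ShipEasy's stored schema.

```typescript
// Hypothetical shapes; not ShipEasy's actual schema.
interface Group<T> {
  name: string;
  weight: number; // relative share of the allocated traffic
  params: T;      // typed payload returned to callers in this group
}

interface ExperimentConfig<T> {
  name: string;
  allocationPercent: number; // share of eligible users who enter the experiment
  groups: Group<T>[];
  targetingGate?: string;    // optional gate name
}

const checkoutCta: ExperimentConfig<{ label: string }> = {
  name: "checkout-cta",
  allocationPercent: 100,
  groups: [
    { name: "control", weight: 50, params: { label: "Pay" } },
    { name: "v1", weight: 50, params: { label: "Buy now" } },
  ],
};

// Sanity check: the fraction of eligible users each group should receive.
function trafficShares<T>(config: ExperimentConfig<T>): Record<string, number> {
  const totalWeight = config.groups.reduce((sum, g) => sum + g.weight, 0);
  const shares: Record<string, number> = {};
  for (const g of config.groups) {
    shares[g.name] = (config.allocationPercent / 100) * (g.weight / totalWeight);
  }
  return shares;
}
```

Calling `trafficShares(checkoutCta)` gives each group half of the eligible traffic, which is a quick way to confirm that weights and allocation say what you intended.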

Wire the SDK

$ npm install @shipeasy/sdk

import { configureShipeasy, flags, experiments } from "@shipeasy/sdk/server";

// Configure and initialize the SDK before the first assign() call.
configureShipeasy({ apiKey: process.env.SHIPEASY_SERVER_KEY! });
await flags.init();

// user_id, plan, and country come from your own request context.
const result = experiments.assign<{ label: string }>("checkout-cta", {
  user_id,
  plan,
  country,
});
const label = result.params.label; // "Pay" or "Buy now"

// Wherever the purchase succeeds:
flags.track(user_id, "purchase", { value: orderTotal });

The first time you call assign() for a user, an exposure is logged automatically. Subsequent calls in the same process don't re-log — exposures are deduplicated.
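The in-process deduplication described above can be pictured with a small sketch. The `ExposureLogger` class and its `logOnce` method are hypothetical names for illustration; the SDK's real internals are not documented here and may differ.

```typescript
// Hypothetical sketch of in-process exposure deduplication.
type Exposure = { experiment: string; userId: string };

class ExposureLogger {
  private seen = new Set<string>();
  readonly logged: Exposure[] = [];

  // Returns true if the exposure was newly logged, false if it was a duplicate.
  logOnce(experiment: string, userId: string): boolean {
    // JSON-encode the pair so that distinct pairs never collide on one key.
    const key = JSON.stringify([experiment, userId]);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.logged.push({ experiment, userId });
    return true;
  }
}
```

Because the set lives in memory, a sketch like this only deduplicates within one process, which matches the "same process" wording above.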

API · `experiments.assign`

Field	Type	Description
name (required)	string	The experiment name. Stable identifier, used in URLs and result rows.
user (required)	EvalContext	An object with at least user_id. Add attributes used by the targeting gate (plan, country, …).
defaultParams	T?	Returned when the experiment is stopped, the user is in the holdout, or assignment fails. Defaults to the control group's params.

The return shape:

{
  inExperiment: boolean,   // false if stopped, holdout, or excluded
  group: string,           // "control" | "v1" | …
  params: T,               // typed payload from the assigned group
  reason: AssignmentReason // "assigned" | "holdout" | "stopped" | "excluded"
}
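One way to consume this shape is to branch on inExperiment first and fall back to reason for everything else. The sketch below redeclares the documented fields locally so it stands alone; `AssignResult` and `describeAssignment` are hypothetical names, not SDK exports.

```typescript
// Local copies of the documented return shape, so the sketch is self-contained.
type AssignmentReason = "assigned" | "holdout" | "stopped" | "excluded";

interface AssignResult<T> {
  inExperiment: boolean;
  group: string;
  params: T;
  reason: AssignmentReason;
}

// Hypothetical helper: turn an assignment into a short log line.
function describeAssignment<T>(result: AssignResult<T>): string {
  if (result.inExperiment) {
    return `in experiment, group=${result.group}`;
  }
  switch (result.reason) {
    case "holdout":
      return "not in experiment: user is in the holdout";
    case "stopped":
      return "not in experiment: experiment is stopped";
    case "excluded":
      return "not in experiment: user is excluded";
    default:
      return `not in experiment: ${result.reason}`;
  }
}
```

Either way, params is always populated, so rendering code can read result.params without checking inExperiment.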

What the daily analysis does

Cron enqueues a job per project

A scheduled trigger on ShipEasy fans out one queue message per project that has running experiments.

Consumer scans events store

For each project, the consumer pulls yesterday's exposures and events from Analytics Engine, joins by user_id, and aggregates per metric × group.

Statistical analysis

Lift, two-sided p-value, 95% CI. Per metric, per group, vs. the control group.

Persist results

Results are persisted to your project and scoped to it. The dashboard reads them; you can export them.
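The statistics step above (lift, two-sided p-value, 95% CI versus control, with a Welch t-test) can be sketched from per-group summaries. This is an illustration under stated assumptions, not the platform's code: the `GroupSummary` and `welch` names are hypothetical, the CI is on the absolute difference in means, and the p-value and CI use a normal approximation to the t distribution, which is only reasonable when the Welch degrees of freedom are large.

```typescript
// Hypothetical per-group summary produced by daily aggregation.
interface GroupSummary {
  n: number;        // users in the group
  mean: number;     // per-user metric mean
  variance: number; // per-user sample variance
}

interface WelchResult {
  lift: number;           // relative change vs. control
  diff: number;           // absolute difference in means
  tStat: number;
  df: number;             // Welch-Satterthwaite degrees of freedom
  pValue: number;         // two-sided, normal approximation
  ci95: [number, number]; // on diff, normal approximation
}

// Abramowitz-Stegun 7.1.26 approximation of erf (max error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

function welch(control: GroupSummary, treatment: GroupSummary): WelchResult {
  const a = treatment.variance / treatment.n;
  const b = control.variance / control.n;
  const se = Math.sqrt(a + b); // unpooled standard error
  const diff = treatment.mean - control.mean;
  const tStat = diff / se;
  const df = (a + b) ** 2 / (a ** 2 / (treatment.n - 1) + b ** 2 / (control.n - 1));
  const pValue = 2 * (1 - normalCdf(Math.abs(tStat)));
  return {
    lift: diff / control.mean,
    diff,
    tStat,
    df,
    pValue,
    ci95: [diff - 1.96 * se, diff + 1.96 * se],
  };
}
```

For a conversion metric, each user contributes 0 or 1, so the per-user variance is approximately rate × (1 − rate). With small groups, an exact t-distribution quantile would replace the 1.96 and the normal CDF used here.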

Don't peek and stop early

The p-values are valid for fixed-horizon tests. If you sneak a look every hour and stop the moment something turns significant, you'll get false wins. Pre-decide the experiment's duration based on traffic.
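As a rough aid for pre-deciding duration, the sketch below estimates how many users each group needs for a conversion metric, then converts that into days of traffic. It uses the standard two-proportion sample-size approximation with a two-sided alpha of 0.05 and 80% power; those settings, the function names, and the assumption that every day brings a fresh set of eligible users are illustrative choices, not ShipEasy defaults.

```typescript
// Standard normal quantiles for two-sided alpha = 0.05 and power = 0.80.
const Z_ALPHA = 1.96;
const Z_POWER = 0.8416;

// Users needed per group to detect a given relative lift on a conversion rate.
function usersPerGroup(baselineRate: number, relativeLift: number): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((Z_ALPHA + Z_POWER) ** 2 * variance) / delta ** 2);
}

// Days to run, assuming eligibleUsersPerDay new users enter the experiment daily.
function daysToRun(perGroup: number, groups: number, eligibleUsersPerDay: number): number {
  return Math.ceil((perGroup * groups) / eligibleUsersPerDay);
}
```

For example, `daysToRun(usersPerGroup(0.10, 0.10), 2, 5000)` estimates the run length for detecting a 10% relative lift on a 10% baseline with two groups and 5,000 new eligible users per day. Smaller lifts need far more users, since the required sample grows with the inverse square of the difference.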

Don't change variants mid-flight

Changing params on a running experiment invalidates the analysis. Stop the experiment and create a new one with v2 in the name.

▲ READY?

Run your first experiment.

A complete walk-through: define the metric, create the experiment, wire two SDK calls, read the result the next morning.
