Quickstart
Run your first A/B test from scratch — define a metric, create the experiment, ship variants, read results.
This walks through a complete experiment end to end: a button-label test on the checkout CTA. We'll set up a conversion metric, create the experiment with two variants at 50/50, instrument the SDK, and read results.
Your first experiment
~5 min setup · 24h to first statsDefine the conversion
Two groups, equal weight
assign + track
Prerequisites
- The SDK installed and configured. See Install and How it works.
- A metric in mind. Here: did the user
purchaseafter seeing checkout?
Define the conversion metric
A metric wraps an event type and an aggregation function. For "did this user buy?" we use conversion over the purchase event.
shipeasy metrics create purchase_conversion \
--type conversion \
--event purchaseOr in the dashboard: Experiments → Metrics → New metric.
Tip: also add guardrails — secondary metrics that should not regress. For checkout, page load time and error rate are good guardrails.
Create the experiment
Two groups, equal weight, one parameter (label):
shipeasy experiments create checkout-cta \
--allocation 100 \
--groups '[
{"name":"control","weight":50,"params":{"label":"Pay"}},
{"name":"v1","weight":50,"params":{"label":"Buy now"}}
]' \
--params '{"label":"string"}'--allocation 100 means 100% of eligible users are in the experiment, split 50/50. Drop allocation to e.g. 20 if you want only a 20% sample of traffic to see any variant — the rest get neither, useful for a soft launch.
Attach purchase_conversion as the primary metric in the dashboard.
Wire the SDK
import { configureShipeasy, flags, experiments } from "@shipeasy/sdk/server";
configureShipeasy({ apiKey: process.env.SHIPEASY_SERVER_KEY! });
await flags.init();Then on the request path:
const result = experiments.assign<{ label: string }>("checkout-cta", {
user_id, plan, country,
});
const label = result.params.label; // "Pay" or "Buy now"The first time you call assign() for a user, an exposure event is logged automatically. Subsequent calls in the same process don't re-log — exposures are deduplicated.
Track the conversion
Wherever the purchase succeeds:
flags.track(user_id, "purchase", { value: orderTotal });The event lands in events store. The daily analysis job scans events and joins them against exposures by user_id.
Start the experiment
Until you start it, assign() returns inExperiment: false and the control params for everyone — safe.
shipeasy experiments start checkout-ctaOr click Start in the dashboard.
Read the results
Daily, the analysis cron computes per-metric, per-group results — lift, p-value, 95% CI — and writes them back to your project. Read them in the dashboard, or:
shipeasy experiments status checkout-ctaA typical result row:
purchase_conversion
control N=12,418 rate=4.8%
v1 N=12,503 rate=5.2% lift +8.3% p=0.018 CI [+1.4%, +15.2%]p < 0.05 and the CI excluding zero is your stop-condition. Promote v1 by adding a gate that mirrors the variant, then deleting the experiment.
Stop & clean up
shipeasy experiments stop checkout-ctaStopping is safe at any time. Stopped experiments keep their config and results; they just stop assigning.
Anti-patterns to avoid
The p-values are valid for fixed-horizon tests. If you sneak a look every hour and stop the moment something turns significant, you'll get false wins. Pre-decide the experiment's duration based on traffic.
Changing params on a running experiment invalidates the analysis. Stop the experiment, create a
new one with v2 in the name.
Two experiments touching the same checkout flow can confound each other. That's what universes and mutual exclusion are for.
Where to next
Universes & holdouts→
When you have more than one experiment in flight on the same surface.
Metrics, deeper→
Counts, means, ratios. Primary vs guardrail. Outlier handling.
How analysis works→
What the platform actually does to those numbers.
Events→
What you should track and what you definitely shouldn't.
Promote a gate to an experiment.
If a feature is already behind a gate at 50%, you're one CLI command away from collecting stats on it instead of guessing whether it works.