Universes & holdouts
Group experiments to keep them from stepping on each other, and reserve a clean slice of users as a global control.
A universe is the world an experiment lives in. Every experiment belongs to exactly one universe, and the universe owns two of the most important guarantees the platform offers:
- A holdout: a fixed percentage of users excluded from every experiment in the universe. This becomes a clean global control you can compare aggregate behaviour against.
- An optional mutual exclusion rule: each user is assigned to at most one experiment in the universe, so two experiments touching the same surface can never confound each other.
Most projects start with a single default universe and never need another. You add universes when experiments start interfering with each other, or when you want a long-term holdout to measure the cumulative effect of all the experimentation you're doing.
The mental model
Think of a universe as a population: every user is sorted into one of three buckets the moment they enter.
- Holdout: Pinned to the legacy code path forever. Never sees an experiment in this universe.
- In an experiment: Bucketed into one running experiment based on a stable per-user hash and the experiments' allocation weights.
- Untargeted / unallocated: Either ineligible (failed the universe's targeting gate) or in the leftover slice of unallocated traffic. Sees the default code path.
Bucketing is deterministic: the same user_id always lands in the same bucket for the same universe, even across processes and across deploys, because the hash function is hash(universe_salt, user_id) and the salt is set when the universe is created.
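The determinism described above can be illustrated with a minimal sketch. The source only states that bucketing uses hash(universe_salt, user_id); the choice of SHA-256, the ":" separator, and the helper names `bucket_point` and `in_holdout` are assumptions made for illustration, not ShipEasy's actual implementation.

```python
import hashlib


def bucket_point(universe_salt: str, user_id: str) -> float:
    """Map (salt, user) to a stable point in [0, 1).

    Assumption: SHA-256 over "salt:user_id"; the real hash is not documented here.
    """
    digest = hashlib.sha256(f"{universe_salt}:{user_id}".encode("utf-8")).digest()
    # Keep the top 53 bits so the division below is exact and always < 1.0.
    top_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_bits / 2**53


def in_holdout(universe_salt: str, user_id: str, holdout_pct: float) -> bool:
    """Assumption: the holdout occupies the first holdout_pct percent of the line."""
    return bucket_point(universe_salt, user_id) * 100 < holdout_pct


# Same inputs always give the same point, in any process and on any deploy.
point = bucket_point("checkout-salt", "user-123")
same_point = bucket_point("checkout-salt", "user-123")
```

Because the point depends only on the salt and the user ID, no assignment table has to be stored or synchronised; changing the salt is what moves users to new points.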
Holdouts
A 5% holdout in the checkout universe means: 5% of users are pinned to never see any experiment in that universe. They always get the legacy code path. With a holdout you can:
- Measure aggregate program lift. Compare the conversion rate of holdout users vs non-holdout users to estimate the net effect of all your checkout experimentation over time. This is the only honest way to answer "is the experimentation program collectively worth it?"
- Trust individual results more. When an experiment looks strong, the holdout is the sanity check that the broader population is also moving — a real win shows up in aggregate, not just inside the variant.
- Roll back quickly. Kill the universe and, on the next poll cycle (at most your plan's poll interval, typically 30s–5min), every user falls back to the holdout's code path.
Holdouts are persistent: a user assigned to the holdout stays in it across experiments and across time, as long as the universe's salt and holdout percentage don't change.
shipeasy universes create checkout --holdout 5
You can change the holdout size later, but bumping it up reshuffles bucketing: some users move in and some move out, which invalidates the long-term holdout comparison for that period. Avoid changing it during an active experiment.
An experiment's control group is still exposed to every other experiment that's running. The holdout is the only group exposed to none of them. It answers a different question: "is everything we're shipping, taken together, doing anything?"
Mutual exclusion
When you flip mutual exclusion on in a universe, ShipEasy guarantees a user is assigned to at most one running experiment in that universe. The bucket they fall into determines which experiment.
Two experiments that touch the same surface can confound each other — if checkout-cta and checkout-layout overlap on the same users, you can't tell which experiment moved the metric. Mutual exclusion makes the universe a fixed-pie model: experiments compete for the same allocation, you can't over-allocate, and the analysis stays clean.
shipeasy universes create checkout --holdout 5 --mutex
How allocation slots work under mutex
With mutex on, the universe's traffic is laid out on a single number line [0, 1). The holdout takes the first slice, and each running experiment claims a contiguous slice after it equal to its allocation %. A user's hash maps to a single point on the line; whichever slice contains it is the experiment they get.
| holdout 5% | exp A 30% | exp B 20% | exp C 10% | unallocated 35% |
0           .05         .35         .55         .65               1.0
Stop experiment A and the slice between .05 and .35 becomes unallocated on the next poll. Add experiment D with 15% allocation and it gets appended to the right of the last allocated slice. Allocations don't reshuffle when other experiments start or stop, which keeps user assignments stable.
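The slot behaviour described above can be sketched as a small in-memory model. The class and method names (`MutexLine`, `add`, `stop`, `assign`) are hypothetical, and the rule that a new experiment starts at the right edge of the last slice ever allocated is an assumption drawn from the example, not a documented ShipEasy internal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    name: str
    start_pct: int  # inclusive left edge, in percent of the line
    end_pct: int    # exclusive right edge
    running: bool = True


class MutexLine:
    """Hypothetical model of one universe's [0, 1) line under mutex."""

    def __init__(self, holdout_pct: int) -> None:
        self.holdout_pct = holdout_pct
        self.slots: List[Slot] = []

    def add(self, name: str, allocation_pct: int) -> Slot:
        # Append to the right of the last slice; existing slices never move.
        start = self.slots[-1].end_pct if self.slots else self.holdout_pct
        end = start + allocation_pct
        if end > 100:
            raise ValueError("not enough unallocated traffic to the right")
        slot = Slot(name, start, end)
        self.slots.append(slot)
        return slot

    def stop(self, name: str) -> None:
        # A stopped experiment keeps its position; its slice reads as unallocated.
        for slot in self.slots:
            if slot.name == name:
                slot.running = False

    def assign(self, point: float) -> Optional[str]:
        """Return 'holdout', an experiment name, or None for unallocated."""
        pct = point * 100
        if pct < self.holdout_pct:
            return "holdout"
        for slot in self.slots:
            if slot.running and slot.start_pct <= pct < slot.end_pct:
                return slot.name
        return None


line = MutexLine(holdout_pct=5)
for exp_name, exp_pct in [("A", 30), ("B", 20), ("C", 10)]:
    line.add(exp_name, exp_pct)
before = [line.assign(p) for p in (0.02, 0.20, 0.45, 0.60, 0.90)]

line.stop("A")
slot_d = line.add("D", 15)
after = [line.assign(p) for p in (0.02, 0.20, 0.45, 0.60, 0.70)]
```

In this model a user at point 0.45 stays in experiment B before and after A stops and D starts, which is the stability property the fixed slices are meant to provide.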
Targeting gates on universes
A universe can carry a targeting gate — a regular gate that decides whether a user is eligible for the universe at all. Common shapes:
- "Logged-in users only": attr: user_id, op: neq, value: ""
- "Pro plan only": attr: plan, op: eq, value: "pro"
- "Specific country": attr: country, op: in, value: ["US","CA"]
- "Internal accounts excluded": attr: email, op: not_ends_with, value: "@yourcompany.com"
Users not eligible for the universe never enter the holdout, never see any experiment, and never count toward analysis. This is the safe place to exclude bots, internal staff, and bad-actor accounts — they vanish from every result row in the universe at once.
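To make the rule shapes listed above concrete, here is a minimal sketch of how a single attr/op/value rule could be evaluated against a user's attributes. Only the four operators named in the examples are covered; the function name `passes_rule`, the dict-based rule format, and treating a missing attribute as an empty string are all assumptions, not ShipEasy's documented evaluator.

```python
from typing import Any, Callable, Dict, Mapping

# Operator table for the ops that appear in the examples above.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": lambda actual, expected: actual == expected,
    "neq": lambda actual, expected: actual != expected,
    "in": lambda actual, expected: actual in expected,
    "not_ends_with": lambda actual, expected: not str(actual).endswith(expected),
}


def passes_rule(rule: Mapping[str, Any], user: Mapping[str, Any]) -> bool:
    """Return True if the user satisfies one attr/op/value rule.

    Assumption: a missing attribute is treated as the empty string.
    """
    actual = user.get(rule["attr"], "")
    return OPS[rule["op"]](actual, rule["value"])


logged_in = {"attr": "user_id", "op": "neq", "value": ""}
us_or_ca = {"attr": "country", "op": "in", "value": ["US", "CA"]}
not_internal = {"attr": "email", "op": "not_ends_with", "value": "@yourcompany.com"}

visitor = {"user_id": "u_42", "country": "CA", "email": "sam@example.com"}
staff = {"user_id": "u_7", "country": "US", "email": "kim@yourcompany.com"}
anonymous = {"country": "US"}
```

A user who fails the gate would then skip holdout and experiment assignment entirely, matching the exclusion behaviour described above.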
When to add a universe
Start with one universe per product surface: checkout, onboarding, pricing. Inside each,
run experiments freely. Across them, you're free to overlap because users on the pricing page
aren't the same population as users at checkout.
A pattern that works well at scale:
| Universe | Holdout | Mutex | Notes |
|---|---|---|---|
| default | 0% | off | Catch-all for cheap tests on different surfaces. |
| onboarding | 5% | on | High-leverage area; long-term holdout to measure cumulative lift. |
| checkout | 5% | on | Same. |
| pricing | 0% | on | No holdout because we don't want users with weird prices for too long. |
| growth | 10% | on | Long-term holdout for marketing surfaces; report quarterly to the business. |
Rules of thumb:
- Mutex on by default for any surface where experiments could plausibly overlap. The cost is a smaller addressable population per experiment; the benefit is interpretable results.
- Holdout 5% is the sweet spot for most teams: large enough to detect aggregate lift over a quarter, small enough that you're not parking 1-in-10 users on legacy code.
- Holdout 0% is fine when the surface is short-lived (a single launch), when traffic is too low to detect aggregate lift, or when keeping any user on legacy code is unacceptable (pricing, billing).
Reading the holdout
In analysis, the holdout shows up as a synthetic group named holdout on every experiment in the universe. You can compare any variant against the holdout (instead of against control) for an end-to-end "did our experimentation help, vs not experimenting at all?" readout.
A typical universe-level summary in the dashboard:
checkout universe · last 30 days
holdout (5%) N=18,402 purchase_rate=4.6% AOV=$48.10
non-holdout (95%) N=349,671 purchase_rate=5.1% AOV=$49.40
Aggregate lift vs holdout:
purchase_rate +10.9% p=0.003 CI [+3.8%, +18.0%]
AOV +2.7% p=0.142 CI [-0.9%, +6.3%]
This is the single best argument for keeping a holdout: it's the only way to detect when your experimentation program collectively doesn't move the needle.
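The purchase_rate row of the summary can be sanity-checked by hand. The sketch below computes the relative lift and a two-sided p-value with a pooled two-proportion z-test. The choice of test is an assumption made for illustration; the dashboard's actual method is not specified on this page. The inputs are the rounded rates shown above, so the results are approximate.

```python
import math


def relative_lift(control_rate: float, treated_rate: float) -> float:
    """Relative change of the treated rate over the control rate."""
    return (treated_rate - control_rate) / control_rate


def two_proportion_p_value(rate_a: float, n_a: int, rate_b: float, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test (assumed method)."""
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # For a standard normal, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Rounded figures from the checkout summary: holdout vs non-holdout.
lift = relative_lift(0.046, 0.051)
p_value = two_proportion_p_value(0.046, 18_402, 0.051, 349_671)
print(f"purchase_rate lift: {lift:+.1%}, p = {p_value:.4f}")
```

With these inputs the lift rounds to the +10.9% shown in the summary and the p-value falls in the same range as the reported 0.003, though that agreement alone does not confirm which test the platform runs.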
API · universes.create
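- Name of the universe, for example checkout.
- Holdout percentage, 0–20. Defaults to 0. Bumping it later reshuffles bucketing, so set it once and leave it.
- Mutual exclusion. When true, at most one running experiment per user in this universe. Defaults to false.
- Targeting gate. Users for whom the gate evaluates to false are excluded entirely (no holdout, no experiment, not in analysis).
Switching a user out of the holdout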
Don't. The point is stability. If a specific user (e.g. an internal account) needs to bypass the holdout, use a gate override on the universe's targeting gate to exclude them — they'll then be eligible for experiments normally. Manually moving users in and out of the holdout breaks the comparability of every aggregate measurement and is not exposed in the API on purpose.
Changing the salt rebuckets every user. Any in-flight experiment in the universe will see its assignments reshuffle, the t-test will mix two populations, and results become uninterpretable. If you absolutely must reshuffle, stop every experiment in the universe first.
Pick a metric worth moving.
Universes give you a clean playing field. Metrics are how you score the game — primary, guardrail, and the outlier handling that keeps the numbers honest.