Edge cases

Sticky bucketing across devices

Bucketing is hash(experimentSalt + identityKey) % 10_000. The same identityKey always maps to the same bucket, so the same user lands in the same variant on every device, in every session, forever — as long as the key stays stable.
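
As a rough illustration of that formula, here is a minimal sketch. The hash is an assumption: FNV-1a stands in for whatever function the SDK really uses, and the helper names `bucketFor` and `variantFor` are hypothetical.

```typescript
// Hypothetical sketch: FNV-1a is a stand-in, not the SDK's actual hash.
const BUCKETS = 10_000;

/** 32-bit FNV-1a over the string's UTF-16 code units. */
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

/** Bucket in [0, 10_000) for a given salt and identity key. */
function bucketFor(experimentSalt: string, identityKey: string): number {
  return fnv1a32(experimentSalt + identityKey) % BUCKETS;
}

/** Map a bucket to a variant for a 50/50 split. */
function variantFor(bucket: number): "control" | "treatment" {
  return bucket < BUCKETS / 2 ? "control" : "treatment";
}

// Same salt + same key always gives the same bucket, whatever the device.
const onPhone = bucketFor("checkout-v2", "user-123");
const onLaptop = bucketFor("checkout-v2", "user-123");
console.log(onPhone === onLaptop, variantFor(onPhone));
```

Nothing in the function depends on the device or the session, which is why the only thing that can break stickiness is the key itself changing.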

That stability is the whole trick. Pick the wrong key and "sticky" silently means "sticky per device", "sticky per session", or "sticky until the cookie clears". Pick it once and pick it for real:

import { gate } from "@shipeasy/sdk/server";

// Good — stable across devices and sessions
await gate("checkout-v2", { userId: session.user.id });

// Acceptable for logged-out — survives across sessions on one device
await gate("checkout-v2", { userId: anonymousIdFromCookie() });

// Bad — re-buckets every request, every reload, every tab
await gate("checkout-v2", { userId: crypto.randomUUID() });

If a visitor signs in mid-experiment they will re-bucket: anon cookie ID → real user ID changes the hash input. That's almost always what you want (you'd rather attribute the conversion to the signed-in user than to a throwaway), but it does mean you can't compare the same user across the sign-in boundary. If you need that — e.g. measuring sign-up rate itself — bucket on a stable device id that survives auth, and pass userId separately for analytics only.

The platform never mutates the salt for you. If you regenerate it (delete + recreate the experiment, rename it in a way that changes the hash input), every user re-buckets and your running results are invalid. Don't.

Holdouts that overlap experiments

A holdout is a slice of traffic that never sees any treatment in the universe — they always get baseline, for as long as the holdout runs. It's how you measure the cumulative lift of all the experiments you shipped this quarter, not just the lift of each one in isolation.

Allocation composes multiplicatively. With a 10% holdout and a 50/50 experiment:

Bucket	Share	Sees
Holdout	10%	Baseline (always)
Experiment, control	45%	Baseline (this round)
Experiment, treatment	45%	Treatment

That has two consequences worth internalising:

Power drops. A 10% holdout removes 10% of your sample from every experiment. A 50/50 split that used to need 14 days now needs ~15.5 to hit the same MDE. Run holdouts when you care more about the long-term answer than about shipping one extra experiment a quarter.
"Treatment" in the experiment is not "treatment overall." The 45% control group sees today's baseline, which already includes everything previous experiments shipped. The holdout sees the pre-quarter baseline. Quarterly readout = holdout vs. everyone-else, not control vs. treatment.

Holdouts live on the universe, not on individual experiments. That's deliberate — it's the only way the holdout slice stays the same across experiments and quarters; tying it per-experiment would re-bucket users every time you launched something new.
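
The multiplicative allocation and the runtime stretch can be reproduced with a few lines of arithmetic. This is a sketch only; `allocate` and `stretchedRuntimeDays` are hypothetical helper names, and the runtime formula assumes sample size is the only thing that changes.

```typescript
/** Shares of total traffic when the holdout is carved out before the experiment split. */
function allocate(holdoutShare: number, treatmentShare: number) {
  const inExperiment = 1 - holdoutShare;
  return {
    holdout: holdoutShare,
    control: inExperiment * (1 - treatmentShare),
    treatment: inExperiment * treatmentShare,
  };
}

/** Days needed to collect the same sample once the holdout traffic is removed. */
function stretchedRuntimeDays(baseDays: number, holdoutShare: number): number {
  return baseDays / (1 - holdoutShare);
}

const shares = allocate(0.1, 0.5);
console.log(shares); // holdout 10%, control 45%, treatment 45%
console.log(stretchedRuntimeDays(14, 0.1).toFixed(1)); // "15.6"
```

A 14-day plan divided by the remaining 90% of traffic gives about 15.6 days, in line with the ~15.5 quoted above.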

Sample ratio mismatch (SRM)

You asked for 50/50. You're getting 49/51 with p<0.001 on a chi-squared. That is not bad luck at 100k users: a 49,000/51,000 split on n=100,000 gives a chi-squared statistic of 40 on one degree of freedom, which random sampling produces roughly once in 4 billion tries. Something assigned users non-uniformly.

The daily analysis job runs chi-squared on the observed assignment counts and flags SRM on the experiment status page. When you see the flag, stop reading the results. The lift number is no longer trustworthy: whatever caused the imbalance is almost certainly also correlated with the metric.

The usual suspects, in order of how often we see them:

Logging racing assignment. The SDK assigns synchronously; the analytics call is async. A page that navigates away before the event flushes loses one variant more than the other (the treatment renders slower, the tab closes mid-render, the event never lands).
Bot filtering on one variant only. Bots are deterministic — they always hash into the same bucket. If your bot filter runs after assignment but before the event hits Analytics Engine, the bot-heavy variant gets cut.
Caching at the edge. A page is cached without the variant in the cache key, and one variant is served from cache (no assignment event) while the other re-renders (assignment event fires).
Sticky bucketing leak. Logged-out users on a country page use IP-based identity; the IP changes per request behind a corporate NAT; one variant happens to be cached.

Treat SRM as a bug in the logging or assignment path, not as a stats problem to round away.
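
For reference, the chi-squared check on assignment counts is small enough to run by hand. The sketch below covers the two-arm case only; the p < 0.001 threshold is an assumption borrowed from the example above (the analysis job's real threshold is not stated), and `srmCheck` is a hypothetical name.

```typescript
// Critical value of chi-squared with 1 degree of freedom at p = 0.001.
// Assumption: the real analysis job may use a different threshold.
const CHI2_CRIT_DF1_P001 = 10.828;

interface SrmResult {
  statistic: number;
  flagged: boolean;
}

/** Two-arm SRM check: compares observed counts with the intended split. */
function srmCheck(
  controlCount: number,
  treatmentCount: number,
  expectedControlShare = 0.5,
): SrmResult {
  const total = controlCount + treatmentCount;
  const expectedControl = total * expectedControlShare;
  const expectedTreatment = total - expectedControl;
  const statistic =
    (controlCount - expectedControl) ** 2 / expectedControl +
    (treatmentCount - expectedTreatment) ** 2 / expectedTreatment;
  return { statistic, flagged: statistic > CHI2_CRIT_DF1_P001 };
}

console.log(srmCheck(49_000, 51_000)); // statistic 40, flagged
console.log(srmCheck(49_900, 50_100)); // statistic 0.4, not flagged
```

The second call shows why sample size matters: a 49.9/50.1 split on the same 100k users is well within what chance produces.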

Write propagation latency

Edits in the dashboard write to D1, rebuild the KV blob, and explicit-purge the CDN edge cache. P99 end-to-end is under a second; P50 is ~150ms. The SDK reads the KV blob, which is cached at the nearest edge with an infinite TTL until purged — that's how the hot path stays sub-10ms.

The 1% case is when an edge node you're hitting hasn't seen the purge yet and serves the previous blob. For most flags, that's fine: a user evaluates one extra request against the old value, then the next evaluation has the new one. For killswitches in an active incident, "one extra request" might be a thousand emails sent or a million dollars charged.

For instant-kill semantics, evaluate server-side at request time and short-circuit on the killswitch before doing the dangerous thing:

import { gate } from "@shipeasy/sdk/server";

export async function POST(req: Request) {
  if (!(await gate("emails-enabled"))) {
    return new Response("paused", { status: 503 });
  }
  return sendEmail(await req.json());
}

The SDK has a small in-memory cache (default 10s) on top of the KV read. If you cannot tolerate even that, pass { maxAge: 0 } and pay one KV round-trip per call.
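
To make the trade-off concrete, here is a minimal sketch of a time-bounded cache in front of a slow read. It is not the SDK's implementation: `TtlCache` is a hypothetical class, the loader stands in for the KV read, and only the 10s default and the "max age of zero means always re-read" behaviour are taken from the text.

```typescript
/** Tiny TTL cache in front of an async loader (a stand-in for the KV read). */
class TtlCache<T> {
  private value: T | undefined = undefined;
  private fetchedAt = Number.NEGATIVE_INFINITY;
  private readonly load: () => Promise<T>;
  private readonly now: () => number;

  constructor(load: () => Promise<T>, now: () => number = Date.now) {
    this.load = load;
    this.now = now;
  }

  /** Returns the cached value if it is younger than maxAgeMs, else reloads. */
  async get(maxAgeMs = 10_000): Promise<T> {
    const age = this.now() - this.fetchedAt;
    if (this.value !== undefined && age < maxAgeMs) {
      return this.value; // fast path: no round-trip
    }
    this.value = await this.load(); // slow path: one round-trip
    this.fetchedAt = this.now();
    return this.value;
  }
}

// Usage: a max age of 0 never satisfies `age < maxAgeMs`, so every call reloads.
const cache = new TtlCache(async () => ({ "emails-enabled": true }));
cache.get(0).then((flags) => console.log(flags["emails-enabled"]));
```

With the default, a flipped killswitch can take up to the cache window to be seen by a given process; with a max age of 0 that window closes at the cost of a read per call.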

SSR/CSR flicker

The default client-side gate() returns false on first render (no value yet), then the real value on hydration. On a page where the flag controls a visible component, that's a flash of the control variant before the treatment appears — exactly the flicker we don't want.

The fix is to evaluate on the server, ship the result down with the page, and hand it to the client SDK so the first client render already has the right value:

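// app/page.tsx (Server Component)
import { gate, config } from "@shipeasy/sdk/server"; // assumes config is exported next to gate
import { ShipeasyProvider } from "@/components/shipeasy-provider";
import { getSession } from "@/lib/session"; // hypothetical: your own session helper
import { PageContent } from "@/components/page-content"; // hypothetical: the page body

export default async function Page() {
  // Resolve the identity key on the server so bucketing matches the client
  const session = await getSession();
  const userId = session.user.id;
  const flags = {
    "checkout-v2": await gate("checkout-v2", { userId }),
    "hero.title": await config<string>("hero.title"),
  };
  return (
    <ShipeasyProvider initial={flags}>
      {/* Render the page body here, not <Page /> itself, which would recurse */}
      <PageContent />
    </ShipeasyProvider>
  );
}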

// components/shipeasy-provider.tsx (Client Component)
"use client";
import { shipeasy } from "@shipeasy/sdk/client";
import { useEffect, type ReactNode } from "react";

export function ShipeasyProvider({ initial, children }: { initial: Record<string, unknown>; children: ReactNode }) {
  useEffect(() => {
    shipeasy({
      apiKey: process.env.NEXT_PUBLIC_SHIPEASY_CLIENT_KEY!,
      initial,
    });
  }, [initial]);
  return children;
}

The first client render reads initial["checkout-v2"], which matches the SSR output, so there is no hydration mismatch. The SDK takes over for subsequent evaluations.

Fail-safe defaults

If KV is unreachable (network blip, cold start, transient edge issue), the SDK returns:

gate(name) → false
config(name) → last-known-good from the in-memory cache, else undefined
experiment(name).variant → "control" (the universe baseline)

That's the safe default for the common case, where "feature off" is benign. The dangerous case is when off is itself the failure mode: a gate like emails-enabled that falls back to false would pause every email each time KV blips, turning a read failure into an outage. Override per call:

// "off" is dangerous: fall back to true so a read failure fails open
await gate("emails-enabled", ctx, { defaultValue: true });

// "treatment" is dangerous — pin to control on read failure
await experiment("paywall-test", ctx, { fallbackVariant: "control" });

Set defaults at the call site, not in a wrapper. The defaultValue is the contract of "what does this code do if Shipeasy is dead?" and it should be readable next to the code that depends on it.

More edge cases

The six below are the second wave — failure modes that bite once you've shipped a few experiments and a dozen feature flags, not on day one. One per primitive in this section's sub-menu.

Feature flags → flag sprawl. Dead feature flags at 100% are technical debt. How to find them and rip them out without breaking code.
Configs → schema drift. A dashboard value that no longer matches your code's expected shape. Schema validation + rollback strategy.
Killswitches → decay. A killswitch that hasn't been flipped in 6 months might still work. Or it might not. Rehearsal is the only way to know.
Experiments → peeking. Daily p-value checks blow the false-positive rate to ~30%. The honest options for early looks.
Metrics → low-traffic experiments. Underpowered tests look "neutral", but neutral and "we can't tell" are different conclusions.
Cross-env flag drift. Staging and prod versions of the same flag fall out of sync. How to find it before users do.

Feature flag sprawl

After 12 months of shipping, you have 187 feature flags. Half are at 100% and have been for months. The code that reads them is dead conditionals — a branch the linter still respects but no human will ever take.

That's not aesthetic — it's a real cost:

Every if (await gate(...)) adds a layer of indirection in code review. New engineers spend time tracing why a 100% feature flag exists before realising it's dead.
Removed features that left their feature flags behind become Chesterton's-fence puzzles — was this rolled back? Is it still in flight? Nobody remembers.
Stale feature flags are still polled. The KV bundle isn't huge, but every feature flag's targeting rules and override list ship with every poll.

The Shipeasy Cleanup view surfaces candidates: feature flags at 100% or 0% for more than 30 days, with no targeting rule changes in 60 days. Run through it monthly. For each candidate, three options:

Delete. The default. The feature flag is at 100% (or 0%), the code branch behind it is settled — rip out the conditional, delete the feature flag.
Convert to a killswitch. If the feature flag is "at 100% but you want to keep the lever," promote it to a killswitch. Different semantics, different sub-menu, signals to the team it's incident-grade.
Document why it stays. Rare — usually because of a planned future change. Add a note to the feature flag description so a future cleanup pass doesn't axe it by mistake.

Automate the Delete option with the CLI:

# Every gate at 100% rollout (filter to long-stable ones using the
# created_at field + your own threshold)
shipeasy release flags list | jq -r '.[] | select(.rolloutPct == 10000) | .name'

# Sanity-check the codebase no longer references each candidate before
# removing it. grep is the simplest correct tool; adjust the pattern if
# your code reads flags through something other than gate(...):
git grep -nE "gate\\(['\"](dark-mode|checkout-v2|new-search)['\"]"

# Once nothing matches, archive them one at a time (one name per
# invocation):
shipeasy release flags archive dark-mode
shipeasy release flags archive checkout-v2
shipeasy release flags archive new-search

The grep step is non-negotiable. Deleting a feature flag that's still referenced in code is fine (the SDK returns the defaultValue), but you want the dead conditional gone too — otherwise you've just moved the technical debt.
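
The candidate rule quoted earlier (100% or 0% for more than 30 days, no targeting rule changes in 60 days) is easy to approximate yourself if you export flag metadata. A minimal sketch, assuming a hypothetical record shape: only `name` and `rolloutPct` appear in the CLI output shown above, and the two timestamp fields are invented for illustration.

```typescript
// Hypothetical flag shape: the two timestamp fields are assumptions.
interface FlagSummary {
  name: string;
  rolloutPct: number; // basis points, 0..10_000
  rolloutChangedAt: Date; // last time rolloutPct changed
  rulesChangedAt: Date; // last targeting rule edit
}

const DAY_MS = 24 * 60 * 60 * 1000;

/** Names of flags pinned at 0% or 100% for 30+ days with no rule edits in 60 days. */
function cleanupCandidates(flags: FlagSummary[], now: Date): string[] {
  const ageDays = (d: Date) => (now.getTime() - d.getTime()) / DAY_MS;
  return flags
    .filter((f) => f.rolloutPct === 0 || f.rolloutPct === 10_000)
    .filter((f) => ageDays(f.rolloutChangedAt) > 30)
    .filter((f) => ageDays(f.rulesChangedAt) > 60)
    .map((f) => f.name);
}

const today = new Date("2025-06-01T00:00:00Z");
console.log(
  cleanupCandidates(
    [
      { name: "dark-mode", rolloutPct: 10_000, rolloutChangedAt: new Date("2025-01-10"), rulesChangedAt: new Date("2025-01-10") },
      { name: "new-search", rolloutPct: 2_500, rolloutChangedAt: new Date("2025-01-10"), rulesChangedAt: new Date("2025-01-10") },
    ],
    today,
  ),
);
```

Only dark-mode qualifies here: new-search is old enough but still mid-rollout, so it is not a cleanup candidate.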

Config schema drift

You stored a structured config — say a list of feature tiers with prices. Six months later, a new engineer adds a description field to the type, ships the code, and reads the config:

const tiers = await config<Tier[]>("plans.tiers");
return tiers.map((t) => (
  <Card key={t.id} title={t.name} desc={t.description}>...</Card>
));

The dashboard JSON still has the old shape — no description. In dev, the cards render with undefined where the description should be. In prod the same.

The cause: the value in the dashboard is decoupled from the shape in code. Adding a field to the TypeScript type doesn't add it to the dashboard. The SDK doesn't validate.

Two patterns to defend against this:

1. Validate on read

Use a schema (Zod, Valibot, runtypes) and validate every read. Fall back to a safe default if the parse fails, and log loudly so you find out about the drift:

import { z } from "zod";

const TierSchema = z.object({
  id: z.string(),
  name: z.string(),
  description: z.string().default(""),
  priceCents: z.number().int().nonnegative(),
});

const tiers = await config<unknown>("plans.tiers", { default: [] });
const parsed = z.array(TierSchema).safeParse(tiers);
if (!parsed.success) {
  reportToSentry("Config schema drift", { configName: "plans.tiers", issues: parsed.error.issues });
  return DEFAULT_TIERS;
}
return parsed.data;

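The schema defaults (description: z.string().default("")) absorb additive changes without breaking. A required field that disappears from the dashboard value still fails the parse and lands in the fallback branch. A key the schema no longer declares is silently stripped by default; add .strict() to the object schema if you want that reported too.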

2. Version the config name

When the shape changes incompatibly, rename:

// Old code path
const tiers = await config<TierV1[]>("plans.tiers.v1");
// New code path
const tiers = await config<TierV2[]>("plans.tiers.v2");

Two configs co-exist briefly while you migrate. Old shape readers keep working; new shape readers look at the new name. When the migration's done, delete the old config. The dashboard treats them as unrelated, so a typo in v2 doesn't break v1.

Use this for structural changes (renaming a field, changing a type). For additive changes, the schema-with-defaults pattern is enough.

Killswitches decay

A killswitch is only useful if it works during the incident. A killswitch that hasn't been flipped in 6 months probably still works — but you don't know.

Three failure modes silently break a killswitch:

The wrapped code path was refactored. Someone moved the dangerous call to a new function and forgot the if (!(await gate(...))) wrapper. The killswitch still exists; flipping it does nothing.
A new code path bypasses it. A worker, a batch job, or a webhook handler was added later that calls the same dangerous side effect without the wrapper. The killswitch covers the old path; the new path is naked.
The webhook moved. The PagerDuty integration URL changed, no one updated the killswitch.flipped webhook, flipping the kill no longer pages.

The fix is rehearsal. Once a quarter, run a drill in staging:

# Flip the killswitch ON (kill = true) for the drill
shipeasy release killswitch update transactional.emails-enabled --value true

# Run the canonical test that exercises every email-sending code path
pnpm test:integration --grep email

# Confirm zero emails went out, then restore
shipeasy release killswitch update transactional.emails-enabled --value false

If any test sent an email while the killswitch was engaged, you've found a naked code path. Fix it before you need the killswitch in production.

The Shipeasy dashboard flags killswitches that haven't been flipped (in any env) in 90+ days with a yellow "stale" chip. Treat that chip as "drill overdue."

Peeking & sequential testing

You launched an experiment with a 14-day plan. On day 3 you peek and the p-value is 0.04. Tempting to call it a win and ship. Don't.

Daily peeking on a fixed-horizon t-test inflates the false-positive rate dramatically:

Peeks during experiment	Nominal α	Effective false-positive rate
1 (planned end)	0.05	0.050
5 (weekly)	0.05	~0.14
14 (daily)	0.05	~0.28
30 (continuous)	0.05	~0.40

So a "p < 0.05" call from daily peeking really carries a false-positive rate of about 28%. When there is no true effect, you'd still declare a winner more than 1 time in 4.
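
The inflation is easy to see in an A/A simulation: generate data with no true effect, test at every look, and count how often any look crosses the 5% threshold. The sketch below is a simplification, assuming a z-test with known variance and equally spaced looks rather than the t-test on real traffic, so its rates are indicative and need not match the table digit for digit. All function names are hypothetical.

```typescript
/** Small seeded PRNG (mulberry32) so runs are repeatable. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Standard normal draw via Box-Muller. */
function standardNormal(rand: () => number): number {
  const u = 1 - rand(); // in (0, 1], avoids log(0)
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

/** Share of no-effect experiments that cross |z| > 1.96 at any of `looks` peeks. */
function falsePositiveRate(looks: number, simulations: number, seed = 1): number {
  const rand = mulberry32(seed);
  const Z_CRIT = 1.96; // two-sided 5% threshold
  let rejections = 0;
  for (let s = 0; s < simulations; s++) {
    let sum = 0; // running sum of equally informative increments
    for (let k = 1; k <= looks; k++) {
      sum += standardNormal(rand);
      if (Math.abs(sum / Math.sqrt(k)) > Z_CRIT) {
        rejections++; // stopped early and "shipped" a null result
        break;
      }
    }
  }
  return rejections / simulations;
}

for (const looks of [1, 5, 14, 30]) {
  console.log(looks, falsePositiveRate(looks, 20_000).toFixed(3));
}
```

With a single look the rate sits near the nominal 5%; every extra look adds another chance for noise to cross the line.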

Two honest options:

1. Pre-commit to a duration and don't peek

Set duration from your power calculation. Resist looking. The dashboard can enforce this for you: open the experiment's Analysis settings and tick "Hide interim results until planned end". The running p-value and CI are replaced by a countdown to the planned horizon, and the panel unlocks once the experiment reaches it.

2. Use sequential testing

Sequential tests (mSPRT, group-sequential bounds) compute adjusted p-values that account for continuous looks. Peeking is fine; the math handles it.

This is a project-wide default on the Team plan and above (sequential_testing: true per packages/core/src/config/plans.ts). To enable on a specific experiment, open its Analysis settings and switch the analysis mode to Sequential.

Sequential is less powerful at the planned horizon than fixed-horizon — you pay a small efficiency tax for the right to peek. If you don't need to peek, fixed-horizon is more efficient. If you do (e.g. safety-critical experiment where you want to stop the moment a regression is visible), sequential is the only honest choice.

Either way: don't peek under fixed-horizon and then claim significance. That's the most common methodological error in industry A/B testing and the easiest one to spot in a post-mortem.

Low-traffic experiments

You have 200 users a day. You want to test a checkout change. Baseline conversion is 20%. You want to detect a +10% relative lift.

Per the power table: you need ~6,500 users per arm. At 100 users per arm per day (50/50 split), that's 65 days. Plus a week of ramp-up, plus weekly seasonality. Call it 90 days.
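
The per-arm figure comes from a standard two-proportion power calculation, which is short enough to run yourself. A minimal sketch, assuming a two-sided α of 0.05 and 80% power (the power table's own settings are not restated here). Under those assumptions the quoted ~6,500 per arm corresponds to a 20% baseline with a +10% relative lift. Function names are hypothetical.

```typescript
// Standard normal quantiles: two-sided alpha = 0.05 and power = 0.80 (assumed).
const Z_ALPHA = 1.959964;
const Z_BETA = 0.841621;

/** Users needed per arm to detect a relative lift on a conversion rate. */
function sampleSizePerArm(baselineRate: number, relativeLift: number): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * variance) / delta ** 2);
}

/** Whole days needed at a given daily intake per arm. */
function daysToRun(perArm: number, usersPerArmPerDay: number): number {
  return Math.ceil(perArm / usersPerArmPerDay);
}

const perArm = sampleSizePerArm(0.2, 0.1); // roughly 6,500
console.log(perArm, daysToRun(perArm, 100)); // a little over two months
```

The required sample grows with the inverse square of the lift you want to detect, which is why halving the MDE roughly quadruples the runtime.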

Three failure modes when you ignore this:

You stop at 14 days "because the lift looks neutral." Neutral underpowered is not the same as zero — your CI is [-25%, +25%]. You learned nothing.
You see a "win" at day 7 and ship. That's noise (see Peeking above) or day-of-week effects.
You stop at 30 days because the dashboard says "trending positive." Same noise problem.

Three things you can do with low traffic:

Loosen the MDE

You don't need to detect a 10% lift. Detecting a 25% lift requires ~1,000 users per arm, or 10 days. A 25% lift is also a more practically interesting hypothesis — small improvements aren't worth the org cost of running experiments at scale.

Adjust the MDE in the experiment wizard's Power calc card (you can nudge it on a draft) — the wizard recomputes runtime live as you slide the value. For running experiments, edit min_runtime_days / min_sample_size instead:

shipeasy release experiments update paywall-v2 \
  --min-runtime-days 10 --min-sample-size 1000

CUPED for variance reduction

If you have a pre-experiment covariate correlated with the metric, CUPED can shave 10–30% off the variance. Because required sample size scales with variance, that is equivalent to roughly 11–43% more users.

shipeasy metrics create purchase_conversion \
  --event-name purchase \
  --query 'count_users(purchase)'
# Enable CUPED + pick the pre-experiment covariate on the metric's
# Variance reduction tab in the dashboard (Team plan and above).
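
For intuition, the CUPED adjustment itself is one line of algebra: subtract from each user's metric the part predicted by their pre-experiment covariate. The sketch below shows the textbook form, not the platform's implementation; function names are hypothetical.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

/** Sample covariance; covariance(xs, xs) is the sample variance. */
function covariance(xs: number[], ys: number[]): number {
  const mx = mean(xs);
  const my = mean(ys);
  let acc = 0;
  for (let i = 0; i < xs.length; i++) {
    acc += (xs[i] - mx) * (ys[i] - my);
  }
  return acc / (xs.length - 1);
}

/** CUPED: y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x). */
function cupedAdjust(metric: number[], covariate: number[]): number[] {
  const theta = covariance(covariate, metric) / covariance(covariate, covariate);
  const mx = mean(covariate);
  return metric.map((y, i) => y - theta * (covariate[i] - mx));
}

// Pre-period spend (covariate) and in-experiment spend (metric) per user.
const pre = [1, 2, 3, 4, 5, 6, 7, 8];
const during = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1];
const adjusted = cupedAdjust(during, pre);
console.log(covariance(during, during), covariance(adjusted, adjusted));
```

The adjusted metric keeps the same mean but has lower variance, and the reduction is larger the more strongly the covariate correlates with the metric.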

Don't run an experiment

The third option is the right one for many low-traffic shops: ramp the change as a gate over two weeks, watch the dashboards for obvious regressions, and ship if nothing breaks. You forgo the rigorous "did this work" answer in exchange for actually getting features out. Sometimes the right trade.

What you should not do: run an underpowered experiment and tell the team "the result was neutral." It wasn't neutral; it was inconclusive. The dashboard tags this state as insufficient_power — read the tag.

Cross-env flag drift

You changed a feature flag in production. The same feature flag in staging is still on the old config. A week later, a new dev tests against staging, sees v1 behaviour, ships code that depends on v1, production breaks because production is on v2.

The cause is straightforward: each environment carries its own flag state, whether as a separate project or as a sibling flag per env, and there's no automatic sync.

Three patterns to keep envs aligned:

1. Mirror writes per-env

Today feature flags live at the project scope (not per-env) — the same rollout % applies in staging and prod. To run different rollouts per env, create a sibling feature flag per env (e.g. checkout-v2.staging, checkout-v2.prod) and pin the SDK to the matching name based on which env's SDK key it's booted with:

shipeasy release flags update checkout-v2.staging --rollout-percent 100
shipeasy release flags update checkout-v2.prod --rollout-percent 25

Per-env values on a single feature flag is on the roadmap — until then, the naming convention is what enforces the boundary.

2. Diff env state by hand

A first-class env-diff tool isn't shipped today. Until it lands, the practical pattern is to dump shipeasy release flags list from each env (by re-binding to the appropriate project / SDK key) and jq the delta:

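# Bound to the staging project / SDK key
shipeasy release flags list > staging.json
# Re-bind to the prod project / SDK key before the second dump
shipeasy release flags list > prod.json
diff <(jq -S . staging.json) <(jq -S . prod.json)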
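
A raw diff gets noisy once the lists grow. If you prefer a structured report, the two dumps can be compared by flag name. This is a sketch under assumptions: it assumes each dump is a JSON array of objects, and only the `name` and `rolloutPct` fields seen in the CLI examples are used. `diffFlags` is a hypothetical helper.

```typescript
// Hypothetical shape: only name and rolloutPct are taken from the CLI examples.
interface FlagState {
  name: string;
  rolloutPct: number;
}

interface Drift {
  onlyInStaging: string[];
  onlyInProd: string[];
  differing: { name: string; staging: number; prod: number }[];
}

/** Compare two flag lists by name and report what has drifted. */
function diffFlags(staging: FlagState[], prod: FlagState[]): Drift {
  const prodByName = new Map<string, FlagState>();
  for (const f of prod) prodByName.set(f.name, f);
  const stagingNames = new Set<string>();
  for (const f of staging) stagingNames.add(f.name);

  const drift: Drift = { onlyInStaging: [], onlyInProd: [], differing: [] };
  for (const s of staging) {
    const p = prodByName.get(s.name);
    if (!p) {
      drift.onlyInStaging.push(s.name);
    } else if (p.rolloutPct !== s.rolloutPct) {
      drift.differing.push({ name: s.name, staging: s.rolloutPct, prod: p.rolloutPct });
    }
  }
  for (const p of prod) {
    if (!stagingNames.has(p.name)) drift.onlyInProd.push(p.name);
  }
  return drift;
}

console.log(
  diffFlags(
    [{ name: "checkout-v2", rolloutPct: 10_000 }, { name: "new-search", rolloutPct: 5_000 }],
    [{ name: "checkout-v2", rolloutPct: 2_500 }, { name: "dark-mode", rolloutPct: 10_000 }],
  ),
);
```

Extending the comparison to targeting rules would follow the same pattern, once you know the shape those rules take in the dump.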

3. Promote rather than re-create

For new flags, create in staging, qualify, then promote to prod rather than re-creating:

# Read the qualified definition out of one project and create it in the other
shipeasy release flags get checkout-v2 > checkout-v2.json

shipeasy release flags create checkout-v2 \
  --rules "$(jq -c .rules checkout-v2.json)" \
  --rollout-pct "$(jq -r .rolloutPct checkout-v2.json)" \
  --salt "$(jq -r .salt checkout-v2.json)"

Carrying the --salt across is what makes the two match: bucketing is hash(salt:unit), so the same salt puts the same users in the same slice. Subsequent edits drift again unless you mirror them.

The trade-off you're navigating: dev velocity (different states per env) vs. release safety (same behaviour everywhere). The right balance depends on whether your bugs tend to come from "flag config diverged" or "I couldn't test the new state in staging." Most teams over-correct one way or the other; pick consciously.
