Edge cases

Things that bite you once and never again — sticky bucketing, holdouts, SRM, KV propagation, and more.

The patterns below are the ones you wish someone told you on day one.

Sticky bucketing across devices

Bucketing is hash(experimentSalt + identityKey) % 10_000. The same identityKey always maps to the same bucket, so the same user lands in the same variant on every device, in every session, forever — as long as the key stays stable.
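A minimal sketch of the formula above. It assumes a 32-bit FNV-1a hash as a stand-in (the hash the platform actually uses is not specified here), and variantFor is a hypothetical helper that maps the bucket range onto a 50/50 split:

```typescript
// Illustrative only: FNV-1a is an assumed stand-in for the platform's hash.
// It hashes UTF-16 code units, which is enough to show the determinism.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// hash(experimentSalt + identityKey) % 10_000, as described in the text.
function bucketFor(experimentSalt: string, identityKey: string): number {
  return fnv1a32(experimentSalt + identityKey) % 10_000;
}

// Hypothetical helper: the first `treatmentShare` of the bucket range gets treatment.
function variantFor(
  experimentSalt: string,
  identityKey: string,
  treatmentShare = 0.5,
): "control" | "treatment" {
  return bucketFor(experimentSalt, identityKey) < treatmentShare * 10_000
    ? "treatment"
    : "control";
}

console.log(bucketFor("checkout-v2:salt", "user_123"));
console.log(variantFor("checkout-v2:salt", "user_123"));
```

The same salt and key always produce the same bucket. Change either input and the user lands somewhere else, which is all that "re-bucketing" means.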

That stability is the whole trick. Pick the wrong key and "sticky" silently means "sticky per device", "sticky per session", or "sticky until the cookie clears". Pick it once and pick it for real:

import { gate } from "@shipeasy/sdk/server";

// Good — stable across devices and sessions
await gate("checkout-v2", { userId: session.user.id });

// Acceptable for logged-out — survives across sessions on one device
await gate("checkout-v2", { userId: anonymousIdFromCookie() });

// Bad — re-buckets every request, every reload, every tab
await gate("checkout-v2", { userId: crypto.randomUUID() });

If a visitor signs in mid-experiment they will re-bucket: anon cookie ID → real user ID changes the hash input. That's almost always what you want (you'd rather attribute the conversion to the signed-in user than to a throwaway), but it does mean you can't compare the same user across the sign-in boundary. If you need that — e.g. measuring sign-up rate itself — bucket on a stable device id that survives auth, and pass userId separately for analytics only.

The platform never mutates the salt for you. If you regenerate it (delete + recreate the experiment, rename it in a way that changes the hash input), every user re-buckets and your running results are invalid. Don't.

Holdouts that overlap experiments

A holdout is a slice of traffic that never sees any treatment in the universe — they always get baseline, for as long as the holdout runs. It's how you measure the cumulative lift of all the experiments you shipped this quarter, not just the lift of each one in isolation.

Allocation composes multiplicatively. With a 10% holdout and a 50/50 experiment:

Bucket                  Share   Sees
Holdout                 10%     Baseline (always)
Experiment, control     45%     Baseline (this round)
Experiment, treatment   45%     Treatment

That has two consequences worth internalising:

  1. Power drops. A 10% holdout removes 10% of your sample from every experiment. A 50/50 split that used to need 14 days now needs ~15.5 to hit the same MDE. Run holdouts when you care more about the long-term answer than about shipping one extra experiment a quarter.
  2. "Treatment" in the experiment is not "treatment overall." The 45% control group sees today's baseline, which already includes everything previous experiments shipped. The holdout sees the pre-quarter baseline. Quarterly readout = holdout vs. everyone-else, not control vs. treatment.
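The allocation arithmetic above can be spelled out in a few lines. The function names here are hypothetical; this is not the platform's allocator, only the multiplication and the resulting stretch in run time:

```typescript
interface Allocation {
  holdout: number;
  control: number;
  treatment: number;
}

// Shares compose multiplicatively: the experiment only splits non-holdout traffic.
function allocate(holdoutShare: number, treatmentSplit: number): Allocation {
  const inExperiment = 1 - holdoutShare;
  return {
    holdout: holdoutShare,
    control: inExperiment * (1 - treatmentSplit),
    treatment: inExperiment * treatmentSplit,
  };
}

// Days needed to collect the same sample once the holdout is carved out.
function daysWithHoldout(daysWithoutHoldout: number, holdoutShare: number): number {
  return daysWithoutHoldout / (1 - holdoutShare);
}

console.log(allocate(0.1, 0.5));
console.log(daysWithHoldout(14, 0.1).toFixed(2));
```

With a 10% holdout and a 50/50 split, each arm gets 45% of total traffic, and a 14-day plan stretches to 14 / 0.9 days.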

Holdouts live on the universe, not on individual experiments. That's deliberate — it's the only way the holdout slice stays the same across experiments and quarters; tying it per-experiment would re-bucket users every time you launched something new.

Sample ratio mismatch (SRM)

You asked for 50/50. You're getting 49/51 with p<0.001 on a chi-squared. That is not bad luck at 100k users: a 49/51 split on n=100,000 gives χ² = 40 on one degree of freedom, and the chance of an imbalance that large from random sampling is on the order of 1 in 10⁹. Something assigned users non-uniformly.

The daily analysis job runs chi-squared on the observed assignment counts and flags SRM on the experiment status page. When you see the flag, stop reading the results. The lift number is no longer trustworthy: whatever caused the imbalance is almost certainly also correlated with the metric.
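For intuition, here is a sketch of a two-arm chi-squared check. It is not the analysis job's actual code: srmCheck is a hypothetical name, the 0.001 threshold is borrowed from the p<0.001 mentioned above, and erfc uses a well-known rational approximation rather than a stats library:

```typescript
// Complementary error function (Numerical Recipes approximation,
// fractional error below roughly 1.2e-7).
function erfc(x: number): number {
  const z = Math.abs(x);
  const t = 1 / (1 + 0.5 * z);
  const poly =
    -1.26551223 + t * (1.00002368 + t * (0.37409196 + t * (0.09678418 +
    t * (-0.18628806 + t * (0.27886807 + t * (-1.13520398 + t * (1.48851587 +
    t * (-0.82215223 + t * 0.17087277))))))));
  const r = t * Math.exp(-z * z + poly);
  return x >= 0 ? r : 2 - r;
}

interface SrmResult {
  chi2: number;
  pValue: number;
  flagged: boolean;
}

// Chi-squared goodness-of-fit for two arms against the intended split.
function srmCheck(
  controlCount: number,
  treatmentCount: number,
  expectedTreatmentShare = 0.5,
  threshold = 0.001, // assumed alert threshold
): SrmResult {
  const n = controlCount + treatmentCount;
  const expectedTreatment = n * expectedTreatmentShare;
  const expectedControl = n - expectedTreatment;
  const chi2 =
    (controlCount - expectedControl) ** 2 / expectedControl +
    (treatmentCount - expectedTreatment) ** 2 / expectedTreatment;
  // With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
  const pValue = erfc(Math.sqrt(chi2 / 2));
  return { chi2, pValue, flagged: pValue < threshold };
}

console.log(srmCheck(49_000, 51_000));
console.log(srmCheck(49_900, 50_100));
```

The first call is the 49/51 case from the text and is flagged; the second is a 49.9/50.1 split, which is ordinary sampling noise at this sample size.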

The usual suspects, in order of how often we see them:

  • Logging racing assignment. The SDK assigns synchronously; the analytics call is async. A page that navigates away before the event flushes loses one variant more than the other (the treatment renders slower, the tab closes mid-render, the event never lands).
  • Bot filtering on one variant only. Bots are deterministic — they always hash into the same bucket. If your bot filter runs after assignment but before the event hits Analytics Engine, the bot-heavy variant gets cut.
  • Caching at the edge. A page is cached without the variant in the cache key, and one variant is served from cache (no assignment event) while the other re-renders (assignment event fires).
  • Sticky bucketing leak. Logged-out users on a country page use IP-based identity; the IP changes per request behind a corporate NAT; one variant happens to be cached.

Treat SRM as a bug in the logging or assignment path, not as a stats problem to round away.

Write propagation latency

Edits in the dashboard write to D1, rebuild the KV blob, and explicitly purge the CDN edge cache. P99 end-to-end is under a second; P50 is ~150ms. The SDK reads the KV blob, which is cached at the nearest edge with an infinite TTL until purged; that's how the hot path stays sub-10ms.

The 1% case is when an edge node you're hitting hasn't seen the purge yet and serves the previous blob. For most flags, that's fine: a user evaluates one extra request against the old value, then the next evaluation has the new one. For killswitches in an active incident, "one extra request" might be a thousand emails sent or a million dollars charged.

For instant-kill semantics, evaluate server-side at request time and short-circuit on the killswitch before doing the dangerous thing:

import { gate } from "@shipeasy/sdk/server";

export async function POST(req: Request) {
  if (!(await gate("emails-enabled"))) {
    return new Response("paused", { status: 503 });
  }
  return sendEmail(await req.json());
}

The SDK has a small in-memory cache (default 10s) on top of the KV read. If you cannot tolerate even that, pass { maxAge: 0 } and pay one KV round-trip per call.
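To make the maxAge semantics concrete, here is a hypothetical read-through cache. It is not the SDK's implementation: createCachedReader, the millisecond unit, and the injectable clock are all assumptions made for illustration:

```typescript
type Clock = () => number;

// Hypothetical sketch: cache one value in memory, with a per-call maxAge override.
function createCachedReader<T>(
  fetchValue: () => Promise<T>, // stands in for the KV read
  defaultMaxAgeMs = 10_000,     // mirrors the 10s default mentioned above
  now: Clock = Date.now,
): (opts?: { maxAge?: number }) => Promise<T> {
  let cached: { value: T; fetchedAt: number } | undefined;

  return async function read(opts: { maxAge?: number } = {}): Promise<T> {
    const maxAge = opts.maxAge ?? defaultMaxAgeMs;
    // With maxAge = 0 this condition is never true, so every call hits KV.
    if (cached && now() - cached.fetchedAt < maxAge) {
      return cached.value;
    }
    const value = await fetchValue();
    cached = { value, fetchedAt: now() };
    return value;
  };
}

async function demo(): Promise<void> {
  let kvReads = 0;
  const readFlag = createCachedReader(async () => {
    kvReads++;
    return true;
  });
  await readFlag();              // miss: reads KV
  await readFlag();              // hit: served from memory
  await readFlag({ maxAge: 0 }); // forced: reads KV again
  console.log(kvReads);          // 2
}

demo();
```

The trade is the one the text describes: a fresher answer costs one extra round-trip on every call that opts out of the cache.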

SSR/CSR flicker

The default client-side gate() returns false on first render (no value yet), then the real value on hydration. On a page where the flag controls a visible component, that's a flash of the control variant before the treatment appears — exactly the flicker we don't want.

The fix is to evaluate on the server, ship the result down with the page, and hand it to the client SDK so the first client render already has the right value:

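// app/page.tsx (Server Component)
// Assumes config is exported alongside gate from the server entry point.
import { gate, config } from "@shipeasy/sdk/server";
import { ShipeasyProvider } from "@/components/shipeasy-provider";
// Hypothetical: your own page body and session helper.
import { PageContent } from "@/components/page-content";
import { getCurrentUserId } from "@/lib/session";

export default async function Page() {
  const userId = await getCurrentUserId();
  const flags = {
    "checkout-v2": await gate("checkout-v2", { userId }),
    "hero.title": await config<string>("hero.title"),
  };
  return (
    <ShipeasyProvider initial={flags}>
      {/* Render the page body here, not <Page /> itself, which would recurse. */}
      <PageContent />
    </ShipeasyProvider>
  );
}

// components/shipeasy-provider.tsx (Client Component)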
"use client";
import { shipeasy } from "@shipeasy/sdk/client";
import { useEffect } from "react";

export function ShipeasyProvider({ initial, children }) {
  useEffect(() => {
    shipeasy({
      apiKey: process.env.NEXT_PUBLIC_SHIPEASY_CLIENT_KEY!,
      initial,
    });
  }, [initial]);
  return children;
}

The first client render reads initial["checkout-v2"], which matches the SSR output, so there is no hydration mismatch. The SDK takes over for subsequent evaluations.

Fail-safe defaults

If KV is unreachable (network blip, cold start, transient edge issue), the SDK returns:

  • gate(name) → false
  • config(name) → last-known-good from the in-memory cache, else undefined
  • experiment(name).variant → "control" (the universe baseline)

That's the safe default for the common case, where "feature off" is benign. The dangerous case is when off is itself the failure mode: a gate like emails-enabled that falls back to false would silently stop all email every time KV blips. Override per call:

// "off" is dangerous here, so fall back to true on read failure
await gate("emails-enabled", ctx, { defaultValue: true });

// "treatment" is dangerous — pin to control on read failure
await experiment("paywall-test", ctx, { fallbackVariant: "control" });

Set defaults at the call site, not in a wrapper. The defaultValue is the contract of "what does this code do if Shipeasy is dead?" and it should be readable next to the code that depends on it.

More edge cases

The six below are the second wave — failure modes that bite once you've shipped a few experiments and a dozen gates, not on day one. One per primitive in this section's sub-menu.

Gate sprawl

After 12 months of shipping, you have 187 gates. Half are at 100% and have been for months. The code that reads them is dead conditionals — a branch the linter still respects but no human will ever take.

That's not just an aesthetic problem; it's a real cost:

  • Every if (await gate(...)) adds a layer of indirection in code review. New engineers spend time tracing why a 100% gate exists before realising it's dead.
  • Removed features that left their gates behind become Chesterton's-fence puzzles — was this rolled back? Is it still in flight? Nobody remembers.
  • Stale gates are still polled. The KV bundle isn't huge, but every gate's targeting rules and override list ship with every poll.

The Shipeasy Cleanup view surfaces candidates: gates at 100% or 0% for more than 30 days, with no targeting rule changes in 60 days. Run through it monthly. For each candidate, three options:

  1. Delete. The default. The gate is at 100% (or 0%), the code branch behind it is settled — rip out the conditional, delete the gate.
  2. Convert to a killswitch. If the gate is "at 100% but you want to keep the lever," promote it to a killswitch. Different semantics, different sub-menu, signals to the team it's incident-grade.
  3. Document why it stays. Rare — usually because of a planned future change. Add a note to the gate description so a future cleanup pass doesn't axe it by mistake.

Automate step 1 with the CLI:

# Every gate at 100% for >30 days
shipeasy flags list --json \
  | jq -r '.[] | select(.rolloutPct == 10000 and .stableForDays >= 30) | .name'

# Validate the codebase no longer references each one
shipeasy flags validate ./src --names dark-mode,checkout-v2,new-search

# Once green, delete
shipeasy flags delete dark-mode checkout-v2 new-search

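The validate step is non-negotiable. Deleting a gate that's still referenced in code won't crash anything (the SDK returns the defaultValue), but unless that default matches the gate's settled value, the branch silently flips: a gate parked at 100% falls back to false. And even when the default matches, you want the dead conditional gone too; otherwise you've just moved the technical debt.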
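The Cleanup criteria described earlier (at 100% or 0% for more than 30 days, no targeting rule changes in 60 days) amount to a simple filter. The sketch below assumes a hypothetical GateRecord shape with rollout in basis points; the real API's field names may differ:

```typescript
// Hypothetical record shape, for illustration only.
interface GateRecord {
  name: string;
  rolloutPct: number;     // assumed basis points: 0 to 10_000
  rolloutChangedAt: Date; // last time the rollout percentage changed
  rulesChangedAt: Date;   // last time targeting rules changed
}

const DAY_MS = 86_400_000;

function ageInDays(since: Date, now: Date): number {
  return (now.getTime() - since.getTime()) / DAY_MS;
}

// Gates pinned at 0% or 100% for over 30 days with no rule changes in 60 days.
function cleanupCandidates(gates: GateRecord[], now: Date = new Date()): string[] {
  return gates
    .filter((g) => {
      const settled = g.rolloutPct === 0 || g.rolloutPct === 10_000;
      return (
        settled &&
        ageInDays(g.rolloutChangedAt, now) > 30 &&
        ageInDays(g.rulesChangedAt, now) >= 60
      );
    })
    .map((g) => g.name);
}

const sample: GateRecord[] = [
  {
    name: "dark-mode",
    rolloutPct: 10_000,
    rolloutChangedAt: new Date("2026-01-10T00:00:00Z"),
    rulesChangedAt: new Date("2026-01-10T00:00:00Z"),
  },
];
console.log(cleanupCandidates(sample, new Date("2026-06-01T00:00:00Z")));
```

A filter like this only nominates candidates; whether a gate is deleted, promoted to a killswitch, or kept with a note is still a human call.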

Config schema drift

You stored a structured config — say a list of feature tiers with prices. Six months later, a new engineer adds a description field to the type, ships the code, and reads the config:

const tiers = await config<Tier[]>("plans.tiers");
return tiers.map((t) => (
  <Card key={t.id} title={t.name} desc={t.description}>...</Card>
));

The dashboard JSON still has the old shape — no description. In dev, the cards render with undefined where the description should be. In prod the same.

The cause: the value in the dashboard is decoupled from the shape in code. Adding a field to the TypeScript type doesn't add it to the dashboard. The SDK doesn't validate.

Two patterns to defend against this:

1. Validate on read

Use a schema (Zod, Valibot, runtypes) and validate every read. Fall back to a safe default if the parse fails, and log loudly so you find out about the drift:

import { z } from "zod";

const TierSchema = z.object({
  id: z.string(),
  name: z.string(),
  description: z.string().default(""),
  priceCents: z.number().int().nonnegative(),
});

const tiers = await config<unknown>("plans.tiers", { default: [] });
const parsed = z.array(TierSchema).safeParse(tiers);
if (!parsed.success) {
  reportToSentry("Config schema drift", { configName: "plans.tiers", issues: parsed.error.issues });
  return DEFAULT_TIERS;
}
return parsed.data;

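The schema defaults (description: z.string().default("")) absorb additive changes without breaking. A required field that goes missing from the stored value still fails the parse, which is what you want. Unknown keys are silently stripped by default; use Zod's strict() if you want those rejected too.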

2. Version the config name

When the shape changes incompatibly, rename:

// Old code path
const tiers = await config<TierV1[]>("plans.tiers.v1");
// New code path
const tiers = await config<TierV2[]>("plans.tiers.v2");

Two configs co-exist briefly while you migrate. Old shape readers keep working; new shape readers look at the new name. When the migration's done, delete the old config. The dashboard treats them as unrelated, so a typo in v2 doesn't break v1.

Use this for structural changes (renaming a field, changing a type). For additive changes, the schema-with-defaults pattern is enough.

Killswitches decay

A killswitch is only useful if it works during the incident. A killswitch that hasn't been flipped in 6 months probably still works — but you don't know.

Three failure modes silently break a killswitch:

  1. The wrapped code path was refactored. Someone moved the dangerous call to a new function and forgot the if (!(await gate(...))) wrapper. The killswitch still exists; flipping it does nothing.
  2. A new code path bypasses it. A worker, a batch job, or a webhook handler was added later that calls the same dangerous side effect without the wrapper. The killswitch covers the old path; the new path is naked.
  3. The webhook moved. The PagerDuty integration URL changed and no one updated the killswitch.flipped webhook, so flipping the killswitch no longer pages anyone.

The fix is rehearsal. Once a quarter, run a drill in staging:

# Flip in staging
shipeasy killswitch off emails-enabled --env staging --reason "drill 2026-Q2"

# Run the canonical test that exercises every email-sending code path
pnpm test:integration --grep email

# Confirm zero emails went out, then restore
shipeasy killswitch on emails-enabled --env staging --reason "drill complete"

If any test sent an email while the killswitch was off, you've found a naked code path. Fix before you need the killswitch in production.

The Shipeasy dashboard flags killswitches that haven't been flipped (in any env) in 90+ days with a yellow "stale" chip. Treat that chip as "drill overdue."

Peeking & sequential testing

You launched an experiment with a 14-day plan. On day 3 you peek and the p-value is 0.04. Tempting to call it a win and ship. Don't.

Repeated peeking on a fixed-horizon test inflates the false-positive rate dramatically. For equally spaced looks at a two-sided z-test:

Peeks during experiment   Nominal α   Effective false-positive rate
1 (planned end)           0.05        0.050
5 (weekly)                0.05        ~0.14
14 (daily)                0.05        ~0.22
30 (continuous)           0.05        ~0.28

So a "p < 0.05" call from daily peeking is closer to a 22% false-positive rate. You'd be shipping random noise roughly 1 time in 5.
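The inflation is easy to reproduce with a small Monte Carlo simulation. The sketch below assumes no true effect, equally sized batches of data between looks, and a two-sided z-test at 1.96; the exact figures depend on the number and spacing of looks, so treat its output as indicative rather than as the platform's numbers:

```typescript
// Small seeded PRNG (mulberry32) so runs are reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller transform: one standard normal draw.
function standardNormal(rand: () => number): number {
  const u1 = 1 - rand(); // in (0, 1], avoids log(0)
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Fraction of null experiments declared "significant" at any of `looks` peeks.
function falsePositiveRate(looks: number, trials: number, seed = 1, zCrit = 1.96): number {
  const rand = mulberry32(seed);
  let rejections = 0;
  for (let trial = 0; trial < trials; trial++) {
    let sum = 0;
    for (let k = 1; k <= looks; k++) {
      sum += standardNormal(rand);  // one more batch of data, no real effect
      const z = sum / Math.sqrt(k); // z-statistic on all data so far
      if (Math.abs(z) > zCrit) {
        rejections++;               // someone peeked and called it
        break;
      }
    }
  }
  return rejections / trials;
}

for (const looks of [1, 5, 14, 30]) {
  console.log(looks, falsePositiveRate(looks, 20_000, 42).toFixed(3));
}
```

Each trial is an A/A test. Any rejection is a false positive by construction, and the only thing that changes between rows is how often someone looked.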

Two honest options:

1. Pre-commit to a duration and don't peek

Set duration from your power calculation. Resist looking. The dashboard supports "show me only at T+plan" mode — enable it on an experiment to hide the running p-value entirely until the planned end:

shipeasy experiments update paywall-v2 --hide-interim-results

2. Use sequential testing

Sequential tests (mSPRT, group-sequential bounds) compute adjusted p-values that account for continuous looks. Peeking is fine; the math handles it.

shipeasy experiments update paywall-v2 --analysis-mode sequential

Sequential is less powerful at the planned horizon than fixed-horizon — you pay a small efficiency tax for the right to peek. If you don't need to peek, fixed-horizon is more efficient. If you do (e.g. safety-critical experiment where you want to stop the moment a regression is visible), sequential is the only honest choice.

Either way: don't peek under fixed-horizon and then claim significance. That's the most common methodological error in industry A/B testing and the easiest one to spot in a post-mortem.

Low-traffic experiments

You have 200 users a day. You want to test a checkout change. Baseline conversion is 4%. You want to detect a +10% relative lift.

Per the power table, at the usual two-sided α = 0.05 and 80% power, you need roughly 39,500 users per arm. At 100 users per arm per day (50/50 split), that's about 395 days, before you add a week of ramp-up and allow for weekly seasonality. Call it more than a year.
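As a cross-check on any power table, the standard normal-approximation formula for comparing two proportions fits in a few lines. This sketch assumes two-sided α = 0.05 and 80% power (the z constants below); the platform's table may use different assumptions, in which case its numbers will differ:

```typescript
// Per-arm sample size for detecting a relative lift on a conversion rate,
// using the two-proportion normal approximation.
function sampleSizePerArm(
  baselineRate: number,
  relativeLift: number,
  zAlpha = 1.959964, // two-sided alpha = 0.05
  zBeta = 0.841621,  // power = 0.80
): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta));
}

// Whole days needed at a given daily intake per arm.
function daysToRun(perArm: number, usersPerArmPerDay: number): number {
  return Math.ceil(perArm / usersPerArmPerDay);
}

const perArm = sampleSizePerArm(0.04, 0.10);
console.log(perArm, daysToRun(perArm, 100));
```

The required sample scales with the inverse square of the lift you want to detect, which is why the minimum detectable effect is the main lever on low traffic.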

Three failure modes when you ignore this:

  1. You stop at 14 days "because the lift looks neutral." Neutral underpowered is not the same as zero — your CI is [-25%, +25%]. You learned nothing.
  2. You see a "win" at day 7 and ship. That's noise (see Peeking above) or day-of-week effects.
  3. You stop at 30 days because the dashboard says "trending positive." Same noise problem.

Three things you can do with low traffic:

Loosen the MDE

You don't need to detect a 10% lift. At a 4% baseline, detecting a 25% lift requires ~6,700 users per arm (two-sided α = 0.05, 80% power), or about 67 days at 100 users per arm per day. A 25% lift is also a more practically interesting hypothesis: small improvements aren't worth the org cost of running experiments at scale.

shipeasy experiments power paywall-v2 \
  --metric purchase_conversion \
  --min-detectable-effect 0.25

CUPED for variance reduction

If you have a pre-experiment covariate correlated with the metric, CUPED can shave 10–30% off the variance, which is equivalent to roughly 11–43% more users (a variance cut of r is worth a factor of 1/(1 − r) in sample size).

shipeasy metrics create purchase_conversion \
  --type conversion --event purchase \
  --cuped-covariate previous_28d_purchase_rate
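For readers who want to see what the adjustment does, here is a sketch of the textbook CUPED estimator: θ = cov(X, Y) / var(X), and each adjusted value is Y − θ(X − mean(X)). The function name and return shape are hypothetical, and this is not necessarily how the platform computes it:

```typescript
function mean(xs: number[]): number {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

interface CupedResult {
  adjusted: number[];
  theta: number;
  varianceReduction: number; // fraction of variance removed, 0 to 1
}

// metric: per-user outcome during the experiment (Y)
// covariate: the same users' pre-experiment value (X)
function cupedAdjust(metric: number[], covariate: number[]): CupedResult {
  if (metric.length !== covariate.length || metric.length < 2) {
    throw new Error("metric and covariate must have the same length (at least 2)");
  }
  const meanX = mean(covariate);
  const meanY = mean(metric);
  let covXY = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < metric.length; i++) {
    const dx = covariate[i] - meanX;
    const dy = metric[i] - meanY;
    covXY += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  const theta = varX === 0 ? 0 : covXY / varX;
  // Subtract the part of Y explained by X; the mean of Y is unchanged.
  const adjusted = metric.map((y, i) => y - theta * (covariate[i] - meanX));
  const meanAdj = mean(adjusted);
  const varAdj = adjusted.reduce((acc, a) => acc + (a - meanAdj) ** 2, 0);
  const varianceReduction = varY === 0 ? 0 : 1 - varAdj / varY;
  return { adjusted, theta, varianceReduction };
}

console.log(cupedAdjust([2, 4, 5, 4, 5], [1, 2, 3, 4, 5]));
```

The variance removed equals the squared correlation between covariate and metric, so a weakly correlated covariate buys very little.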

Don't run an experiment

The third option is the right one for many low-traffic shops: ramp the change as a gate over two weeks, watch the dashboards for obvious regressions, and ship if nothing breaks. You forgo the rigorous "did this work" answer in exchange for actually getting features out. Sometimes that's the right trade.

What you should not do: run an underpowered experiment and tell the team "the result was neutral." It wasn't neutral; it was inconclusive. The dashboard tags this state as insufficient_power — read the tag.

Cross-env flag drift

You changed a gate in production. The same gate in staging is still on the old config. A week later, a new dev tests against staging, sees v1 behaviour, and ships code that depends on v1. Production breaks because production is on v2.

The cause is straightforward: gate config is per-environment, and there's no automatic sync.

Three patterns to keep envs aligned:

1. Mirror writes via the CLI

For everyday flag tweaks (rollout %, targeting rules), apply to both envs in the same script:

for env in staging production; do
  shipeasy flags update checkout-v2 --rollout 100 --env "$env"
done

Annoying. Worth it.

2. Use the env-aware diff tool

shipeasy flags diff --env-a staging --env-b production

Outputs a table of every flag whose config differs between the two envs, with the diff. Run it in CI before merging to main, or as a weekly cron that posts the diff to your team channel.
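If you would rather post-process exported configs yourself, the comparison is a key-order-insensitive deep diff. The sketch below assumes each environment has been exported to a plain object keyed by flag name; the EnvSnapshot shape and function names are hypothetical, not the CLI's output format:

```typescript
type FlagConfig = Record<string, unknown>;
type EnvSnapshot = Record<string, FlagConfig>;

// Serialize with sorted keys so property order never shows up as a diff.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return String(JSON.stringify(value));
}

interface FlagDiff {
  name: string;
  a: FlagConfig | undefined; // undefined means the flag is missing in that env
  b: FlagConfig | undefined;
}

// Every flag whose config differs between the two snapshots, sorted by name.
function diffEnvs(a: EnvSnapshot, b: EnvSnapshot): FlagDiff[] {
  const names = Array.from(new Set([...Object.keys(a), ...Object.keys(b)])).sort();
  return names
    .filter((name) => stableStringify(a[name]) !== stableStringify(b[name]))
    .map((name) => ({ name, a: a[name], b: b[name] }));
}

const staging: EnvSnapshot = { "checkout-v2": { rollout: 5000, rules: [] } };
const production: EnvSnapshot = { "checkout-v2": { rules: [], rollout: 10000 } };
console.log(diffEnvs(staging, production));
```

Sorting keys before comparing matters: two configs that differ only in property order are the same config and should not page anyone.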

3. Promote rather than re-create

For new flags, create in staging, qualify, then promote to prod rather than re-creating:

shipeasy flags promote checkout-v2 --from staging --to production

Copies the full config (rules, rollout, overrides, salt) atomically. The two envs match exactly after the promote; subsequent edits drift again unless you mirror.

The trade-off you're navigating: dev velocity (different states per env) vs. release safety (same behaviour everywhere). The right balance depends on whether your bugs tend to come from "flag config diverged" or "I couldn't test the new state in staging." Most teams over-correct one way or the other; pick consciously.
