Patterns

When a killswitch is the right answer — what to put behind one, fail-safe defaults, alerting, and the post-incident cleanup.

Production ready · 6 min read · Updated May 15, 2026 · Works with Server SDK

A killswitch is a flag with one job: be the lever you pull when something is on fire. That makes it the wrong tool for most things — most things don't need a 3am-grade pause button. This page is the decision guide: when to add one, what to put behind it, and how to keep them maintained.

What deserves a killswitch

The rule of thumb: if a junior engineer on-call would need to flip this in five seconds without reading the docs, it's a killswitch. Otherwise it's a feature flag.

Concrete examples that do deserve one:

Outbound side effects. Email, SMS, push notifications, webhook fan-out. If your vendor goes down or starts retrying spuriously, you want one switch to stop the volume.
Paid third-party calls. Anything that costs money per call, such as LLM inference, payment-provider webhooks, or fraud-scoring APIs. If the vendor's pricing changes unexpectedly or their service hangs, kill it.
Write-heavy background workers. Batch indexers, report generators, anything that mutates data in bulk. If they're producing bad output, pausing is cheaper than rolling back.
Risky new subsystems in the first month after launch. A new payment integration, a new recommender, a new auth path. You'll feel the absence of a killswitch in the first incident.

Examples that should not be killswitches:

Feature rollouts. That's what feature flags are for. A killswitch has no rollout percentage by design — you can't ramp it.
A/B variant selection. Use an experiment. A killswitch can't express "50% see A, 50% see B."
Permission checks. A killswitch is built to turn something off during an incident, and an authorization check is the one thing that must never be switched off.
Anything you might never flip. That's a config value or a feature flag, not a killswitch.

Fail-safe defaults

Every killswitch has a "what happens when the SDK can't reach us" value. Pick it deliberately:

Killswitch	Default	Reasoning
`emails-enabled`	`true` (on)	Email is a core feature. Outage of Shipeasy shouldn't break email.
`experimental-recommender`	`false` (off)	New, unproven path. Outage of Shipeasy should fall back to safe.
`expensive-ai-summary`	`false` (off)	Costs money per call. Fail closed to protect the budget.
`payment-fraud-check`	`true` (on)	Skipping fraud check costs more than dropping a sale. Stay strict.
`audit-log-redaction`	`true` (on)	Compliance feature. Defaulting to off would leak unredacted data.

The pattern: default to on for things whose absence creates a worse failure than their presence. If skipping the call is the dangerous outcome, default-on. If running the call is the dangerous outcome (cost, leakage, instability), default-off.

const flags = new Client(currentUser);

// Default-on: the system normally runs, and keeps running if no value is available.
const emailsEnabled = flags.getKillswitch("emails-enabled") ?? true;

// Default-off: the system stays suppressed unless explicitly turned on.
const recommenderEnabled = flags.getKillswitch("experimental-recommender") ?? false;

Express the default at the call site, not just in the dashboard config. A future reader can see the contract without leaving the file.
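One way to make the call-site default hard to forget is a small helper that takes the fallback as a required argument. This is only a sketch: `KillswitchReader` and `readKillswitch` are hypothetical names, not part of the SDK, and it assumes `getKillswitch()` returns `null` or `undefined` when no value is available, which is what the `??` usage above implies.

```typescript
// Hypothetical minimal shape of the client; only the method used here.
// Assumption: getKillswitch() yields null/undefined when no value is available.
interface KillswitchReader {
  getKillswitch(name: string): boolean | null | undefined;
}

// Read a killswitch with a mandatory, explicit fallback so the
// fail-safe contract is visible wherever the switch is consulted.
function readKillswitch(
  reader: KillswitchReader,
  name: string,
  fallback: boolean,
): boolean {
  try {
    return reader.getKillswitch(name) ?? fallback;
  } catch {
    // Treat a throwing client the same as an unreachable one.
    return fallback;
  }
}

// Stand-in client with no values at all, simulating an outage.
const unreachable: KillswitchReader = { getKillswitch: () => undefined };

const emailsEnabled = readKillswitch(unreachable, "emails-enabled", true);
const recommenderEnabled = readKillswitch(
  unreachable,
  "experimental-recommender",
  false,
);
```

Because the fallback is a required parameter, a call that omits it fails to compile, so the default cannot silently live only in the dashboard.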

Audit + alerting

A killswitch flip is an incident-grade event. Two things should happen automatically when one flips:

A row gets written to the audit log. Who flipped it, when, from what surface (dashboard, CLI, API), and the reason string they passed.
A webhook fires. Wire it to PagerDuty, Opsgenie, or your incident channel. The flip should declare the incident, not require someone to remember to.

Wire the `killswitch.flipped` event through Feedback → Connectors in the dashboard. Pick the GitHub, Slack, or PagerDuty connector (depending on severity) and configure the connector filter to fire only when the new value is true (kill activated). Scope by name in the same filter when you want PagerDuty for the payment killswitch but only Slack for the `experimental-` ones.
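The routing rule can be written out as a pure function to make the intent concrete. The real filter is configured in the dashboard, so this is purely illustrative: the `FlipEvent` fields, the connector names, and the name prefixes are assumptions, not the actual event payload.

```typescript
// Hypothetical event shape, mirroring the audit fields listed above.
interface FlipEvent {
  name: string; // killswitch name
  newValue: boolean; // per the filter described above, true = kill activated
  actor: string; // who flipped it
  surface: "dashboard" | "cli" | "api";
  reason: string;
  flippedAt: string; // ISO 8601 timestamp
}

type Connector = "pagerduty" | "slack";

// Decide which connectors should fire for a given flip.
function routeFlip(event: FlipEvent): Connector[] {
  // Only alert when the kill is activated, not when it is restored.
  if (!event.newValue) return [];
  // Experimental systems notify the channel without paging anyone.
  if (event.name.startsWith("experimental-")) return ["slack"];
  // Payment-related switches page on-call and notify the channel.
  if (event.name.startsWith("payment")) return ["pagerduty", "slack"];
  // Everything else goes to the incident channel.
  return ["slack"];
}
```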

Naming

Killswitch names appear in incident timelines and post-mortems. Pick names that read naturally in that context:

✓ emails-enabled, payments-enabled, recommender-enabled — reads as "is this system on?"
✗ email-killswitch, payment-flag, disable-recs — reads awkwardly, the verb-noun shape hides whether true or false is "running."

The convention: positive-sense, <system>-enabled. Then getKillswitch("emails-enabled") returning true reads as "yes, emails are enabled."
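If you want to enforce the convention in CI or in a code review bot, a name check is enough. The sketch below is an assumption about how you might do that; `isConventionalName` is a hypothetical helper, and the pattern only checks the shape of the name, not whether it actually reads in the positive sense.

```typescript
// Lowercase words separated by single hyphens, ending in "-enabled".
const KILLSWITCH_NAME = /^[a-z0-9]+(?:-[a-z0-9]+)*-enabled$/;

// Returns true when the name follows the <system>-enabled shape.
function isConventionalName(name: string): boolean {
  return KILLSWITCH_NAME.test(name);
}

// Collect the names that would need renaming.
function nonConformingNames(names: string[]): string[] {
  return names.filter((name) => !isConventionalName(name));
}
```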

Post-incident cleanup

Say a killswitch was flipped during an incident, the team fixed the underlying bug, and the switch is back on. Two cleanup steps often get skipped:

Verify the killswitch still works. A flipped-then-restored killswitch sometimes drifts: maybe someone removed the in-code killswitch check during the fix, or the webhook moved. Rehearse the flip in staging within a week.
Update the runbook. What was the trigger for this flip? Add it to the on-call runbook so the next engineer doesn't have to figure it out from scratch. The audit log entry is the raw material — turn it into a sentence in the runbook.

If a killswitch hasn't been flipped (even in staging) in 6 months, it might still work — but you don't know. Re-rehearse.
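The six-month rule can be checked mechanically if you can export each switch's last flip time, for example from the audit log. The sketch below assumes such an export exists; `LastFlip` and `staleKillswitches` are hypothetical names, and nothing here is a Shipeasy API.

```typescript
// Hypothetical record built from an audit log export.
interface LastFlip {
  name: string;
  lastFlippedAt: Date | null; // null = never flipped in any environment
}

// Return the names of switches not flipped within the last `months` months.
function staleKillswitches(
  records: LastFlip[],
  now: Date,
  months = 6,
): string[] {
  // Calendar-month arithmetic; day overflow (e.g. the 31st) rolls forward,
  // which is precise enough for a rehearsal reminder.
  const cutoff = new Date(now.getTime());
  cutoff.setUTCMonth(cutoff.getUTCMonth() - months);

  return records
    .filter(
      (r) =>
        r.lastFlippedAt === null ||
        r.lastFlippedAt.getTime() < cutoff.getTime(),
    )
    .map((r) => r.name);
}
```

Passing `now` in as a parameter keeps the check deterministic, so it can run in a scheduled job and in a unit test without clock mocking.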

Killswitches in your tests

A killswitch wraps a real side effect. Your tests almost certainly don't want to actually send emails, but they also shouldn't bypass the killswitch path in a way that hides bugs in the wrapper itself.

The cleanest pattern: in test environments, run the SDK in bundle-mock mode with the killswitch set to whatever the test scenario needs:

test/setup.ts

import { configureForOffline } from "@shipeasy/sdk/server";

configureForOffline({
  snapshot: {
    flags: {
      version: "test",
      plan: "free",
      gates: {},
      configs: {},
      // `killed: 1` = the switch is thrown, so getKillswitch() returns true
      killswitches: { emails: { killed: 1 }, payments: { killed: 0 } },
    },
    experiments: { version: "test", universes: {}, experiments: {} },
  },
});

The SDK never hits the network in this mode, the killswitch values are deterministic, and your test reads through the same new Client(user).getKillswitch() call your production code does — no special-case bypass. configureForOffline() replaces the active configuration, so a suite can re-seed between cases.
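A test for the wrapper itself might look like the following sketch. To stay self-contained it uses a hand-written fake in place of the SDK client; `sendWelcomeEmail`, `KillswitchReader`, `Sender`, and `runScenario` are hypothetical names, and it assumes the positive-sense convention from the Naming section, where `true` means the system runs. In a real suite the reader would be a `Client` seeded through `configureForOffline()`.

```typescript
// Hypothetical minimal shape of the client; only the method used here.
interface KillswitchReader {
  getKillswitch(name: string): boolean | null | undefined;
}

type Sender = (to: string) => void;

// Wrapper under test: sends only when the switch says emails are enabled.
function sendWelcomeEmail(
  flags: KillswitchReader,
  send: Sender,
  to: string,
): boolean {
  const enabled = flags.getKillswitch("emails-enabled") ?? true;
  if (!enabled) return false;
  send(to);
  return true;
}

// Build a fake reader with fixed values; missing names read as undefined.
function fakeFlags(values: Record<string, boolean>): KillswitchReader {
  return { getKillswitch: (name) => values[name] };
}

// Run the wrapper against a recording sender and return what was sent.
function runScenario(values: Record<string, boolean>): string[] {
  const sent: string[] = [];
  sendWelcomeEmail(fakeFlags(values), (to) => sent.push(to), "a@example.com");
  return sent;
}
```

Covering three scenarios (switch on, switch off, no value) exercises both branches of the wrapper plus its fail-safe default, without sending a real email.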

The dashboard view

The Killswitches page in the dashboard shows the current state of every killswitch in your project, who last touched it, and a one-click flip control with a required reason field. Keep this tab pinned during incidents — it's the central console.

Hovering on a killswitch shows its recent flip history. If a switch has been flipped 5 times in the last week, that's a signal that the underlying system needs a real fix, not just a kill button.
