
Disaster Recovery & Attack Prevention

This page covers what to do when things go wrong — and how the MPC architecture and layered encryption policies make most attacks structurally infeasible.


Emergency Private Key Recovery

Recovery of the raw private key is a last-resort operation used only for migration to a new system, legal holds, or catastrophic infrastructure loss. It requires quorum approval and is irreversible if deleteAfterRecover: true is set.

import { WorkspaceClient, ComponentModule } from 'caller-sdk';

const workspace = new WorkspaceClient({ apiKey: process.env.WR_API_KEY! });

async function emergencyRecoverPrivateKey(
  keyId: string,
  quorumApprovers: string[], // logged for audit — minimum 2
): Promise<string> {
  if (quorumApprovers.length < 2) {
    throw new Error('Recovery requires at least 2 quorum approvers');
  }

  // Log BEFORE proceeding — the record must exist even if recovery fails.
  // `audit` is your immutable audit-log client (SIEM / CloudTrail wrapper).
  await audit.log({
    event: 'PRIVATE_KEY_RECOVERY_INITIATED',
    keyId,
    approvers: quorumApprovers,
    timestamp: new Date().toISOString(),
  });

  // Combines threshold shares from 2+ nodes to reconstruct the raw private key.
  // This is the ONLY White Rabbit operation that produces plaintext key material.
  const { privateKey } = await workspace
    .call(ComponentModule.RECOVER_PRIVATE_KEY, {
      keyId,
      // deleteAfterRecover: true — irreversible; only set once you are certain
      // the node shares are no longer needed.
    })
    .promise();

  await audit.log({
    event: 'PRIVATE_KEY_RECOVERY_COMPLETED',
    keyId,
    approvers: quorumApprovers,
    timestamp: new Date().toISOString(),
  });

  // Transfer to HSM immediately — never log, never write to disk in plaintext.
  return privateKey;
}
RECOVER_PRIVATE_KEY exposes raw key material

The privateKey in the output is a plaintext hex-encoded private key. Handle it accordingly:

  • Never log it — exclude from structured logging pipelines
  • Never write it to disk in plaintext — transfer directly into an HSM or re-derive wallets from it and destroy the value immediately
  • Treat the run stage log as sensitive — the output appears in White Rabbit run logs; restrict access via RBAC
  • Set deleteAfterRecover: true only once you are certain the node shares are no longer needed
Institutional policy — pre-conditions for recovery

All of the following must be satisfied before executing:

  1. Written approval from 2-of-3 Custodians (signed email or DocuSign)
  2. Legal sign-off if driven by a regulatory request
  3. A designated receiving HSM ready to import the key within 60 seconds
  4. The full event recorded in an immutable audit log (SIEM / CloudTrail)

Abort if any condition is unmet.
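
The four pre-conditions above can be enforced as a pre-flight guard before the recovery call. A minimal sketch; the RecoveryRequest shape and its field names are illustrative, not SDK types:

```typescript
// Pre-flight guard mirroring the institutional pre-conditions for recovery.
// The RecoveryRequest shape is illustrative, not an SDK type.
interface RecoveryRequest {
  custodianApprovals: string[]; // signed written approvals on file
  regulatoryRequest: boolean;
  legalSignOff: boolean;        // required when regulatoryRequest is true
  hsmReady: boolean;            // receiving HSM can import within 60 seconds
  auditLogConfirmed: boolean;   // immutable SIEM / CloudTrail record exists
}

function assertRecoveryPreconditions(req: RecoveryRequest): void {
  if (req.custodianApprovals.length < 2) {
    throw new Error('Abort: need written approval from 2-of-3 Custodians');
  }
  if (req.regulatoryRequest && !req.legalSignOff) {
    throw new Error('Abort: legal sign-off required for regulatory-driven recovery');
  }
  if (!req.hsmReady) {
    throw new Error('Abort: receiving HSM is not ready to import within 60 seconds');
  }
  if (!req.auditLogConfirmed) {
    throw new Error('Abort: immutable audit log record must exist first');
  }
}
```

Call this before emergencyRecoverPrivateKey so an unmet condition aborts the run rather than being discovered mid-recovery.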


Disaster Recovery Playbook

A disaster is any event that prevents the system from signing transactions: node loss, shard loss, compromise, or infrastructure failure.

Scenario 1 — Single MPC Node Goes Offline

With a 2-of-3 threshold, one node offline does not break signing. The two remaining nodes still produce valid signatures. This is a degraded state — not an emergency — but you must restore before another failure occurs.

Node A ✓             Node B ✗ (offline)             Node C ✓
   │                                                    │
   └──────────────────── 2-of-3 signing ────────────────┘  ← still operational

Response:

  1. Confirm nodes A and C are reachable and signing normally.
  2. Provision a replacement node.
  3. Any custodian imports their shard to the new node using restoreShardToNode (Export, Rotation & Restore).
  4. Verify the new keyId matches rootPublicKey and signing works end-to-end.
  5. Clear the old keyId record from your node registry.
Threshold is specifically designed for this

Losing one node is not an emergency — it is why you chose 2-of-3. The ceremony is not repeated; no key material is lost. You simply restore one shard to a new node.
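
Step 4 of the response can be scripted. A minimal sketch of the post-restore check; the NodeStatus shape is a hypothetical stand-in for whatever your node registry reports, not an SDK type:

```typescript
// Hypothetical shape for step 4's verification: the restored node must report
// the expected keyId and rootPublicKey before it re-enters the signing pool.
interface NodeStatus {
  keyId: string;
  rootPublicKey: string;
  online: boolean;
}

function verifyRestoredNode(
  restored: NodeStatus,
  expected: { keyId: string; rootPublicKey: string },
): boolean {
  return (
    restored.online &&
    restored.keyId === expected.keyId &&
    restored.rootPublicKey === expected.rootPublicKey
  );
}
```

Only after this check passes should you clear the old keyId record from your node registry (step 5), so a failed restore never leaves you without a known-good record.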


Scenario 2 — Loss of One Custodian Shard

A custodian loses access to their shard (forgotten passphrase, hardware failure, departed employee). You still have 2 shards from other custodians.

Response:

  1. Do not use RECOVER_PRIVATE_KEY — you still have quorum.
  2. Generate a new age identity for the replacement custodian (locally, per Custodian Setup).
  3. Use a remaining custodian's shard to localRewrapShard for the new custodian.
  4. Update your custodian registry.
Do not recover for a single lost shard

RECOVER_PRIVATE_KEY exposes the raw key. Use it only when you cannot sign at all. Losing one shard is a rotation event, not a disaster.


Scenario 3 — Loss of Two Custodian Shards (Catastrophic)

Two custodians are simultaneously unavailable. You are below threshold and cannot sign.

Response:

  1. Convene all available custodians and legal.
  2. SSS guardians for any remaining custodian reconstruct that custodian's age key (reconstructAgeKey from Custodian Setup).
  3. Restore the surviving shard to a temporary node (restoreShardToNode).
  4. You now have one node live — still below the 2-of-3 threshold. You need a second shard.
  5. If a second SSS reconstruction can be done for another custodian: restore a second shard. You are now at threshold and can sign.
  6. Re-export shards to newly onboarded custodians and rotate.
Why SSS guardians are critical here

The SSS guardian network is specifically designed for this scenario — when a custodian themselves is unavailable. M-of-N guardians can reconstruct the custodian's age key, which unlocks the shard, which allows the node to be restored. Without SSS, loss of two custodians means permanent loss.
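
The guardian reconstruction in step 2 relies on standard Shamir secret sharing. The sketch below is a generic Shamir implementation over a prime field, not White Rabbit's internal code, illustrating why any M of N guardian shares suffice while fewer reveal nothing:

```typescript
// Generic Shamir secret-sharing sketch (illustrative, not White Rabbit's code).
const P = 2n ** 127n - 1n; // a Mersenne prime; the secret must be below P

const mod = (a: bigint): bigint => ((a % P) + P) % P;

// Modular inverse via Fermat's little theorem (valid because P is prime)
function inv(a: bigint): bigint {
  let result = 1n;
  let base = mod(a);
  let e = P - 2n;
  while (e > 0n) {
    if (e & 1n) result = mod(result * base);
    base = mod(base * base);
    e >>= 1n;
  }
  return result;
}

// Split `secret` into n shares; any m of them reconstruct it
function split(secret: bigint, m: number, n: number): Array<[bigint, bigint]> {
  const coeffs = [secret]; // degree m-1 polynomial with the secret at x = 0
  for (let i = 1; i < m; i++) {
    coeffs.push(BigInt(Math.floor(Math.random() * 1e15)));
  }
  return Array.from({ length: n }, (_, i) => {
    const x = BigInt(i + 1);
    let y = 0n;
    let xp = 1n;
    for (const c of coeffs) {
      y = mod(y + c * xp);
      xp = mod(xp * x);
    }
    return [x, y] as [bigint, bigint];
  });
}

// Reconstruct the secret from any m shares (Lagrange interpolation at x = 0)
function reconstruct(shares: Array<[bigint, bigint]>): bigint {
  let secret = 0n;
  for (const [xi, yi] of shares) {
    let num = 1n;
    let den = 1n;
    for (const [xj] of shares) {
      if (xj === xi) continue;
      num = mod(num * -xj);
      den = mod(den * (xi - xj));
    }
    secret = mod(secret + yi * num * inv(den));
  }
  return secret;
}

// 3-of-5, matching the recommended guardian policy
const shares = split(123456789n, 3, 5);
const recovered = reconstruct(shares.slice(0, 3)); // any 3 shares work
```

Fewer than M shares constrain the polynomial to nothing: every candidate secret remains equally consistent, which is why guardian unavailability below threshold is safe while M guardians can always recover.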


Scenario 4 — Suspected Age Key or Shard Compromise

A custodian's device is breached, or you suspect their passphrase was exposed.

Response — assess blast radius first:

  • If only the shard file was copied: still protected by age encryption + passphrase. Rotate as a precaution.
  • If both the shard file and the passphrase were obtained: treat as full shard compromise. Move assets immediately.

Rotation (local, no API private key exposure):

// Run on a trusted machine, not the potentially compromised device
await localRewrapShard(
  './custodian-a-shard.json',
  './custodian-a-identity.json',
  '<exposed-passphrase>',
  { name: 'custodian-a-new', agePublicKey: '<new-A-public-key>' },
);
// Old wrappedKeyShare is now cryptographically worthless to the attacker —
// it was encrypted to the old age identity, which is now invalidated.
Why local re-wrapping neutralizes a leaked shard

If an attacker has wrappedKeyShare + the old age private key, they can decrypt the shard. After local re-wrapping to a new identity, the old blob is obsolete — decrypting it yields nothing because it no longer maps to an active wrapping. Speed matters: the window is between the attacker decrypting the shard and you completing rotation.


Scenario 5 — Full Infrastructure Loss

Both the MPC node network and all but one custodian shard are lost simultaneously. You cannot sign. This is an extreme scenario.

Response:

  1. Convene quorum + legal immediately.
  2. SSS guardians reconstruct the surviving custodian's age key.
  3. Restore the surviving shard to a temporary node.
  4. If a second shard can be reconstructed from SSS: import it. You reach threshold.
  5. Call RECOVER_PRIVATE_KEY with deleteAfterRecover: false initially.
  6. Import the raw key into an HSM within 60 seconds.
  7. Move all assets to a newly generated MPC key.
  8. Decommission the recovered key — never use a recovered raw key for ongoing operations.
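
Steps 5 and 6 can be wrapped in a timing guard so a slow HSM import fails loudly instead of silently exceeding the window. A hedged sketch; importToHsm is a stand-in for your HSM client's import call, not an SDK function:

```typescript
// Hypothetical wrapper for steps 5 and 6: reject if the HSM import does not
// complete inside the 60-second window.
async function importWithinWindow(
  privateKeyHex: string,
  importToHsm: (hex: string) => Promise<void>,
  windowMs = 60_000,
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error('HSM import exceeded the 60-second window')),
      windowMs,
    );
  });
  try {
    await Promise.race([importToHsm(privateKeyHex), deadline]);
  } finally {
    if (timer) clearTimeout(timer); // always cancel the deadline timer
  }
}
```

On rejection, treat the recovery as failed, record it in the audit log, and restart the procedure rather than retrying with the key still in process memory.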
A recovered raw key is a bridge, not an operating mode

A raw private key in an HSM is significantly less secure than a 2-of-3 MPC key. Use recovery to move assets to safety, then generate a fresh key immediately.


Attack Prevention Policy

Understanding how attacks happen — and why the MPC architecture defeats them — helps teams build with the right threat model.

Attack Surface Map

Your application ──► WR_API_KEY ──► White Rabbit API ──► Signing request
                                          │
                      ┌───────────────────┼───────────────────┐
                      ▼                   ▼                   ▼
                 ┌─────────┐         ┌─────────┐         ┌─────────┐
                 │ Node 1  │         │ Node 2  │         │ Node 3  │
                 │ (empty) │         │ (empty) │         │ (empty) │
                 └─────────┘         └─────────┘         └─────────┘
                      │     JIT import only during signing sessions
                      └──────────── threshold signature ──────────┘

Threat 1 — API Key Theft

An attacker steals your WR_API_KEY and can call signing operations on your behalf.

Mitigations:

  • Rotate the API key immediately on suspected exposure (Dashboard → API Keys → Revoke)
  • Use separate keys per environment — a leaked dev key cannot touch prod
  • Load keys from a secret manager (AWS Secrets Manager, Doppler, HashiCorp Vault), never .env files
  • Monitor for anomalous signing volume or unknown destination addresses
// Load from a secret manager — never from process.env in production
const apiKey = await secretsManager.getSecretValue('prod/whiterabbit/api-key');
const workspace = new WorkspaceClient({ apiKey });

Threat 2 — Single MPC Node Compromise

An attacker gains access to a node's stored key share.

Why it fails: With nodes empty at rest (JIT model), there is no key material on the node to steal. If a node is compromised during a signing session, the attacker has one share — useless alone. Threshold signing requires coordinated computation across ≥ 2 nodes simultaneously.

Mitigations:

  • Nodes empty at rest (deleteAfterExport: true) — compromise yields nothing
  • Use all 3 official nodes — reduces single-node value to zero
  • Monitor node health; anomalous latency or unexpected disconnects should alert

Threat 3 — Custodian Shard Theft

An attacker copies a custodian's shard file.

The layers an attacker must break:

wrappedKeyShare (stolen file)
        │
        ▼  age-encrypted (inner layer)
           requires custodian's age private key
        │
        ▼  envelopeEncrypt (outer layer)
           requires custodian's passphrase
        │
        ▼  PBKDF2 600K iterations + AES-256-GCM
           brute force is computationally infeasible

Even fully breaking this only yields one shard — useless below threshold. The attacker needs two custodians' shards simultaneously.
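
The outer envelope layer can be sketched with Node's built-in crypto using the parameters named above (PBKDF2 at 600K iterations, AES-256-GCM). This is an illustration of the construction, not White Rabbit's exact on-disk format; the SHA-256 PRF and the salt/IV sizes are assumptions:

```typescript
import { pbkdf2Sync, randomBytes, createCipheriv, createDecipheriv } from 'node:crypto';

// Sketch of a passphrase envelope: PBKDF2-SHA256 at 600K iterations derives
// the AES-256-GCM key; the GCM tag makes tampering detectable.
function envelopeEncrypt(plaintext: Buffer, passphrase: string) {
  const salt = randomBytes(16);
  const iv = randomBytes(12); // 96-bit nonce, standard for GCM
  const key = pbkdf2Sync(passphrase, salt, 600_000, 32, 'sha256');
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { salt, iv, ciphertext, tag: cipher.getAuthTag() };
}

function envelopeDecrypt(
  env: ReturnType<typeof envelopeEncrypt>,
  passphrase: string,
): Buffer {
  const key = pbkdf2Sync(passphrase, env.salt, 600_000, 32, 'sha256');
  const decipher = createDecipheriv('aes-256-gcm', key, env.iv);
  decipher.setAuthTag(env.tag); // final() throws if the tag does not verify
  return Buffer.concat([decipher.update(env.ciphertext), decipher.final()]);
}
```

Each passphrase guess costs 600,000 PBKDF2 iterations, which is what pushes offline brute force into infeasibility at diceware-level passphrase entropy.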

Mitigations:

  • Store shard files and identity files in separate physical locations
  • Require a hardware token (YubiKey) as a second factor for any shard access
  • SSS means that even the age private key is not held by a single person

Threat 4 — Insider Threat

A malicious employee with API key access tries to sign unauthorized transactions.

Mitigations:

  • Transaction allow-lists — only pre-approved addresses can be recipients
  • Value-based approval gates — transfers above a threshold require a second approval
  • Time-locks on large transfers — submit intent, wait 24h, then execute (window to detect and cancel)
  • Key separation — trading key, treasury key, and hot wallet are separate MPC keys
  • Log signing requests with the authenticated user's identity, not just the API key
async function signWithApproval(params: SignParams) {
  if (params.valueUsd > 10_000) {
    const approved = await approvalService.requestApproval({
      requester: params.requesterId,
      action: 'SIGN_TRANSACTION',
      value: params.valueUsd,
      destination: params.to,
    });
    if (!approved) throw new Error('Approval denied or timed out');
  }
  return workspace.call(ComponentModule.SIGN_WITH_KEY_SHARE, params.signParams).promise();
}

Threat 5 — Transaction Replay / Front-Running

A signed transaction is replayed on another chain, or an attacker observes a pending transaction and front-runs it.

Mitigations:

  • Always include chainId in transactions (EIP-155) — prevents cross-chain replay
  • Manage nonces carefully in multi-pod deployments — duplicate nonces cause one of the transactions to be dropped
  • Use a private mempool (Flashbots eth_sendBundle) for MEV-sensitive operations
  • EIP-712 typed-data (Permit2, Safe signatures) embeds chainId + contract address — cross-chain replay is structurally impossible
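
The chainId rule from the first bullet can be enforced before any signing call. A minimal sketch; the EvmTx shape is illustrative, not an SDK type:

```typescript
// Minimal guard for the EIP-155 rule: refuse to sign any transaction without
// an explicit chainId, since an unset chainId lets the signature replay on
// other chains. The EvmTx shape is illustrative.
interface EvmTx {
  to: string;
  value: bigint;
  nonce: number;
  chainId?: number; // EIP-155: must be set to the target chain
}

function assertReplaySafe(tx: EvmTx): void {
  if (tx.chainId === undefined || tx.chainId <= 0) {
    throw new Error('Refusing to sign: transaction is missing an EIP-155 chainId');
  }
}
```

Run this guard inside the signing service itself, not only in callers, so no code path can submit a replayable transaction.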

Threat 6 — Supply Chain Attack

A malicious dependency update injects code that exfiltrates key material or signs unauthorized transactions.

Mitigations:

  • Pin exact dependency versions (package-lock.json / yarn.lock) and audit all upgrades
  • Run npm audit in CI; fail on critical vulnerabilities
  • Use a private npm registry with curated, approved packages (Artifactory, Verdaccio)
  • Sign and verify CI build artifacts (GitHub Actions OIDC + SLSA provenance)
  • Never run RECOVER_PRIVATE_KEY in the same process as untrusted third-party code
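
The version-pinning rule can be spot-checked in CI. A minimal heuristic that flags dependencies declared with semver ranges instead of exact versions; it deliberately does not cover every range syntax npm accepts:

```typescript
// Flag dependencies declared with a semver range (^, ~, wildcards,
// comparators, hyphen or || ranges) instead of an exact pinned version.
function findUnpinned(deps: Record<string, string>): string[] {
  const rangePattern = /^[~^]|[<>*x]|\s-\s|\|\|/;
  return Object.entries(deps)
    .filter(([, version]) => rangePattern.test(version))
    .map(([name]) => name);
}
```

Feed it the dependencies object from package.json and fail the build if it returns anything; the lockfile then stays the single source of truth for upgrades.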

Institutional Policy Summary

Concern                   Recommendation
Age key generation        age-keygen locally on each custodian's device — never via the API
Age key protection        AES-256-GCM / PBKDF2 600K passphrase before writing to disk
Age key backup            SSS 3-of-5; each share encrypted with a separate guardian passphrase
MPC node policy           Nodes empty at rest; import only to sign, delete immediately after
Re-wrapping               Always local using the age-encryption npm package — never REWRAPPING_KEY_SHARE
keyId storage             Primary DB + read replica + encrypted off-site file + printed copy
Shard storage             3 custodians, geographically separated, 3-2-1 rule
Passphrase strength       ≥ 24-character diceware or hardware token (YubiKey FIDO2)
Rotation cadence          Every 90 days or on any custodian/guardian change
Recovery authorization    2-of-3 custodian written approval + legal sign-off
Recovery destination      Certified HSM (AWS CloudHSM, Thales Luna) within 60 seconds
Audit logging             Immutable log (CloudTrail, SIEM) for all key operations
Access control            Key users cannot export; custodians cannot call signing endpoints
Disaster recovery drill   Full restore-from-backup test every quarter

Pre-Launch Policy Checklist

Generation & Backup

  • Key generated: threshold: 2, all 3 official servers
  • keyId in primary DB, read replica, and printed off-site backup
  • rootPublicKey verified on-chain against a derivation path
  • 3 custodian age identities generated locally — not via API
  • Each age key split with SSS (3-of-5 recommended); shares distributed
  • 3 shards exported with deleteAfterExport: true — nodes verified empty
  • Restore drill completed: imported shard to test node, signing verified

Access Control

  • Production API key is fresh (not reused from development)
  • API keys loaded from a secret manager, not from .env files
  • Signing service account cannot export or recover keys
  • Custodians cannot call signing endpoints directly
  • Transaction allow-list configured for all signing operations

Monitoring & Response

  • Audit log for all key operations (generation, export, import, sign, recover)
  • Alerts set: anomalous signing volume, new destination addresses, off-hours signing
  • Incident runbook documented and rehearsed for all 5 disaster scenarios
  • Emergency contact list current: custodians, SSS guardians, legal, HSM provider

Ongoing Operations

  • Rotation scheduled every 90 days in the team calendar
  • Quarterly recovery drill scheduled
  • Custodian offboarding procedure documented and tested
  • Dependency audit running in CI (npm audit)