
Disaster Recovery & Attack Prevention

This page covers what to do when things go wrong — and how the MPC architecture and layered encryption policies make most attacks structurally infeasible.


Emergency Private Key Recovery

Recovery of the raw private key is a last-resort operation used only for migration to a new system, legal holds, or catastrophic infrastructure loss. It requires quorum approval and is irreversible if deleteAfterRecover: true is set.

import { WorkspaceClient, ComponentModule } from 'caller-sdk';

const workspace = new WorkspaceClient({ apiKey: process.env.WR_API_KEY! });

async function emergencyRecoverPrivateKey(
  keyId: string,
  quorumApprovers: string[], // logged for audit — minimum 2
): Promise<string> {
  if (quorumApprovers.length < 2) {
    throw new Error('Recovery requires at least 2 quorum approvers');
  }

  // Log BEFORE proceeding — the record must exist even if recovery fails.
  // `audit` is your immutable audit-log client (SIEM / CloudTrail wrapper).
  await audit.log({
    event: 'PRIVATE_KEY_RECOVERY_INITIATED',
    keyId,
    approvers: quorumApprovers,
    timestamp: new Date().toISOString(),
  });

  // Combines threshold shares from 2+ nodes to reconstruct the raw private key.
  // This is the ONLY White Rabbit operation that produces plaintext key material.
  const { privateKey } = await workspace
    .call(ComponentModule.RECOVER_PRIVATE_KEY, {
      keyId,
      // deleteAfterRecover: true — irreversible; only set once you are certain
      // the node shares are no longer needed.
    })
    .promise();

  await audit.log({
    event: 'PRIVATE_KEY_RECOVERY_COMPLETED',
    keyId,
    approvers: quorumApprovers,
    timestamp: new Date().toISOString(),
  });

  // Transfer to HSM immediately — never log, never write to disk in plaintext.
  return privateKey;
}
RECOVER_PRIVATE_KEY exposes raw key material

The privateKey in the output is a plaintext hex-encoded private key. Handle it accordingly:

  • Never log it — exclude from structured logging pipelines
  • Never write it to disk in plaintext — transfer directly into an HSM or re-derive wallets from it and destroy the value immediately
  • Treat the run stage log as sensitive — the output appears in White Rabbit run logs; restrict access via RBAC
  • Set deleteAfterRecover: true only once you are certain the node shares are no longer needed
Institutional policy — pre-conditions for recovery

All of the following must be satisfied before executing:

  1. Written approval from 2-of-3 Custodians (signed email or DocuSign)
  2. Legal sign-off if driven by a regulatory request
  3. A designated receiving HSM ready to import the key within 60 seconds
  4. The full event recorded in an immutable audit log (SIEM / CloudTrail)

Abort if any condition is unmet.
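
The four pre-conditions above can be enforced as a pre-flight guard before the recovery call. A minimal sketch; the RecoveryRequest shape and its field names are illustrative, not SDK types:

```typescript
// Pre-flight guard mirroring the institutional pre-conditions for recovery.
// The RecoveryRequest shape is illustrative, not an SDK type.
interface RecoveryRequest {
  custodianApprovals: string[]; // signed written approvals on file
  regulatoryRequest: boolean;
  legalSignOff: boolean;        // required when regulatoryRequest is true
  hsmReady: boolean;            // receiving HSM can import within 60 seconds
  auditLogConfirmed: boolean;   // immutable SIEM / CloudTrail record exists
}

function assertRecoveryPreconditions(req: RecoveryRequest): void {
  if (req.custodianApprovals.length < 2) {
    throw new Error('Abort: need written approval from 2-of-3 Custodians');
  }
  if (req.regulatoryRequest && !req.legalSignOff) {
    throw new Error('Abort: legal sign-off required for regulatory-driven recovery');
  }
  if (!req.hsmReady) {
    throw new Error('Abort: receiving HSM is not ready to import within 60 seconds');
  }
  if (!req.auditLogConfirmed) {
    throw new Error('Abort: immutable audit log record must exist first');
  }
}
```

Call this before emergencyRecoverPrivateKey so an unmet condition aborts the run rather than being discovered mid-recovery.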


Disaster Recovery Playbook

A disaster is any event that prevents the system from signing transactions: node loss, shard loss, compromise, or infrastructure failure.

Scenario 1 — Single MPC Node Goes Offline

With a 2-of-3 threshold, one node offline does not break signing. The two remaining nodes still produce valid signatures. This is a degraded state — not an emergency — but you must restore before another failure occurs.

Node A ✓             Node B ✗ (offline)             Node C ✓
   │                                                    │
   └──────────────────── 2-of-3 signing ────────────────┘  ← still operational

Response:

  1. Confirm nodes A and C are reachable and signing normally.
  2. Provision a replacement node.
  3. Any custodian imports their shard to the new node using restoreShardToNode (Export, Rotation & Restore).
  4. Verify the new keyId matches rootPublicKey and signing works end-to-end.
  5. Clear the old keyId record from your node registry.
Threshold is specifically designed for this

Losing one node is not an emergency — it is why you chose 2-of-3. The ceremony is not repeated; no key material is lost. You simply restore one shard to a new node.
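
Step 4 of the response can be scripted. A minimal sketch of the post-restore check; the NodeStatus shape is a hypothetical stand-in for whatever your node registry reports, not an SDK type:

```typescript
// Hypothetical shape for step 4's verification: the restored node must report
// the expected keyId and rootPublicKey before it re-enters the signing pool.
interface NodeStatus {
  keyId: string;
  rootPublicKey: string;
  online: boolean;
}

function verifyRestoredNode(
  restored: NodeStatus,
  expected: { keyId: string; rootPublicKey: string },
): boolean {
  return (
    restored.online &&
    restored.keyId === expected.keyId &&
    restored.rootPublicKey === expected.rootPublicKey
  );
}
```

Only after this check passes should you clear the old keyId record from your node registry (step 5), so a failed restore never leaves you without a known-good record.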


Scenario 2 — Loss of One Custodian Shard

A custodian loses access to their shard (forgotten passphrase, hardware failure, departed employee). You still have 2 shards from other custodians.

Response:

  1. Do not use RECOVER_PRIVATE_KEY — you still have quorum.
  2. Generate a new age identity for the replacement custodian (locally, per Custodian Setup).
  3. Use a remaining custodian's shard to localRewrapShard for the new custodian.
  4. Update your custodian registry.
Do not recover for a single lost shard

RECOVER_PRIVATE_KEY exposes the raw key. Use it only when you cannot sign at all. Losing one shard is a rotation event, not a disaster.


Scenario 3 — Loss of Two Custodian Shards (Catastrophic)

Two custodians are simultaneously unavailable. You are below threshold and cannot sign.

Response:

  1. Convene all available custodians and legal.
  2. SSS guardians for any remaining custodian reconstruct that custodian's age key (reconstructAgeKey from Custodian Setup).
  3. Restore the surviving shard to a temporary node (restoreShardToNode).
  4. You now have one node live — still below the 2-of-3 threshold. You need a second shard.
  5. If a second SSS reconstruction can be done for another custodian: restore a second shard. You are now at threshold and can sign.
  6. Re-export shards to newly onboarded custodians and rotate.
Why SSS guardians are critical here

The SSS guardian network is specifically designed for this scenario — when a custodian themselves is unavailable. M-of-N guardians can reconstruct the custodian's age key, which unlocks the shard, which allows the node to be restored. Without SSS, loss of two custodians means permanent loss.
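
The guardian reconstruction in step 2 relies on standard Shamir secret sharing. The sketch below is a generic Shamir implementation over a prime field, not White Rabbit's internal code, illustrating why any M of N guardian shares suffice while fewer reveal nothing:

```typescript
// Generic Shamir secret-sharing sketch (illustrative, not White Rabbit's code).
const P = 2n ** 127n - 1n; // a Mersenne prime; the secret must be below P

const mod = (a: bigint): bigint => ((a % P) + P) % P;

// Modular inverse via Fermat's little theorem (valid because P is prime)
function inv(a: bigint): bigint {
  let result = 1n;
  let base = mod(a);
  let e = P - 2n;
  while (e > 0n) {
    if (e & 1n) result = mod(result * base);
    base = mod(base * base);
    e >>= 1n;
  }
  return result;
}

// Split `secret` into n shares; any m of them reconstruct it
function split(secret: bigint, m: number, n: number): Array<[bigint, bigint]> {
  const coeffs = [secret]; // degree m-1 polynomial with the secret at x = 0
  for (let i = 1; i < m; i++) {
    coeffs.push(BigInt(Math.floor(Math.random() * 1e15)));
  }
  return Array.from({ length: n }, (_, i) => {
    const x = BigInt(i + 1);
    let y = 0n;
    let xp = 1n;
    for (const c of coeffs) {
      y = mod(y + c * xp);
      xp = mod(xp * x);
    }
    return [x, y] as [bigint, bigint];
  });
}

// Reconstruct the secret from any m shares (Lagrange interpolation at x = 0)
function reconstruct(shares: Array<[bigint, bigint]>): bigint {
  let secret = 0n;
  for (const [xi, yi] of shares) {
    let num = 1n;
    let den = 1n;
    for (const [xj] of shares) {
      if (xj === xi) continue;
      num = mod(num * -xj);
      den = mod(den * (xi - xj));
    }
    secret = mod(secret + yi * num * inv(den));
  }
  return secret;
}

// 3-of-5, matching the recommended guardian policy
const shares = split(123456789n, 3, 5);
const recovered = reconstruct(shares.slice(0, 3)); // any 3 shares work
```

Fewer than M shares constrain the polynomial to nothing: every candidate secret remains equally consistent, which is why guardian unavailability below threshold is safe while M guardians can always recover.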


Scenario 4 — Suspected Age Key or Shard Compromise

A custodian's device is breached, or you suspect their passphrase was exposed.

Response — assess blast radius first:

  • If only the shard file was copied: still protected by age encryption + passphrase. Rotate as a precaution.
  • If both the shard file and the passphrase were obtained: treat as full shard compromise. Move assets immediately.

Rotation (local, no API private key exposure):

// Run on a trusted machine, not the potentially compromised device
await localRewrapShard(
  './custodian-a-shard.json',
  './custodian-a-identity.json',
  '<exposed-passphrase>',
  { name: 'custodian-a-new', agePublicKey: '<new-A-public-key>' },
);
// Old wrappedKeyShare is now cryptographically worthless to the attacker —
// it was encrypted to the old age identity, which is now invalidated.
Why local re-wrapping neutralizes a leaked shard

If an attacker has wrappedKeyShare + the old age private key, they can decrypt the shard. After local re-wrapping to a new identity, the old blob is obsolete — decrypting it yields nothing because it no longer maps to an active wrapping. Speed matters: the window is between the attacker decrypting the shard and you completing rotation.


Scenario 5 — Full Infrastructure Loss

Both the MPC node network and all but one custodian shard are lost simultaneously. You cannot sign. This is an extreme scenario.

Response:

  1. Convene quorum + legal immediately.
  2. SSS guardians reconstruct the surviving custodian's age key.
  3. Restore the surviving shard to a temporary node.
  4. If a second shard can be reconstructed from SSS: import it. You reach threshold.
  5. Call RECOVER_PRIVATE_KEY with deleteAfterRecover: false initially.
  6. Import the raw key into an HSM within 60 seconds.
  7. Move all assets to a newly generated MPC key.
  8. Decommission the recovered key — never use a recovered raw key for ongoing operations.
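
Steps 5 and 6 can be wrapped in a timing guard so a slow HSM import fails loudly instead of silently exceeding the window. A hedged sketch; importToHsm is a stand-in for your HSM client's import call, not an SDK function:

```typescript
// Hypothetical wrapper for steps 5 and 6: reject if the HSM import does not
// complete inside the 60-second window.
async function importWithinWindow(
  privateKeyHex: string,
  importToHsm: (hex: string) => Promise<void>,
  windowMs = 60_000,
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error('HSM import exceeded the 60-second window')),
      windowMs,
    );
  });
  try {
    await Promise.race([importToHsm(privateKeyHex), deadline]);
  } finally {
    if (timer) clearTimeout(timer); // always cancel the deadline timer
  }
}
```

On rejection, treat the recovery as failed, record it in the audit log, and restart the procedure rather than retrying with the key still in process memory.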
A recovered raw key is a bridge, not an operating mode

A raw private key in an HSM is significantly less secure than a 2-of-3 MPC key. Use recovery to move assets to safety, then generate a fresh key immediately.


Attack Prevention Policy

Understanding how attacks happen — and why the MPC architecture defeats them — helps teams build with the right threat model.

Attack Surface Map

Your application ──► WR_API_KEY ──► White Rabbit API ──► Signing request
                                          │
                      ┌───────────────────┼───────────────────┐
                      ▼                   ▼                   ▼
                 ┌─────────┐         ┌─────────┐         ┌─────────┐
                 │ Node 1  │         │ Node 2  │         │ Node 3  │
                 │ (empty) │         │ (empty) │         │ (empty) │
                 └─────────┘         └─────────┘         └─────────┘
                      │     JIT import only during signing sessions
                      └──────────── threshold signature ──────────┘

Threat 1 — API Key Theft

An attacker steals your WR_API_KEY and can call signing operations on your behalf.

Mitigations:

  • Rotate the API key immediately on suspected exposure (Dashboard → API Keys → Revoke)
  • Use separate keys per environment — a leaked dev key cannot touch prod
  • Load keys from a secret manager (AWS Secrets Manager, Doppler, HashiCorp Vault), never .env files
  • Monitor for anomalous signing volume or unknown destination addresses
// Load from a secret manager — never from process.env in production
const apiKey = await secretsManager.getSecretValue('prod/whiterabbit/api-key');
const workspace = new WorkspaceClient({ apiKey });

Threat 2 — Single MPC Node Compromise

An attacker gains access to a node's stored key share.

Why it fails: With nodes empty at rest (JIT model), there is no key material on the node to steal. If a node is compromised during a signing session, the attacker has one share — useless alone. Threshold signing requires coordinated computation across ≥ 2 nodes simultaneously.

Mitigations:

  • Nodes empty at rest (deleteAfterExport: true) — compromise yields nothing
  • Use all 3 official nodes — reduces single-node value to zero
  • Monitor node health; anomalous latency or unexpected disconnects should alert

Threat 3 — Custodian Shard Theft

An attacker copies a custodian's shard file.

The layers an attacker must break:

wrappedKeyShare (stolen file)
        │
        ▼  age-encrypted (inner layer)
           requires custodian's age private key
        │
        ▼  envelopeEncrypt (outer layer)
           requires custodian's passphrase
        │
        ▼  PBKDF2 600K iterations + AES-256-GCM
           brute force is computationally infeasible

Even fully breaking this only yields one shard — useless below threshold. The attacker needs two custodians' shards simultaneously.
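
The outer envelope layer can be sketched with Node's built-in crypto using the parameters named above (PBKDF2 at 600K iterations, AES-256-GCM). This is an illustration of the construction, not White Rabbit's exact on-disk format; the SHA-256 PRF and the salt/IV sizes are assumptions:

```typescript
import { pbkdf2Sync, randomBytes, createCipheriv, createDecipheriv } from 'node:crypto';

// Sketch of a passphrase envelope: PBKDF2-SHA256 at 600K iterations derives
// the AES-256-GCM key; the GCM tag makes tampering detectable.
function envelopeEncrypt(plaintext: Buffer, passphrase: string) {
  const salt = randomBytes(16);
  const iv = randomBytes(12); // 96-bit nonce, standard for GCM
  const key = pbkdf2Sync(passphrase, salt, 600_000, 32, 'sha256');
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { salt, iv, ciphertext, tag: cipher.getAuthTag() };
}

function envelopeDecrypt(
  env: ReturnType<typeof envelopeEncrypt>,
  passphrase: string,
): Buffer {
  const key = pbkdf2Sync(passphrase, env.salt, 600_000, 32, 'sha256');
  const decipher = createDecipheriv('aes-256-gcm', key, env.iv);
  decipher.setAuthTag(env.tag); // final() throws if the tag does not verify
  return Buffer.concat([decipher.update(env.ciphertext), decipher.final()]);
}
```

Each passphrase guess costs 600,000 PBKDF2 iterations, which is what pushes offline brute force into infeasibility at diceware-level passphrase entropy.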

Mitigations:

  • Store shard files and identity files in separate physical locations
  • Require a hardware token (YubiKey) as a second factor for any shard access
  • SSS means that even the age private key is not held by a single person

Threat 4 — Insider Threat

A malicious employee with API key access tries to sign unauthorized transactions.

Mitigations:

  • Transaction allow-lists — only pre-approved addresses can be recipients
  • Value-based approval gates — transfers above a threshold require a second approval
  • Time-locks on large transfers — submit intent, wait 24h, then execute (window to detect and cancel)
  • Key separation — trading key, treasury key, and hot wallet are separate MPC keys
  • Log signing requests with the authenticated user's identity, not just the API key
async function signWithApproval(params: SignParams) {
  if (params.valueUsd > 10_000) {
    const approved = await approvalService.requestApproval({
      requester: params.requesterId,
      action: 'SIGN_TRANSACTION',
      value: params.valueUsd,
      destination: params.to,
    });
    if (!approved) throw new Error('Approval denied or timed out');
  }
  return workspace.call(ComponentModule.SIGN_WITH_KEY_SHARE, params.signParams).promise();
}

Threat 5 — Transaction Replay / Front-Running

A signed transaction is replayed on another chain, or an attacker observes a pending transaction and front-runs it.

Mitigations:

  • Always include chainId in transactions (EIP-155) — prevents cross-chain replay
  • Manage nonces carefully in multi-pod deployments — duplicate nonces cause one of the transactions to be dropped
  • Use a private mempool (Flashbots eth_sendBundle) for MEV-sensitive operations
  • EIP-712 typed-data (Permit2, Safe signatures) embeds chainId + contract address — cross-chain replay is structurally impossible
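
The chainId rule from the first bullet can be enforced before any signing call. A minimal sketch; the EvmTx shape is illustrative, not an SDK type:

```typescript
// Minimal guard for the EIP-155 rule: refuse to sign any transaction without
// an explicit chainId, since an unset chainId lets the signature replay on
// other chains. The EvmTx shape is illustrative.
interface EvmTx {
  to: string;
  value: bigint;
  nonce: number;
  chainId?: number; // EIP-155: must be set to the target chain
}

function assertReplaySafe(tx: EvmTx): void {
  if (tx.chainId === undefined || tx.chainId <= 0) {
    throw new Error('Refusing to sign: transaction is missing an EIP-155 chainId');
  }
}
```

Run this guard inside the signing service itself, not only in callers, so no code path can submit a replayable transaction.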

Threat 6 — Supply Chain Attack

A malicious dependency update injects code that exfiltrates key material or signs unauthorized transactions.

Mitigations:

  • Pin exact dependency versions (package-lock.json / yarn.lock) and audit all upgrades
  • Run npm audit in CI; fail on critical vulnerabilities
  • Use a private npm registry with curated, approved packages (Artifactory, Verdaccio)
  • Sign and verify CI build artifacts (GitHub Actions OIDC + SLSA provenance)
  • Never run RECOVER_PRIVATE_KEY in the same process as untrusted third-party code
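
The version-pinning rule can be spot-checked in CI. A minimal heuristic that flags dependencies declared with semver ranges instead of exact versions; it deliberately does not cover every range syntax npm accepts:

```typescript
// Flag dependencies declared with a semver range (^, ~, wildcards,
// comparators, hyphen or || ranges) instead of an exact pinned version.
function findUnpinned(deps: Record<string, string>): string[] {
  const rangePattern = /^[~^]|[<>*x]|\s-\s|\|\|/;
  return Object.entries(deps)
    .filter(([, version]) => rangePattern.test(version))
    .map(([name]) => name);
}
```

Feed it the dependencies object from package.json and fail the build if it returns anything; the lockfile then stays the single source of truth for upgrades.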

Institutional Policy Summary

Concern                   Recommendation
Age key generation        age-keygen locally on each custodian's device — never via the API
Age key protection        AES-256-GCM / PBKDF2 600K passphrase before writing to disk
Age key backup            SSS 3-of-5; each share encrypted with a separate guardian passphrase
MPC node policy           Nodes empty at rest; import only to sign, delete immediately after
Re-wrapping               Always local using the age-encryption npm package — never REWRAPPING_KEY_SHARE
keyId storage             Primary DB + read replica + encrypted off-site file + printed copy
Shard storage             3 custodians, geographically separated, 3-2-1 rule
Passphrase strength       ≥ 24-character diceware or hardware token (YubiKey FIDO2)
Rotation cadence          Every 90 days or on any custodian/guardian change
Recovery authorization    2-of-3 custodian written approval + legal sign-off
Recovery destination      Certified HSM (AWS CloudHSM, Thales Luna) within 60 seconds
Audit logging             Immutable log (CloudTrail, SIEM) for all key operations
Access control            Key users cannot export; custodians cannot call signing endpoints
Disaster recovery drill   Full restore-from-backup test every quarter

Pre-Launch Policy Checklist

Generation & Backup

  • Key generated: threshold: 2, all 3 official servers
  • keyId in primary DB, read replica, and printed off-site backup
  • rootPublicKey verified on-chain against a derivation path
  • 3 custodian age identities generated locally — not via API
  • Each age key split with SSS (3-of-5 recommended); shares distributed
  • 3 shards exported with deleteAfterExport: true — nodes verified empty
  • Restore drill completed: imported shard to test node, signing verified

Access Control

  • Production API key is fresh (not reused from development)
  • API keys loaded from a secret manager, not from .env files
  • Signing service account cannot export or recover keys
  • Custodians cannot call signing endpoints directly
  • Transaction allow-list configured for all signing operations

Monitoring & Response

  • Audit log for all key operations (generation, export, import, sign, recover)
  • Alerts set: anomalous signing volume, new destination addresses, off-hours signing
  • Incident runbook documented and rehearsed for all 5 disaster scenarios
  • Emergency contact list current: custodians, SSS guardians, legal, HSM provider

Ongoing Operations

  • Rotation scheduled every 90 days in the team calendar
  • Quarterly recovery drill scheduled
  • Custodian offboarding procedure documented and tested
  • Dependency audit running in CI (npm audit)