
Operations Runbook

Use this runbook for on-call diagnosis and safe recovery actions.

1) Store Failures

Symptoms

  • frequent STORAGE_DEAL_FAILED
  • rising retry counts
  • no new CIDs in active index

Actions

  1. check provider availability and credentials
  2. verify budget and balance constraints
  3. confirm the payload meets the size and shape constraints your app enforces
  4. retry with lower replication for non-critical data if policy allows
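
The retry step above can be sketched as a small planning function. This is a minimal sketch, not the SDK's retry logic: `RetryStep`, `planStoreRetries`, and every parameter are hypothetical names introduced for illustration. It assumes replication is an integer copy count and that policy allows stepping it down only for non-critical data.

```typescript
// Hypothetical shapes: none of these names come from the SDK.
interface RetryStep {
  attempt: number;
  replication: number;
  delayMs: number;
}

// Build a retry plan with exponential backoff. The first retry keeps the
// original replication; later retries step it down by one copy per attempt,
// but only for non-critical data and never below minReplication.
function planStoreRetries(
  initialReplication: number,
  critical: boolean,
  maxAttempts = 4,
  baseDelayMs = 1000,
  minReplication = 1,
): RetryStep[] {
  const steps: RetryStep[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const replication = critical
      ? initialReplication
      : Math.max(minReplication, initialReplication - attempt);
    steps.push({
      attempt: attempt + 1,
      replication,
      delayMs: baseDelayMs * 2 ** attempt,
    });
  }
  return steps;
}
```

Keeping the plan as plain data lets on-call review it before any write is retried, and leaves the actual store call to whatever client the app already uses.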

2) Budget Exhaustion

Symptoms

  • BUDGET_EXCEEDED
  • writes blocked while reads still work

Actions

  1. inspect spend trend and write patterns
  2. downgrade writes marked critical where that level is unnecessary
  3. lower TTL/replication for disposable workloads
  4. increase maxSpendUSDFC only with explicit approval
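
To make the spend-trend inspection concrete, the following sketch estimates how many hours remain before the cap is reached. It assumes you can export cumulative spend samples from your own metrics. `SpendSample` and `hoursUntilBudgetExhausted` are hypothetical names, and the linear extrapolation is a simplification, not how the SDK enforces `maxSpendUSDFC`.

```typescript
// Hypothetical sample shape: cumulative spend observed at a point in time.
interface SpendSample {
  atMs: number;
  spentUSDFC: number;
}

// Linear extrapolation from the first to the last sample.
// Returns 0 if the cap is already reached, and null when the trend cannot be
// estimated (fewer than two samples, no elapsed time, or spend not growing).
function hoursUntilBudgetExhausted(
  samples: SpendSample[],
  maxSpendUSDFC: number,
): number | null {
  if (samples.length < 2) return null;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const remaining = maxSpendUSDFC - last.spentUSDFC;
  if (remaining <= 0) return 0;
  const elapsedHours = (last.atMs - first.atMs) / 3600000;
  if (elapsedHours <= 0) return null;
  const ratePerHour = (last.spentUSDFC - first.spentUSDFC) / elapsedHours;
  if (ratePerHour <= 0) return null;
  return remaining / ratePerHour;
}
```

A short horizon with a steady write pattern points at TTL or replication settings; a sudden drop in the horizon usually points at a new or misbehaving writer.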

3) Delegation Access Issues

Symptoms

  • DELEGATION_VERIFICATION_FAILED
  • DELEGATION_EXPIRED

Actions

  1. confirm caller DID normalization and exact value
  2. check token expiration, allowing for clock skew between issuer and verifier
  3. re-issue token with bounded TTL
  4. validate token CID path (wrong CID is common in relay systems)
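
The first two checks can be sketched as follows. This assumes the token carries its expiry as Unix seconds and that DID normalization means trimming whitespace only; both are assumptions to verify against your token format. `classifyExpiry` and `didsMatch` are hypothetical helper names, not SDK functions.

```typescript
type ExpiryStatus = "valid" | "within_skew" | "expired";

// Classify a token's expiry relative to the local clock.
// "within_skew" means the token expired less than skewSeconds ago, so a clock
// difference between issuer and verifier is a plausible cause.
function classifyExpiry(
  expSeconds: number, // assumed: expiry as Unix seconds
  nowMs: number,
  skewSeconds = 60,
): ExpiryStatus {
  const nowSeconds = Math.floor(nowMs / 1000);
  if (nowSeconds < expSeconds) return "valid";
  if (nowSeconds < expSeconds + skewSeconds) return "within_skew";
  return "expired";
}

// Compare DIDs exactly after trimming whitespace. No case folding is applied,
// because method-specific identifiers (for example base58 keys) are often
// case-sensitive.
function didsMatch(callerDid: string, expectedDid: string): boolean {
  return callerDid.trim() === expectedDid.trim();
}
```

A "within_skew" result suggests fixing time synchronization before re-issuing tokens; an "expired" result well outside the window suggests the token TTL or the relay's cached token CID.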

4) Renewal Not Happening

Symptoms

  • items nearing expiry with no renew events
  • expected scheduled renew reports missing

Actions

  1. confirm renewal job is running
  2. inspect the renewThresholdDays setting and how force is being used
  3. verify wallet funding and signature path if enabled
  4. confirm provider renew endpoint connectivity
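
To check whether the renewal job should have fired, you can recompute which items fall inside the threshold. This is a minimal sketch: `IndexedItem` and `itemsDueForRenewal` are hypothetical names, and treating `force` as "renew everything regardless of the threshold" is an assumption about its behavior, not a statement of what the SDK does.

```typescript
// Hypothetical index entry: a CID plus its expiry time in epoch milliseconds.
interface IndexedItem {
  cid: string;
  expiresAtMs: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the items that expire within renewThresholdDays of nowMs.
// Already-expired items are included. With force=true (assumed semantics),
// every item is returned.
function itemsDueForRenewal(
  items: IndexedItem[],
  nowMs: number,
  renewThresholdDays: number,
  force = false,
): IndexedItem[] {
  if (force) return [...items];
  const cutoffMs = nowMs + renewThresholdDays * DAY_MS;
  return items.filter((item) => item.expiresAtMs <= cutoffMs);
}
```

If this returns items but no renew events exist for them, the job is either not running or failing downstream (funding, signing, or provider connectivity).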

5) Recovery Validation (Cold Start Drill)

Run monthly:

  1. start a clean process
  2. initialize SDK with expected identity/key configuration
  3. load active index and validate known canary CIDs
  4. retrieve canary payload and verify checksum

Document time-to-recovery and any manual intervention.
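
The checksum step of the drill might look like the sketch below. It assumes canary checksums are stored as hex-encoded SHA-256 digests, which this runbook does not specify, and `verifyCanary` is a hypothetical helper. Retrieval itself is left to the SDK. Only `createHash` from Node's built-in `crypto` module is used.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 digest of a payload.
function sha256Hex(payload: Uint8Array): string {
  return createHash("sha256").update(payload).digest("hex");
}

// Compare a retrieved canary payload against its recorded checksum.
// The comparison is case-insensitive on the hex string.
function verifyCanary(payload: Uint8Array, expectedSha256Hex: string): boolean {
  return sha256Hex(payload) === expectedSha256Hex.trim().toLowerCase();
}
```

Recording the checksum when the canary is first stored, and keeping it outside the system under test, keeps the drill meaningful even if the active index itself is damaged.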

Escalation Artifacts

Capture before escalation:

  • failing CID/token CID samples
  • exact error code and message
  • provider/network identifiers
  • spend snapshot and configuration hash
  • recent deploy/version metadata
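
One way to produce a configuration hash is to hash a key-sorted JSON rendering of the effective configuration, so that the same settings always yield the same digest. This is a sketch under assumptions: `canonicalJson` and `configHash` are hypothetical helpers, and the runbook does not define how the hash should be computed. Redact secrets before hashing or sharing the configuration.

```typescript
import { createHash } from "node:crypto";

// Render a JSON-compatible value with object keys sorted, so that key order
// does not affect the output. Intended for plain JSON data only.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return (
      "{" +
      keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k])).join(",") +
      "}"
    );
  }
  return JSON.stringify(value);
}

// SHA-256 over the canonical rendering, hex-encoded.
function configHash(config: Record<string, unknown>): string {
  return createHash("sha256").update(canonicalJson(config)).digest("hex");
}
```

Two deployments with the same hash can be assumed to share settings, which narrows an escalation to code or provider differences.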

Released under the MIT License.