Operations Runbook
Use this runbook for on-call diagnosis and safe recovery actions.
1) store Failures
Symptoms
- frequent STORAGE_DEAL_FAILED
- rising retry counts
- no new CIDs in active index
Actions
- check provider availability and credentials
- verify budget and balance constraints
- confirm payload size/shape constraints in your app
- retry with lower replication for non-critical data if policy allows
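The last action above can be sketched as a small decision helper. This is a minimal sketch under stated assumptions: the function name, the policy flag, and the replication floor are hypothetical and not taken from any SDK.

```python
def next_replication(current: int, is_critical: bool,
                     policy_allows_downgrade: bool, floor: int = 1) -> int:
    """Return the replication factor to use for the next retry.

    All names here are hypothetical; adapt them to your own policy model.
    """
    if current < 1 or floor < 1:
        raise ValueError("replication values must be >= 1")
    # Critical data and policy-locked workloads keep their replication.
    if is_critical or not policy_allows_downgrade:
        return current
    # Step down by one copy per retry, never below the floor,
    # and never above the current value.
    return min(current, max(floor, current - 1))
```

Stepping down one copy at a time keeps the change small and easy to revert once the provider recovers.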
2) Budget Exhaustion
Symptoms
- BUDGET_EXCEEDED
- writes blocked while reads still work
Actions
- inspect spend trend and write patterns
- reduce critical usage where unnecessary
- lower TTL/replication for disposable workloads
- increase maxSpendUSDFC only with explicit approval
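When inspecting the spend trend, a rough projection of how long the remaining budget will last helps decide whether to reduce usage or request approval for a higher cap. The sketch below assumes you can export daily spend samples and the running total yourself; the function and parameter names are hypothetical, and the mean-based burn rate is a simplification.

```python
from typing import Optional, Sequence


def days_until_exhausted(daily_spend: Sequence[float], spent_total: float,
                         max_spend: float) -> Optional[float]:
    """Project how many days remain before the spend cap is reached.

    Uses the mean of the supplied daily samples as the burn rate.
    Returns 0.0 if the cap is already reached, and None if there is
    no usable trend (no samples, or a non-positive burn rate).
    """
    remaining = max_spend - spent_total
    if remaining <= 0:
        return 0.0
    if not daily_spend:
        return None
    burn_rate = sum(daily_spend) / len(daily_spend)
    if burn_rate <= 0:
        return None
    return remaining / burn_rate
```

For example, with samples of 1.0 and 3.0 per day, 6.0 already spent, and a cap of 10.0, the projection is 2.0 days.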
3) Delegation Access Issues
Symptoms
- DELEGATION_VERIFICATION_FAILED
- DELEGATION_EXPIRED
Actions
- confirm caller DID normalization and exact value
- check token expiration and clock skew
- re-issue token with bounded TTL
- validate token CID path (wrong CID is common in relay systems)
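The expiration and DID checks above can be approximated locally before re-issuing a token. This is a sketch under assumptions: the token's expiry is taken to be a Unix timestamp, the 60-second skew tolerance is an arbitrary default, and both function names are hypothetical. The DID comparison deliberately trims only surrounding whitespace, since any further normalization depends on the DID method in use.

```python
import time
from typing import Optional


def expiry_status(exp_unix: float, now_unix: Optional[float] = None,
                  skew_seconds: float = 60) -> str:
    """Classify a token expiry relative to a clock-skew window."""
    now = time.time() if now_unix is None else now_unix
    if now > exp_unix + skew_seconds:
        return "expired"
    if now > exp_unix - skew_seconds:
        # Outcome may differ depending on whose clock does the check.
        return "within_skew"
    return "valid"


def same_did(caller_did: str, expected_did: str) -> bool:
    """Exact comparison after trimming surrounding whitespace only."""
    return caller_did.strip() == expected_did.strip()
```

A "within_skew" result suggests comparing the issuer and verifier clocks before assuming the token itself is wrong.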
4) Renewal Not Happening
Symptoms
- items nearing expiry with no renew events
- expected scheduled renew reports missing
Actions
- confirm renewal job is running
- inspect renewThresholdDays and force behavior
- verify wallet funding and signature path if enabled
- confirm provider renew endpoint connectivity
5) Recovery Validation (Cold Start Drill)
Run monthly:
- start a clean process
- initialize SDK with expected identity/key configuration
- load active index and validate known canary CIDs
- retrieve canary payload and verify checksum
Document time-to-recovery and any manual intervention.
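The canary verification and timing steps of the drill can be sketched as follows. This assumes SHA-256 hex digests as the checksum format and a caller-supplied fetch function that wraps whatever retrieval call your SDK exposes; both are assumptions, and all names are hypothetical.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


def run_canary_check(fetch: Callable[[str], bytes],
                     canaries: Dict[str, str]) -> Tuple[Dict[str, bool], float]:
    """Fetch each canary CID and compare its SHA-256 to the expected digest.

    Returns per-CID pass/fail results and the elapsed time in seconds.
    """
    started = time.monotonic()
    results: Dict[str, bool] = {}
    for cid, expected_hex in canaries.items():
        try:
            payload = fetch(cid)
            ok = hashlib.sha256(payload).hexdigest() == expected_hex.lower()
        except Exception:
            # Any retrieval error counts as a failed canary.
            ok = False
        results[cid] = ok
    return results, time.monotonic() - started
```

The elapsed time covers retrieval and verification only; record process start and SDK initialization separately if they matter for your time-to-recovery figure.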
Escalation Artifacts
Capture before escalation:
- failing CID/token CID samples
- exact error code and message
- provider/network identifiers
- spend snapshot and configuration hash
- recent deploy/version metadata
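One way to produce a stable configuration hash and bundle the artifacts above is sketched below. It assumes the configuration is JSON-serializable and that secrets have been removed before hashing; the field names in the bundle are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def config_hash(config: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_escalation_bundle(error_code: str, error_message: str,
                            cids: List[str], provider: str,
                            spend_snapshot: Dict[str, Any],
                            config: Dict[str, Any], version: str) -> str:
    """Serialize the escalation artifacts as a single JSON document."""
    bundle = {
        "error": {"code": error_code, "message": error_message},
        "cids": cids,
        "provider": provider,
        "spend_snapshot": spend_snapshot,
        "config_hash": config_hash(config),
        "version": version,
    }
    return json.dumps(bundle, sort_keys=True, indent=2)
```

Hashing the configuration rather than attaching it keeps the bundle small and lets responders confirm which configuration was live without copying its contents into a ticket.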