Operations Runbook

Day-to-day operation and incident handling for AiCordCloud.

Service lifecycle

  • Start: use the process manager's start command
  • Restart: use a rolling restart when more than one instance is running
  • Stop: only during maintenance windows
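
A rolling restart takes one instance out at a time and waits for it to report healthy before touching the next, so capacity never drops by more than one instance. The sketch below shows that loop in Python. The `restart` and `is_healthy` callables are hypothetical and supplied by the caller; nothing here invokes a real process manager, and the retry count and delay are placeholder defaults.

```python
import time


def rolling_restart(instances, restart, is_healthy, retries=10, delay_s=1.0):
    """Restart instances one by one; stop at the first that fails to recover.

    `restart` and `is_healthy` are hypothetical callables that take an
    instance name. Returns the list of instances restarted successfully.
    """
    done = []
    for name in instances:
        restart(name)
        for _ in range(retries):
            if is_healthy(name):
                break
            time.sleep(delay_s)
        else:
            # The instance never became healthy: halt here so the
            # remaining instances keep serving traffic untouched.
            raise RuntimeError(f"{name} unhealthy after restart; completed: {done}")
        done.append(name)
    return done
```

Halting on the first unhealthy instance is a deliberate choice in this sketch: continuing would risk taking down every instance with the same bad build or config.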

Daily checks

  1. Health endpoint status
  2. Error rate in logs
  3. p95 latency trend
  4. Queue pressure
  5. Upstream fallback ratio
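
Checks 2, 3, and 5 reduce to simple arithmetic over a window of request records. A minimal sketch in Python follows, assuming a hypothetical record shape (`status`, `latency_ms`, `used_fallback`). The field names, the 5xx definition of an error, and the nearest-rank percentile method are illustrative choices, not a description of AiCordCloud's actual metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical record shape; adapt to what your logs actually contain.
    status: int
    latency_ms: float
    used_fallback: bool


def error_rate(records):
    """Fraction of requests that returned a 5xx status."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.status >= 500) / len(records)


def p95_latency(records):
    """95th-percentile latency using the nearest-rank method."""
    if not records:
        return 0.0
    ordered = sorted(r.latency_ms for r in records)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def fallback_ratio(records):
    """Fraction of requests served through the upstream fallback path."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.used_fallback) / len(records)
```

Comparing each value against the previous day's figure, rather than against a fixed threshold alone, is what turns these numbers into the trend the checklist asks for.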

Weekly checks

  1. API key rotation policy audit
  2. Dependency update window
  3. Backup validation (configs, env templates, docs)
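
The key rotation audit in item 1 can be sketched as an age check against a maximum key age. The 90-day default and the `ApiKey` shape below are assumptions for illustration; substitute whatever the rotation policy actually specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ApiKey:
    # Hypothetical inventory entry; created_at should be timezone-aware.
    key_id: str
    created_at: datetime


def keys_due_for_rotation(keys, now, max_age=timedelta(days=90)):
    """Return IDs of keys older than max_age as of `now`, sorted for stable output."""
    return sorted(k.key_id for k in keys if now - k.created_at > max_age)


# Example: audit a small inventory as of a fixed date.
inventory = [
    ApiKey("svc-a", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ApiKey("svc-b", datetime(2024, 5, 1, tzinfo=timezone.utc)),
]
overdue = keys_due_for_rotation(inventory, now=datetime(2024, 6, 1, tzinfo=timezone.utc))
```

Passing `now` explicitly, instead of reading the clock inside the function, keeps the audit reproducible when it is rerun against the same inventory.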

Incident procedure

  1. Classify severity (P1/P2/P3)
  2. Capture symptoms and timeline
  3. Mitigate first (failover, temporary limits)
  4. Run a root cause analysis (RCA) after stabilization
  5. Publish internal incident summary
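
One way to keep steps 1, 2, and 5 consistent is to record each incident in a small structure and render the summary from it. The following is a sketch only: the `Incident` fields and the summary layout are assumptions, not a prescribed AiCordCloud format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


@dataclass
class Incident:
    title: str
    severity: Severity
    timeline: list = field(default_factory=list)  # (timestamp, note) pairs

    def log(self, timestamp, note):
        """Append a timeline entry; timestamps are caller-supplied strings."""
        self.timeline.append((timestamp, note))

    def summary(self):
        """Render a plain-text summary with the timeline sorted by timestamp.

        Sorting is lexicographic, so it is chronological only if timestamps
        use a sortable format such as ISO 8601 in a single time zone.
        """
        lines = [f"[{self.severity.value}] {self.title}"]
        lines += [f"  {ts}  {note}" for ts, note in sorted(self.timeline)]
        return "\n".join(lines)
```

Logging entries as they happen, even out of order, and sorting only at render time means the timeline captured in step 2 can feed the summary in step 5 without manual rework.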