Operations Runbook
Day-to-day operation and incident handling for AiCordCloud.
Service lifecycle
- Start: use the process manager's start command
- Restart: use a rolling restart (one instance at a time) when multiple instances are running
- Stop: only during maintenance windows
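The rolling restart above can be sketched as follows. This is a minimal illustration, not the project's actual tooling: `restart` and `is_healthy` are hypothetical callables that would wrap whatever process manager and health endpoint are in use, and the timeout values are placeholders.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    instances: Iterable[str],
    restart: Callable[[str], None],      # hypothetical hook into the process manager
    is_healthy: Callable[[str], bool],   # hypothetical health endpoint probe
    timeout_s: float = 60.0,             # placeholder value, tune per service
    poll_s: float = 1.0,
) -> List[str]:
    """Restart instances one at a time, waiting for each to report healthy.

    Returns the names of the instances that were restarted successfully.
    Raises RuntimeError (and stops the rollout) if an instance does not
    become healthy within timeout_s, so the remaining instances keep serving.
    """
    done: List[str] = []
    for name in instances:
        restart(name)
        deadline = time.monotonic() + timeout_s
        while not is_healthy(name):
            if time.monotonic() >= deadline:
                raise RuntimeError(f"{name} did not become healthy; aborting rollout")
            time.sleep(poll_s)
        done.append(name)
    return done
```

Stopping at the first unhealthy instance is a deliberate choice in this sketch: it limits the damage of a bad release to a single instance.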
Daily checks
- Health endpoint status
- Error rate in logs
- p95 latency trend
- Queue pressure
- Upstream fallback ratio
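Three of the daily checks reduce to simple arithmetic over request counts and latency samples. The sketch below assumes the numbers have already been extracted from logs or metrics; the function names are illustrative, and "fallback ratio" is taken here to mean the share of requests served through a fallback upstream, which is an assumption about the intended definition.

```python
from typing import Dict, Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n gives the 1-based nearest rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def ratio(part: int, total: int) -> float:
    """part / total, or 0.0 when there was no traffic."""
    return part / total if total else 0.0


def daily_summary(
    latencies_ms: Sequence[float],
    requests: int,
    errors: int,
    fallbacks: int,  # assumed: requests served via a fallback upstream
) -> Dict[str, float]:
    return {
        "p95_ms": p95(latencies_ms),
        "error_rate": ratio(errors, requests),
        "fallback_ratio": ratio(fallbacks, requests),
    }
```

Comparing each day's summary with the previous days' values gives the trend the checklist asks for.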
Weekly checks
- API key rotation policy audit
- Dependency update window
- Backup validation (configs, env templates, docs)
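As one possible aid for the key rotation audit, the sketch below flags keys older than a maximum age. The key inventory (a mapping from key name to creation time) and the age limit are assumptions; the actual rotation policy is not specified here.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def keys_due_for_rotation(
    created_at: Dict[str, datetime],  # hypothetical inventory: key name -> creation time
    max_age: timedelta,               # taken from the rotation policy
    now: datetime,
) -> List[str]:
    """Return the names of keys older than max_age, sorted alphabetically."""
    return sorted(name for name, ts in created_at.items() if now - ts > max_age)
```

Passing `now` explicitly keeps the check deterministic and easy to test.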
Incident procedure
- Classify severity (P1/P2/P3)
- Capture symptoms and timeline
- Mitigate first (for example, failover or temporary limits)
- Run root cause analysis (RCA) once the service is stable
- Publish internal incident summary
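The procedure above implies a small record per incident: a severity, a timeline of observations, and a summary to publish. A minimal sketch follows; the class and field names are illustrative and not an existing AiCordCloud API, and the criteria for each severity level are left to the team.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Tuple


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


@dataclass
class Incident:
    title: str
    severity: Severity
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, note: str, at: Optional[datetime] = None) -> None:
        """Record a symptom, mitigation, or finding; defaults to the current UTC time."""
        self.timeline.append((at or datetime.now(timezone.utc), note))

    def summary(self) -> str:
        """Plain-text summary with timeline entries in chronological order."""
        lines = [f"[{self.severity.value}] {self.title}"]
        for ts, note in sorted(self.timeline, key=lambda entry: entry[0]):
            lines.append(f"{ts.isoformat()} {note}")
        return "\n".join(lines)
```

Logging entries as they happen, rather than reconstructing them afterwards, makes the later RCA and the internal summary easier to write.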