Disaster Recovery Runbook
At Cyberun Cloud, we do not rely on luck; we prepare for the worst. This document outlines the standard recovery procedures for region-level failures.
Recovery Objectives
Based on our asynchronous geo-mirroring architecture, we commit to the following metrics:
- RPO (Recovery Point Objective): < 10 seconds. Maximum data loss is limited to writes made within the 10 seconds immediately preceding the failure.
- RTO (Recovery Time Objective): < 5 minutes. Measured from the moment a failure is confirmed as unrecoverable to services coming back online in the standby region.
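As a concrete illustration of how the RPO commitment can be watched in practice, the sketch below compares the standby mirror's last-applied write timestamp against the current time and raises an alert once replication lag reaches the 10-second budget. The status endpoint, field name, and alerting path are assumptions for the sketch, not existing production tooling.

```python
import json
import time
import urllib.request

RPO_BUDGET_SECONDS = 10  # from the stated RPO: < 10 seconds of data loss

# Hypothetical endpoint exposed by the asynchronous geo-mirror in the standby
# region; assumed to report the timestamp of the last write applied there.
MIRROR_STATUS_URL = "http://mirror-status.standby.internal/last-applied"


def replication_lag_seconds() -> float:
    """Return how far the standby mirror trails the primary, in seconds."""
    with urllib.request.urlopen(MIRROR_STATUS_URL, timeout=5) as resp:
        payload = json.load(resp)
    last_applied = float(payload["last_applied_unix"])  # assumed field name
    return time.time() - last_applied


def check_rpo() -> None:
    lag = replication_lag_seconds()
    if lag >= RPO_BUDGET_SECONDS:
        # In production this would page the on-call engineer; printing keeps
        # the sketch self-contained.
        print(f"ALERT: replication lag {lag:.1f}s exceeds the RPO budget")
    else:
        print(f"OK: replication lag {lag:.1f}s is within the RPO budget")


if __name__ == "__main__":
    check_rpo()
```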
Failover Workflow
```mermaid
stateDiagram-v2
    direction TB
    state "Normal Operation" as Normal
    state "Failure Detection" as Detect
    state "Karmada Reschedule" as Reschedule
    state "DNS Switchover" as DNS
    state "Service Restored" as Recovered
    Normal --> Detect : Heartbeat Miss > 30s
    Detect --> Reschedule : Mark Cluster Unhealthy
    Reschedule --> DNS : Update GeoDNS Records
    DNS --> Recovered : Route Traffic to Standby
```
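The same workflow can be expressed as a minimal state machine, which is handy for unit-testing the failover automation without touching real clusters. The states and transition triggers below mirror the diagram; the class itself is illustrative and not an existing component.

```python
from enum import Enum, auto

HEARTBEAT_MISS_THRESHOLD_SECONDS = 30  # from the diagram: Heartbeat Miss > 30s


class FailoverState(Enum):
    NORMAL = auto()          # Normal Operation
    DETECT = auto()          # Failure Detection
    RESCHEDULE = auto()      # Karmada Reschedule
    DNS_SWITCHOVER = auto()  # DNS Switchover
    RECOVERED = auto()       # Service Restored


class FailoverWorkflow:
    """Illustrative state machine mirroring the failover diagram above."""

    def __init__(self) -> None:
        self.state = FailoverState.NORMAL

    def on_heartbeat_miss(self, seconds_since_last_heartbeat: float) -> None:
        if (self.state is FailoverState.NORMAL
                and seconds_since_last_heartbeat > HEARTBEAT_MISS_THRESHOLD_SECONDS):
            self.state = FailoverState.DETECT

    def on_cluster_marked_unhealthy(self) -> None:
        if self.state is FailoverState.DETECT:
            self.state = FailoverState.RESCHEDULE

    def on_geodns_records_updated(self) -> None:
        if self.state is FailoverState.RESCHEDULE:
            self.state = FailoverState.DNS_SWITCHOVER

    def on_traffic_routed_to_standby(self) -> None:
        if self.state is FailoverState.DNS_SWITCHOVER:
            self.state = FailoverState.RECOVERED


# Example drill run walking through every transition in order.
if __name__ == "__main__":
    wf = FailoverWorkflow()
    wf.on_heartbeat_miss(45)              # > 30s miss -> Failure Detection
    wf.on_cluster_marked_unhealthy()      # -> Karmada Reschedule
    wf.on_geodns_records_updated()        # -> DNS Switchover
    wf.on_traffic_routed_to_standby()     # -> Service Restored
    assert wf.state is FailoverState.RECOVERED
```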
Drill Scenario: Total Blackout in New York
Assuming the Destroyer (NY) cluster goes completely offline due to force majeure:
- Auto-Detection: The Tokyo control plane (`Carrier`) detects the New York node status as `Unknown` within 30 seconds.
- Liveness Probe: The system triggers external ping tests against the New York edge gateways to confirm the outage is not merely control-plane network jitter.
- Evacuation Command: The operations team (or an automated Operator) executes `karmadactl cordon destroyer-ny` to mark the cluster unschedulable (scripted in the sketch after this list).
- Workload Migration: Karmada automatically scales up Deployment replicas in `Aegis (DE)` or other standby clusters.
- Storage Mounting: The standby cluster connects via WireGuard to the offsite data replica (if primary storage is also offline, it connects to the asynchronous mirror replica).
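The first three drill steps can be scripted end to end, as sketched below: read the cluster condition from the Karmada control plane, confirm the outage with an external ping, and only then cordon the cluster. The kubectl context name and gateway address are placeholders, and the exact `karmadactl cordon` syntax should be verified against the installed version.

```python
import json
import subprocess

# Assumed names for illustration; adjust to the real environment.
KARMADA_CONTEXT = "karmada-apiserver"    # kubectl context for the Karmada control plane
FAILED_CLUSTER = "destroyer-ny"
EDGE_GATEWAY = "edge-gw.ny.example.com"  # placeholder New York edge gateway


def cluster_is_unknown() -> bool:
    """Check whether the Karmada control plane reports the cluster as not Ready."""
    out = subprocess.run(
        ["kubectl", "--context", KARMADA_CONTEXT,
         "get", "cluster", FAILED_CLUSTER, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    conditions = json.loads(out).get("status", {}).get("conditions", [])
    ready = next((c for c in conditions if c["type"] == "Ready"), None)
    return ready is None or ready["status"] != "True"


def edge_gateway_reachable() -> bool:
    """External liveness probe: rule out control-plane network jitter."""
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "2", EDGE_GATEWAY],
        capture_output=True,
    )
    return result.returncode == 0


def cordon_cluster() -> None:
    """Mark the failed cluster unschedulable so Karmada evacuates its workloads."""
    subprocess.run(["karmadactl", "cordon", FAILED_CLUSTER], check=True)


if __name__ == "__main__":
    if cluster_is_unknown() and not edge_gateway_reachable():
        cordon_cluster()
        print(f"Cluster {FAILED_CLUSTER} cordoned; Karmada will reschedule workloads.")
    else:
        print("Outage not confirmed; no action taken.")
```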