
Disaster Recovery Runbook

At Cyberun Cloud, we do not rely on luck; we prepare for the worst. This document outlines the standard recovery procedures for region-level failures.

Recovery Objectives

Based on our asynchronous geo-mirroring architecture, we commit to the following metrics:

  • RPO (Recovery Point Objective): < 10 seconds. Maximum data loss is limited to writes made in the 10 seconds immediately preceding the failure (a monitoring sketch for this budget follows the list).
  • RTO (Recovery Time Objective): < 5 minutes. Measured from the moment a failure is confirmed as unrecoverable to services coming back online in the standby region.
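
The RPO budget only holds if replication lag is watched continuously. Below is a minimal monitoring sketch in Python; the get_mirror_lag_seconds helper, the 80% warning threshold, and the print-based alerting are illustrative assumptions, not our actual tooling.

import time

RPO_BUDGET_SECONDS = 10.0   # from the objective above
WARN_FRACTION = 0.8         # warn before the budget is fully consumed

def get_mirror_lag_seconds() -> float:
    # Hypothetical stand-in: in practice, query the geo-mirroring layer's
    # metrics (e.g. the age of the last write applied on the standby).
    return 0.0

def watch_rpo(poll_interval: float = 5.0) -> None:
    """Poll replication lag and flag it as it approaches the RPO budget."""
    while True:
        lag = get_mirror_lag_seconds()
        if lag >= RPO_BUDGET_SECONDS:
            print(f"RPO BREACH: mirror lag {lag:.1f}s exceeds the {RPO_BUDGET_SECONDS:.0f}s budget")
        elif lag >= RPO_BUDGET_SECONDS * WARN_FRACTION:
            print(f"WARNING: mirror lag {lag:.1f}s is close to the RPO budget")
        time.sleep(poll_interval)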

Failover Workflow

stateDiagram-v2
    direction TB

    state "Normal Operation" as Normal
    state "Failure Detection" as Detect
    state "Karmada Reschedule" as Reschedule
    state "DNS Switchover" as DNS
    state "Service Restored" as Recovered

    Normal --> Detect : Heartbeat Miss > 30s
    Detect --> Reschedule : Mark Cluster Unhealthy
    Reschedule --> DNS : Update GeoDNS Records
    DNS --> Recovered : Route Traffic to Standby
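
The diagram above can be read as a small state machine. The Python sketch below mirrors its states and triggers for illustration only; it is not the actual failover controller, and the trigger names are paraphrases of the transition labels.

from enum import Enum, auto

class FailoverState(Enum):
    NORMAL = auto()       # Normal Operation
    DETECT = auto()       # Failure Detection
    RESCHEDULE = auto()   # Karmada Reschedule
    DNS = auto()          # DNS Switchover
    RECOVERED = auto()    # Service Restored

# Allowed transitions, keyed by (current state, trigger), as in the diagram.
TRANSITIONS = {
    (FailoverState.NORMAL, "heartbeat_miss_over_30s"): FailoverState.DETECT,
    (FailoverState.DETECT, "mark_cluster_unhealthy"): FailoverState.RESCHEDULE,
    (FailoverState.RESCHEDULE, "update_geodns_records"): FailoverState.DNS,
    (FailoverState.DNS, "route_traffic_to_standby"): FailoverState.RECOVERED,
}

def step(state: FailoverState, trigger: str) -> FailoverState:
    """Advance the failover state machine; refuse undefined transitions."""
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"illegal transition: {trigger} from {state.name}")

if __name__ == "__main__":
    s = FailoverState.NORMAL
    for t in ("heartbeat_miss_over_30s", "mark_cluster_unhealthy",
              "update_geodns_records", "route_traffic_to_standby"):
        s = step(s, t)
        print(s.name)   # ends in RECOVERED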

Drill Scenario: Total Blackout in New York

Assuming the Destroyer (NY) cluster goes completely offline due to force majeure:

  1. Auto-Detection: The Tokyo control plane (Carrier) detects that the New York cluster's status has gone to Unknown within 30 seconds.
  2. Liveness Probe: The system triggers external ping tests against the New York edge gateways to confirm the outage is not merely control-plane network jitter.
  3. Evacuation Command: The operations team (or an automated operator) executes karmadactl cordon destroyer-ny to stop new workloads from being scheduled to the failed cluster (see the drill sketch after this list).
  4. Workload Migration: Karmada automatically scales up Deployment replicas in Aegis (DE) or other standby clusters.
  5. Storage Mounting: The standby cluster connects over WireGuard to the offsite data replica; if the primary storage is also offline, it mounts the asynchronous mirror replica instead.
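
Steps 2 and 3 of the drill can be scripted. The Python sketch below assumes karmadactl is on the operator's PATH and already pointed at the Karmada control plane; the gateway addresses are documentation placeholders (203.0.113.x), and the whole script is a drill sketch rather than our production automation.

import subprocess

# Hypothetical edge gateway addresses for the New York region (placeholders).
NY_EDGE_GATEWAYS = ["203.0.113.10", "203.0.113.11"]
FAILED_CLUSTER = "destroyer-ny"

def gateway_reachable(host: str, timeout_s: int = 3) -> bool:
    """Step 2: external ping test to rule out control-plane network jitter."""
    result = subprocess.run(
        ["ping", "-c", "3", "-W", str(timeout_s), host],
        capture_output=True,
    )
    return result.returncode == 0

def cordon_cluster(name: str) -> None:
    """Step 3: mark the failed cluster unschedulable via karmadactl."""
    subprocess.run(["karmadactl", "cordon", name], check=True)

def main() -> None:
    if any(gateway_reachable(gw) for gw in NY_EDGE_GATEWAYS):
        print("An NY edge gateway still answers; treating this as control-plane jitter, aborting evacuation.")
        return
    print(f"No NY edge gateway reachable; cordoning {FAILED_CLUSTER}.")
    cordon_cluster(FAILED_CLUSTER)
    # Step 4 (workload migration) is then handled by Karmada itself according
    # to the propagation policies; this script only confirms and cordons.

if __name__ == "__main__":
    main()

In a real drill, the cordon would be followed by checks that replicas have come up in Aegis (DE) and that GeoDNS has converged; those verifications are omitted here for brevity.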