High Availability & Resilience

In a distributed system spanning multiple continents and providers, failure is not a possibility—it is a certainty. Cyberun Cloud is architected with a "Defense in Depth" strategy, ensuring that failures are isolated, contained, and healed automatically.

Layer 1: Ingress Resilience (The Front Door)

What happens if a Load Balancer node fails?

  • Technology: Keepalived (VRRP)
  • Mechanism: We operate HAProxy nodes in pairs. The Master continuously sends VRRP advertisements; if the Backup stops hearing them, it instantly claims the Floating IP (VIP). A configuration sketch follows this list.
  • Impact: < 2 seconds of downtime. Existing TCP connections may reset, but the service remains reachable.
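To make Layer 1 concrete, here is a minimal keepalived.conf sketch for the Master node. The interface name, virtual_router_id, password, and VIP are illustrative placeholders, not values from our fleet:

```conf
vrrp_instance VI_1 {
    state MASTER              # the peer node runs "state BACKUP"
    interface eth0            # NIC carrying VRRP traffic (placeholder)
    virtual_router_id 51      # must match on both nodes
    priority 150              # the Backup uses a lower value, e.g. 100
    advert_int 1              # advertisement interval, in seconds
    authentication {
        auth_type PASS
        auth_pass s3cret      # placeholder shared secret
    }
    virtual_ipaddress {
        10.0.0.100/24         # the Floating IP (VIP) clients connect to
    }
}
```

The Backup node runs a mirror-image configuration with a lower priority; when advertisements stop arriving, it promotes itself and sends gratuitous ARP so traffic follows the VIP.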

Layer 2: Control Plane Resilience (The Brain)

What happens if the API Server becomes unreachable from the internet?

  • Technology: Localhost Proxy (Kubespray)
  • Mechanism: Every Worker node runs a lightweight Nginx proxy listening locally on 127.0.0.1:6443, which load-balances requests across all healthy Control Plane nodes (sketched below).
  • Impact: Even if the external Load Balancer (Layer 1) completely fails, internal cluster communications (Kubelet <-> API Server) continue uninterrupted. The cluster does not "fall apart."
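A sketch of what that local proxy looks like, in the spirit of Kubespray's nginx-proxy static pod; this is a fragment of nginx.conf, and the Control Plane IPs are placeholders:

```nginx
stream {
    upstream kube_apiserver {
        least_conn;
        server 10.0.1.10:6443;   # control plane node 1 (placeholder IPs)
        server 10.0.1.11:6443;   # control plane node 2
        server 10.0.1.12:6443;   # control plane node 3
    }
    server {
        listen 127.0.0.1:6443;       # kubelet talks to localhost only
        proxy_pass kube_apiserver;
        proxy_connect_timeout 1s;    # skip to a healthy node quickly
    }
}
```

Because each node's kubeconfig points at 127.0.0.1:6443, node-to-API-server traffic never depends on the external Load Balancer.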

Layer 3: Application Resilience (The Workload)

What happens if a Compute Node crashes?

  • Technology: Kubernetes Controller Manager & Karmada
  • Local Failure: The node controller marks the failed node NotReady and, once the eviction timeout expires, evicts its Pods; the scheduler places replacements on healthy nodes within the same cluster (see the first sketch after this list).
  • Regional Failure: If an entire cluster (e.g., Destroyer) goes dark, Karmada detects the cluster health change and propagates the workload to a standby cluster (e.g., Aegis) based on your failover policies (see the second sketch).
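For the local case, how quickly Pods leave a dead node is tunable per workload via tolerations; a Pod spec fragment (the 60-second value is an assumption, not a stated platform default):

```yaml
# Fragment of a Pod spec: evict this Pod 60s after its node stops responding
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
```

For the regional case, a hedged sketch of a Karmada PropagationPolicy; the workload name, timing, and purge mode are illustrative, while the cluster names come from the example above:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: web-failover            # hypothetical policy name
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: web                 # hypothetical workload
  placement:
    clusterAffinity:
      clusterNames:
        - destroyer             # primary cluster
        - aegis                 # standby cluster
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 300  # how long to tolerate an unhealthy cluster (assumption)
      purgeMode: Graciously     # drop the failed cluster's replicas only after replacements are ready
```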

Layer 4: Data Resilience (The Vault)

What happens if a Storage Drive dies?

  • Technology: Ceph Self-Healing
  • Mechanism: Data in Ceph is replicated three times (Replica=3) across different physical nodes.
  • Failure Scenario: When a drive fails, Ceph marks the OSD as down. It immediately begins "Backfilling"—reconstructing the missing data chunks on other available drives to restore full redundancy (see the commands after this list).
  • Impact: Zero data loss for a single-drive failure; data survives as long as at least one of the three replicas remains. Performance may degrade slightly during rebuilding, but full redundancy is restored automatically.
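A few standard Ceph commands for observing this behavior; the pool name rbd-pool is a placeholder:

```console
# Confirm the pool keeps three replicas
$ ceph osd pool get rbd-pool size
size: 3

# After a drive failure: the affected OSD shows as "down" in the tree,
# and cluster status reports recovery/backfill progress
$ ceph osd tree
$ ceph -s
```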

Disaster Recovery (DR) Strategy

For catastrophic scenarios (e.g., a data center fire), we implement:

  1. Volume Snapshots: Automated snapshots (via VolumeSnapshotClass) allow point-in-time recovery from logical corruption (e.g., accidental deletion); see the first sketch below.
  2. Offsite Backup: Critical state (etcd backups, PV snapshots) is shipped to S3-compatible object storage in a geographically separate region (e.g., from NY to Tokyo); see the second sketch below.
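A sketch of the snapshot plumbing on the Kubernetes side; the class name, snapshot name, and PVC are hypothetical, while rbd.csi.ceph.com is the standard Ceph CSI RBD driver name:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ceph-rbd-snapclass      # hypothetical class name
driver: rbd.csi.ceph.com        # Ceph CSI RBD driver
deletionPolicy: Retain          # keep the backing snapshot even if the API object is deleted
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-data-snap            # hypothetical snapshot
spec:
  volumeSnapshotClassName: ceph-rbd-snapclass
  source:
    persistentVolumeClaimName: pg-data   # PVC to snapshot (placeholder)
```

And a minimal offsite etcd backup, assuming an S3-compatible endpoint and bucket (both placeholders; TLS/auth flags for etcdctl omitted for brevity):

```console
$ ETCDCTL_API=3 etcdctl snapshot save /backups/etcd-$(date +%F).db
$ aws s3 cp /backups/etcd-$(date +%F).db \
    s3://dr-backups-tokyo/etcd/ --endpoint-url https://s3.tokyo.example.com
```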