High Availability & Resilience
In a distributed system spanning multiple continents and providers, failure is not a possibility—it is a certainty. Cyberun Cloud is architected with a "Defense in Depth" strategy, ensuring that failures are isolated, contained, and self-healed.
Layer 1: Ingress Resilience (The Front Door)
What happens if a Load Balancer node fails?
- Technology: Keepalived (VRRP)
- Mechanism: We operate HAProxy nodes in pairs. The Master continuously sends VRRP advertisements; if the Backup stops hearing them, it instantly claims the Floating IP (VIP). A minimal configuration sketch follows this list.
- Impact: < 2 seconds of downtime. Existing TCP connections may reset, but the service remains reachable.
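As a concrete illustration, a minimal keepalived.conf for the active node might look like the sketch below. The interface name, VIP, shared secret, and health check are assumptions rather than production values; the essential idea is that both nodes advertise the same virtual_router_id and the node with the higher effective priority holds the VIP.

```
# /etc/keepalived/keepalived.conf on the MASTER HAProxy node (illustrative values)
vrrp_script chk_haproxy {
    script "pidof haproxy"     # node is only eligible for the VIP while HAProxy runs
    interval 2
    weight -20                 # a failed check drops priority below the Backup's
}

vrrp_instance VI_1 {
    state MASTER               # the peer node uses "state BACKUP" with a lower priority
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1               # heartbeat (VRRP advertisement) every second
    authentication {
        auth_type PASS
        auth_pass s3cr3t       # placeholder shared secret
    }
    virtual_ipaddress {
        203.0.113.10/24        # the Floating IP (VIP) clients connect to
    }
    track_script {
        chk_haproxy
    }
}
```

The Backup node carries an almost identical file with state BACKUP and a lower priority, so it claims the VIP as soon as advertisements stop arriving.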
Layer 2: Control Plane Resilience (The Brain)
What happens if the API Server becomes unreachable from the internet?
- Technology: Localhost Proxy (Kubespray)
- Mechanism: Every Worker node runs a lightweight Nginx proxy that listens locally on 127.0.0.1:6443 and load-balances requests across all healthy Control Plane nodes (see the sketch after this list).
- Impact: Even if the external Load Balancer (Layer 1) fails completely, internal cluster communication (Kubelet <-> API Server) continues uninterrupted. The cluster does not "fall apart."
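The sketch below shows the shape of that per-node proxy configuration, in the spirit of the nginx-proxy static Pod deployed by Kubespray; the Control Plane addresses are placeholders and the timeouts are assumptions.

```nginx
# nginx.conf rendered on every Worker node (illustrative addresses and timeouts)
error_log stderr notice;
events {
    worker_connections 512;
}
stream {
    upstream kube_apiserver {
        least_conn;
        server 10.0.1.10:6443;    # Control Plane node 1
        server 10.0.2.10:6443;    # Control Plane node 2
        server 10.0.3.10:6443;    # Control Plane node 3
    }
    server {
        listen 127.0.0.1:6443;    # Kubelet and kube-proxy target this local endpoint
        proxy_pass kube_apiserver;
        proxy_connect_timeout 1s; # fail over quickly to the next healthy node
        proxy_timeout 10m;        # keep long-lived watch connections open
    }
}
```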
Layer 3: Application Resilience (The Workload)
What happens if a Compute Node crashes?
- Technology: Kubernetes Controller Manager & Karmada
- Local Failure: The node controller marks the node NotReady; its Pods are evicted and rescheduled onto healthy nodes within the same cluster.
- Regional Failure: If an entire cluster (e.g., Destroyer) goes dark, Karmada detects the change in cluster health and propagates the workload to a standby cluster (e.g., Aegis) according to your failover policies (see the policy sketch after this list).
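A hedged sketch of a Karmada PropagationPolicy for this pattern is shown below; the Deployment name, policy name, and cluster names are illustrative, and the exact failover semantics depend on your Karmada version and failover configuration. The spread constraint keeps the workload in a single cluster at a time, so when that cluster is marked unhealthy Karmada reschedules it to the remaining candidate.

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: web-failover                 # illustrative policy name
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: web                      # illustrative workload
  placement:
    clusterAffinity:
      clusterNames:
        - destroyer                  # primary cluster
        - aegis                      # standby cluster
    spreadConstraints:
      - spreadByField: cluster
        maxGroups: 1                 # run in exactly one cluster at a time
        minGroups: 1
```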
Layer 4: Data Resilience (The Vault)
What happens if a Storage Drive dies?
- Technology: Ceph Self-Healing
- Mechanism: Data in Ceph is replicated 3 times (Replica=3) across different physical nodes.
- Failure Scenario: When a drive fails, Ceph marks the OSD as down and immediately begins "backfilling": reconstructing the missing replicas on other available drives to restore full redundancy (see the pool sketch after this list).
- Impact: Zero Data Loss. Performance may degrade slightly while data is rebuilt, but as long as at least one replica of each object survives, nothing is lost and full redundancy is restored automatically.
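If the Ceph cluster is managed through Rook (an assumption for this sketch; the equivalent can be set with the ceph CLI), the Replica=3 layout corresponds to a pool definition like the one below, where failureDomain: host forces each copy onto a different physical node.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-pool              # illustrative pool name
  namespace: rook-ceph
spec:
  failureDomain: host                # never place two replicas on the same node
  replicated:
    size: 3                          # Replica=3: three copies of every object
    requireSafeReplicaSize: true     # refuse settings that cannot survive a failure
```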
Disaster Recovery (DR) Strategy
For catastrophic scenarios (e.g., a data center fire), we implement:
- Volume Snapshots: Automated snapshots (via VolumeSnapshotClass) allow point-in-time recovery from logical corruption (e.g., accidental deletion); a minimal example follows this list.
- Offsite Backup: Critical state (etcd backups, PV snapshots) is shipped to S3-compatible object storage in a geographically separate region (e.g., from NY to Tokyo).
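A minimal snapshot setup might look like the sketch below, assuming the Ceph RBD CSI driver deployed by Rook; the class name, secret references, and PVC name are illustrative.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ceph-rbd-snapclass                 # illustrative name
driver: rbd.csi.ceph.com                   # Ceph RBD CSI driver
deletionPolicy: Retain                     # keep the backing snapshot if the API object is deleted
parameters:
  clusterID: rook-ceph                     # assumption: Rook-managed cluster
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snap                       # illustrative snapshot name
spec:
  volumeSnapshotClassName: ceph-rbd-snapclass
  source:
    persistentVolumeClaimName: db-data     # the PVC to protect
```

Shipping these snapshots and the etcd backups to the offsite S3 bucket is handled by the backup tooling (for example, a scheduled etcdctl snapshot save job plus an object-storage uploader), not by the snapshot objects themselves.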