Zero-Downtime Maintenance
In the cloud era, "Maintenance Windows" should be an obsolete concept. Cyberun Cloud is architected with a singular goal: Infrastructure changes should never cause business interruption.
By combining automation with Kubernetes-native scheduling, we achieve full-stack zero-downtime maintenance.
OS Patching & Kernel Updates
stateDiagram-v2
direction TB
state "1. Active Node" as Active
state "2. Cordoned" as Cordon
state "3. Draining" as Drain
state "4. Reboot/Patch" as Maint
state "5. Health Check" as Check
Active --> Cordon : Mark Unschedulable
Cordon --> Drain : Evict Pods Gracefully
Drain --> Maint : Traffic Drained
Maint --> Check : Reboot Complete
Check --> Active : Rejoin Cluster
note right of Drain
Protected by PDB
Ensures Replicas >= 80%
end note
To fix underlying Linux kernel vulnerabilities (CVEs), physical nodes must be rebooted. How do we do this without impacting business?
- Cordon: Automation scripts first mark the target node as Unschedulable, preventing new Pods from being scheduled onto it.
- Drain: The system evicts the Pods on the node, and each container receives a SIGTERM signal.
- Graceful Termination: Your application gets 30 seconds (default) to finish current requests, close DB connections, and save state before it is forcibly killed.
- PDB Protection: We strictly adhere to PodDisruptionBudget, ensuring that the number of healthy replicas for a service never falls below a defined threshold (e.g., 80%).
- Rolling Reboot: We never reboot all nodes simultaneously. Ansible executes reboots sequentially, rack by rack, ensuring cluster capacity remains ample.
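The PDB threshold and the sequential reboot order can be made concrete with a little arithmetic. The sketch below is illustrative Python, not Cyberun's actual automation: the function names and the rack inventory are hypothetical, and it assumes the budget is expressed as a `minAvailable` percentage, which Kubernetes rounds up to a whole number of Pods.

```python
def min_available(replicas, percent):
    """Pods that must stay healthy for a percentage-based minAvailable (rounded up)."""
    return -((-replicas * percent) // 100)  # integer ceiling division


def allowed_disruptions(healthy, replicas, percent=80):
    """How many Pods a drain may evict right now without violating the budget."""
    return max(0, healthy - min_available(replicas, percent))


def reboot_order(racks):
    """Flatten racks into a strictly sequential node order, one rack after another."""
    return [node for rack in sorted(racks) for node in racks[rack]]


# Hypothetical inventory, used only for illustration.
racks = {"rack-b": ["node-3", "node-4"], "rack-a": ["node-1", "node-2"]}

print(allowed_disruptions(healthy=10, replicas=10))  # 2
print(allowed_disruptions(healthy=3, replicas=3))    # 0
print(reboot_order(racks))
```

Under this assumption, an 80% threshold only leaves room for eviction when a service runs at least five replicas. With fewer, the budget permits zero voluntary disruptions and the drain waits rather than evicting the Pod.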
Kubernetes Upgrades
Control plane upgrades are completely transparent to the user.
- Blue/Green Control Plane: When upgrading the Carrier cluster, we spin up replicas of the new API Server version. Traffic is seamlessly switched only after health checks pass.
- Compatibility Guarantee: We strictly follow an N-2 Version Policy, ensuring Storage Drivers (CSI) and Network Plugins (CNI) remain backward compatible during upgrades.
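One way to read the N-2 policy is as a version-skew check. The sketch below is an interpretation, not a statement of Cyberun's actual rule: it assumes a CSI or CNI plugin validated against Kubernetes minor version M stays supported while the cluster runs minor versions M through M+2. The helper names are hypothetical, and the major version is assumed to be the same on both sides.

```python
def minor(version):
    """Extract the minor number from a version such as 'v1.29.3' or '1.29'."""
    return int(version.lstrip("v").split(".")[1])


def within_n_minus_2(cluster_version, plugin_validated_for):
    """True if the plugin's validated minor is at most two behind the cluster's."""
    skew = minor(cluster_version) - minor(plugin_validated_for)
    return 0 <= skew <= 2


print(within_n_minus_2("v1.30.1", "v1.28.0"))  # True
print(within_n_minus_2("v1.30.1", "v1.27.5"))  # False
```

A check like this could run before an upgrade to flag plugins that would fall outside the supported window.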
Application Deployment (GitOps)
For user-deployed applications, Cyberun enforces a Rolling Update strategy via FluxCD by default:
strategy:
  rollingUpdate:
    maxSurge: 25%        # Allow up to 25% extra Pods to spin up the new version
    maxUnavailable: 0    # Do not allow any unavailability during the upgrade
This means that when deploying new code, old Pods are only terminated after the new Pods have passed their ReadinessProbe. Traffic is always routed only to healthy instances.
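To see what these two settings imply in Pod counts, the sketch below works through the rounding. It is illustrative Python with a hypothetical function name. It relies on the documented Kubernetes behavior that a percentage `maxSurge` is rounded up and a percentage `maxUnavailable` is rounded down.

```python
def surge_limits(replicas, max_surge_pct=25, max_unavailable_pct=0):
    """Return (max total Pods, min available Pods) during a rolling update."""
    surge = -((-replicas * max_surge_pct) // 100)          # rounded up
    unavailable = (replicas * max_unavailable_pct) // 100  # rounded down
    return replicas + surge, replicas - unavailable


for replicas in (4, 10):
    max_total, min_ready = surge_limits(replicas)
    print(f"{replicas} replicas: at most {max_total} Pods, at least {min_ready} available")
```

Because `maxUnavailable` is 0, the available count never drops below the desired replica count. The cluster therefore needs spare capacity for the surge Pods while a rollout is in progress.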