Zero-Downtime Maintenance

In the cloud era, "Maintenance Windows" should be an obsolete concept. Cyberun Cloud is architected with a singular goal: Infrastructure changes should never cause business interruption.

By combining automation with Kubernetes-native scheduling, we achieve full-stack zero-downtime maintenance.

OS Patching & Kernel Updates

stateDiagram-v2
    direction TB

    state "1. Active Node" as Active
    state "2. Cordoned" as Cordon
    state "3. Draining" as Drain
    state "4. Reboot/Patch" as Maint
    state "5. Health Check" as Check

    Active --> Cordon : Mark Unschedulable
    Cordon --> Drain : Evict Pods Gracefully
    Drain --> Maint : Traffic Drained
    Maint --> Check : Reboot Complete
    Check --> Active : Rejoin Cluster

    note right of Drain
        Protected by PDB
        Keeps healthy replicas >= 80%
    end note

To fix underlying Linux kernel vulnerabilities (CVEs), physical nodes must be rebooted. How do we do this without impacting business?

  1. Cordon: Automation scripts first mark the target node as Unschedulable, so no new Pods are scheduled onto it. Pods already running on the node keep serving traffic until they are drained.
  2. Drain: The system evicts the Pods on the node through the Kubernetes Eviction API; the containers in each evicted Pod then receive a SIGTERM signal.
    • Graceful Termination: Your application gets 30 seconds (the default terminationGracePeriodSeconds) to finish in-flight requests, close DB connections, and save state before it is forcibly killed with SIGKILL.
    • PDB Protection: Every eviction honors the service's PodDisruptionBudget, so a drain never takes the number of healthy replicas for a service below the defined threshold (e.g., 80%).
  3. Rolling Reboot: We never reboot all nodes simultaneously. Ansible executes reboots sequentially, rack by rack, so the remaining nodes always have enough capacity to host the evicted Pods.
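
PDB protection only applies to workloads that define a budget. As a minimal sketch, a PodDisruptionBudget matching the 80% example above might look like the following. The name checkout-api-pdb and the app: checkout-api label are hypothetical placeholders, not Cyberun defaults.

```yaml
# Hypothetical example: the name and label below are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api-pdb
spec:
  # At least 80% of the matching Pods must stay healthy during
  # voluntary disruptions such as a node drain.
  minAvailable: 80%
  selector:
    matchLabels:
      app: checkout-api
```

Kubernetes rounds a percentage minAvailable up to a whole Pod. With 10 replicas, 8 must stay healthy, so a drain can evict at most 2 of them at a time. With 3 replicas, 80% rounds up to 3, which leaves no room for eviction and blocks the drain, so small services are usually better served by an absolute value or by maxUnavailable: 1.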

Kubernetes Upgrades

Control plane upgrades are completely transparent to the user.

  • Blue/Green Control Plane: When upgrading the Carrier cluster, we first bring up API Server replicas running the new version. Traffic is switched to them only after they pass their health checks.
  • Compatibility Guarantee: We strictly follow an N-2 Version Policy, ensuring Storage Drivers (CSI) and Network Plugins (CNI) remain backward compatible during upgrades.

Application Deployment (GitOps)

For user-deployed applications, Cyberun enforces a Rolling Update strategy via FluxCD by default:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25% # Allow up to 25% extra Pods above the desired replica count while the new version starts
    maxUnavailable: 0 # Do not allow any Pod to become unavailable during the upgrade

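This means that when new code is deployed, an old Pod is terminated only after its replacement has passed its readiness probe (readinessProbe), so traffic is routed only to healthy instances. The guarantee depends on the application defining a readiness probe: without one, Kubernetes treats a Pod as ready as soon as its containers have started.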
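
To show how the strategy and the readiness probe fit together, here is a minimal sketch of a Deployment that uses both. The name, image, port, probe path, and probe timings are all hypothetical and would need to match your own application.

```yaml
# Hypothetical Deployment: name, image, port, and probe path are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  replicas: 8
  selector:
    matchLabels:
      app: checkout-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # 25% of 8 = 2 extra Pods, so at most 10 Pods exist at once
      maxUnavailable: 0   # the rollout never drops below 8 available Pods
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: app
          image: registry.example.com/checkout-api:1.2.3
          ports:
            - containerPort: 8080
          # A new Pod receives traffic, and lets an old Pod be removed,
          # only after this probe succeeds.
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
```

With these values the rollout proceeds in small steps: up to two new Pods start, each must pass /healthz, and only then are the same number of old Pods terminated.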