
6.0 - Cluster Maintenance

6.1 - OS Upgrades

  • Suppose you have a cluster with a few nodes and pods serving applications; what happens if one of these nodes goes down?
  • Associated pods are rendered inaccessible
  • Depending on the deployment method of these PODs, users may be impacted
  • If multiple replicas of the pod are spread across the cluster, users are uninterrupted, as the application is still accessible from the other nodes
  • Any pods running ONLY on that node however will experience downtime
  • Kubernetes waits for the node to come back online
  • If it comes back immediately, the kubelet restarts and the pods come back up
  • If it's not back online within 5 minutes (the default), Kubernetes considers its pods dead and terminates them
    • If part of a replicaset, the pods will be recreated on other nodes
  • The time the master waits for a node to come back online before evicting its pods is the pod eviction timeout
  • Can be set on the controller manager via: kube-controller-manager --pod-eviction-timeout=5m0s
    • 5m0s is the default; the value is a duration (minutes and seconds)
  • If the node comes back online after the timeout, it comes up as a blank node; any pods that were on it and not part of a replicaset will remain "gone"
  • Therefore, if maintenance is required on a node that is likely to come back within 5 minutes, and workloads on it are also available on other nodes, it's fine for it to be temporarily taken down for upgrades
  • There is no guarantee that it'll reboot within the 5 minutes
  • Nodes can be "drained", a process where their pods are gracefully terminated and recreated on other nodes
  • Done so via: kubectl drain <node name>
  • Node cordoned and made unschedulable
  • To uncordon node: kubectl uncordon <nodename>
  • To mark the node as unschedulable, run: kubectl cordon <nodename>
  • Cordoning doesn't terminate any preexisting pods, it just stops any more from being scheduled on the node
  • Note: May need the flags --ignore-daemonsets and/or --force (see the sketch below)
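
A minimal sketch of the maintenance workflow, assuming a node named node-1 (the name is illustrative):

kubectl drain node-1 --ignore-daemonsets    # evict pods, mark node unschedulable
# ... perform the OS upgrade and reboot the node ...
kubectl uncordon node-1                     # allow pods to be scheduled on it again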

6.2 - Kubernetes Software Versions

  • When installing a Kubernetes cluster, a specific version of Kubernetes is installed with it
  • Can be viewed via kubectl get nodes in the VERSION column (see the example after this list)
  • Release versions follow the process major.minor.patch
  • Kubernetes is regularly updated with new minor versions every couple of months
  • Alpha and beta versions also available
  • Alpha - Features disabled by default, likely to be buggy
  • Beta - Code tested, new features enabled by default
  • Stable release - Code tested, bugs fixed
  • Kubernetes releases are published as a tarball on GitHub; it contains all the required executables at the same version
  • Note: Some components within the control plane will not have the same version numbers and are released as separate files; ETCD cluster and CoreDNS servers being the main examples
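
A quick way to check the versions in play (the output values are illustrative):

kubectl get nodes           # VERSION column shows e.g. v1.19.0 per node
kubectl version --short     # client and server versions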

6.3 - Cluster Upgrade Process

  • The Kubernetes components don't all have to be at the same version
  • No component should be at a version higher than the kube-api server
  • If the Kube-API Server is at minor version X, the following version ranges are supported for the other components:
    • Controller manager: X-1
    • Kube-Scheduler: X-1
    • Kubelet: X-2
    • Kube-Proxy: X-2
    • Kubectl: X-1 to X+1
  • At any point, Kubernetes only supports the 3 most recent minor releases e.g. 1.19, 1.18 and 1.17
  • It's better to upgrade one minor version at a time, e.g. 1.17 → 1.18 → 1.19, rather than jumping several versions
  • Upgrade process = Cluster-Dependent
  • If on a cloud provider, built-in functionality available
  • If on kubeadm/manually created cluster, must use commands:
  • kubeadm upgrade plan
  • kubeadm upgrade apply
  • Cluster upgrades involve two steps:
  • Upgrade the master node
    • All management components go down temporarily during the process
    • Current node workloads keep running; they're only impacted if you try to manage them (e.g. via kubectl) while the components are down
  • Upgrade the worker nodes
    • Can be done all at once - Results in downtime
    • Can be done iteratively - Minimal downtime by draining nodes as they get upgraded one after another
    • Could also add new nodes with the most recent software versions and decommission the old ones
    • This proves especially convenient when on a cloud provider, where new nodes are easy to provision
  • Upgrading via Kubeadm:
  • kubeadm upgrade plan
    • Lists latest versions available
    • Components that must be upgraded manually
    • Command to upgrade kubeadm
  • Note: kubeadm itself must be upgraded first: apt-get upgrade -y kubeadm=1.12.0-00 (the version string is major.minor.patch-build)
  • Check upgrade success based on CLI output and kubectl get nodes
  • If kubelet is running on the master node, it must be upgraded next, then restart the service:
  • apt-get upgrade -y kubelet=1.12.0-00
  • systemctl restart kubelet
  • Upgrading the worker nodes:
  • Use the drain command to evict the current workloads onto other nodes, then upgrade the following for each node (SSH into each one):
    • Kubeadm
    • Kubelet
    • Node config: kubeadm upgrade node config --kubelet-version v1.12.0 (matching the control plane version)
  • Restart the service: systemctl restart kubelet
  • Make sure to uncordon each node after each upgrade! (A consolidated command sketch follows.)
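
A consolidated sketch of the sequence above, assuming a kubeadm cluster on a Debian/Ubuntu-based OS; the version 1.12.0-00 and node name node-1 are illustrative:

# On the master:
apt-get upgrade -y kubeadm=1.12.0-00
kubeadm upgrade plan
kubeadm upgrade apply v1.12.0
apt-get upgrade -y kubelet=1.12.0-00    # if kubelet runs on the master
systemctl restart kubelet

# For each worker (drain/uncordon run from the master):
kubectl drain node-1 --ignore-daemonsets
ssh node-1
apt-get upgrade -y kubeadm=1.12.0-00
apt-get upgrade -y kubelet=1.12.0-00
kubeadm upgrade node config --kubelet-version v1.12.0
systemctl restart kubelet
exit
kubectl uncordon node-1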

6.4 - Backup and Restore Methods

  • It's good practice to save resource configuration definition files
  • Kube-api server can be used to query all resources to get yaml files for each
  • E.g. kubectl get all --all-namespaces -o yaml > filename.yaml
  • The etcd cluster stores information about the state of the cluster, e.g. which nodes are part of it and which applications are running on them
  • When configuring the etcd, you can configure the data directory for the etcd data store via the --data-dir flag
  • You can take a snapshot of the etcd database using the etcdctl utility
  • To restore the cluster from the backup (a command sketch follows this list):
  • service kube-apiserver stop
  • etcdctl snapshot restore snapshot.db (with options such as --data-dir for the new location)
  • New data store directory created
  • The etcd service file must then be reconfigured for the new cluster token and data directory
  • Reload the daemon and restart the service
  • Backup candidates:
  • Kube-API Server query - Generally the more common method
  • ETCD Server
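
A minimal sketch of that restore sequence, assuming etcd runs as a systemd service; the paths are illustrative:

service kube-apiserver stop
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir /var/lib/etcd-from-backup
# Edit the etcd service file: point --data-dir at the new directory
# and set a new --initial-cluster-token, then:
systemctl daemon-reload
service etcd restart
service kube-apiserver start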

Disaster Recovery with ETCD in Kubernetes

Assuming ETCDCTL is installed, use it to take a snapshot. Make sure to specify the authentication flags, which ARE MANDATORY and whose values can all be found by examining the etcd pod:

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save PATH/TO/BACKUP/BACKUP.db
  • Suppose disaster happens:
  • Restore the snapshot to a new folder:
ETCDCTL_API=3 etcdctl --data-dir /var/lib/etcd-from-backup \
snapshot restore PATH/TO/BACKUP/BACKUP.db
  • Update the etcd pod's volume hostPath and the mount path for etcd-data to /var/lib/etcd-from-backup as appropriate, by editing the YAML file at /etc/kubernetes/manifests/etcd.yaml (an illustrative excerpt follows)
  • The etcd pod should automatically restart once this update is done, bringing back the pods stored in the backup along with it. (Use watch "docker ps | grep etcd" to track)
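
An illustrative excerpt of that edit in /etc/kubernetes/manifests/etcd.yaml (exact fields and values depend on your cluster):

  volumes:
  - hostPath:
      path: /var/lib/etcd-from-backup    # was /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data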

Working with ETCDCTL

  • For backup and restore purposes, make sure to set the ETCDCTL API to 3: export ETCDCTL_API=3
  • For taking a snapshot of the etcd cluster: etcdctl snapshot save -h and keep a note of the mandatory global options.
  • For a TLS-Enabled ETCD Database, the following are mandatory:
  • --cacert
  • --cert
  • --endpoints=[IP:PORT]
  • --key
  • Use the snapshot restore option to restore from a backup: etcdctl snapshot restore -h
  • Note options available and apply as appropriate
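
A quick way to sanity-check a snapshot after saving it (the path is illustrative):

ETCDCTL_API=3 etcdctl snapshot status /opt/BACKUP.db -w table
# Prints the snapshot's hash, revision, total number of keys and size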