
6.0 - Cluster Maintenance

6.1 - OS Upgrades

  • Suppose you have a cluster with a few nodes and pods serving applications; what happens if one of these nodes goes down?
  • Associated pods are rendered inaccessible
  • Depending on the deployment method of these PODs, users may be impacted
  • If multiple replicas of the pod are spread across the cluster, users are uninterrupted, as the application is still accessible from the other nodes
  • Any pods running ONLY on that node however will experience downtime
  • Kubernetes waits for the node to come back online
  • If it comes back immediately, the kubelet restarts and the pods come back up
  • If it's not back online within 5 minutes (the default), Kubernetes considers its pods dead and terminates them
    • If part of a replicaset, the pods will be recreated on other nodes
  • The time the master waits for a node to come back online before evicting its pods is the pod eviction timeout
  • Can be set on the controller manager via: kube-controller-manager --pod-eviction-timeout=5m0s
    • 5m0s is the default; the value is a duration (minutes and seconds)
  • If the node comes back online after the timeout, it comes up as a blank node; any pods that were on it and not part of a replicaset will remain "gone"
  • Therefore, if maintenance is required on a node that is likely to come back within 5 minutes, and workloads on it are also available on other nodes, it's fine for it to be temporarily taken down for upgrades
  • There is no guarantee that it'll reboot within the 5 minutes
  • Nodes can be "drained", a process where their pods are gracefully terminated and recreated on other nodes
  • Done so via: kubectl drain <node name>
  • Node cordoned and made unschedulable
  • To uncordon node: kubectl uncordon <nodename>
  • To mark the node as unschedulable, run: kubectl cordon <nodename>
  • Cordoning doesn't terminate any preexisting pods, it just stops any more from being scheduled on the node
  • Note: May need the flags --ignore-daemonsets and/or --force (see the sketch below)
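
A minimal sketch of the maintenance workflow, assuming a node named node-1 (the name is illustrative):

kubectl drain node-1 --ignore-daemonsets    # evict pods, mark node unschedulable
# ... perform the OS upgrade and reboot the node ...
kubectl uncordon node-1                     # allow pods to be scheduled on it again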

6.2 - Kubernetes Software Versions

  • When installing a Kubernetes cluster, a specific version of Kubernetes is installed with it
  • Can be viewed via kubectl get nodes in the VERSION column (see the example after this list)
  • Release versions follow the process major.minor.patch
  • Kubernetes is regularly updated with new minor versions every couple of months
  • Alpha and beta versions also available
  • Alpha - Features disabled by default, likely to be buggy
  • Beta - Code tested, new features enabled by default
  • Stable release - Code tested, bugs fixed
  • Kubernetes releases are published as a tarball on GitHub; it contains all the required executables at the same version
  • Note: Some components within the control plane will not have the same version numbers and are released as separate files; ETCD cluster and CoreDNS servers being the main examples
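
A quick way to check the versions in play (the output values are illustrative):

kubectl get nodes           # VERSION column shows e.g. v1.19.0 per node
kubectl version --short     # client and server versions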

6.3 - Cluster Upgrade Process

  • The Kubernetes components don't all have to be at the same version
  • No component should be at a version higher than the kube-api server
  • If the Kube-API Server is at minor version X, the following version ranges are supported for the other components:
    • Controller manager: X-1
    • Kube-Scheduler: X-1
    • Kubelet: X-2
    • Kube-Proxy: X-2
    • Kubectl: X-1 to X+1
  • At any point, Kubernetes only supports the 3 most recent minor releases e.g. 1.19, 1.18 and 1.17
  • It's better to upgrade one minor version at a time, e.g. 1.17 → 1.18 → 1.19, rather than jumping several versions
  • Upgrade process = Cluster-Dependent
  • If on a cloud provider, built-in functionality available
  • If on kubeadm/manually created cluster, must use commands:
  • kubeadm upgrade plan
  • kubeadm upgrade apply
  • Cluster upgrades involve two steps:
  • Upgrade the master node
    • All management components go down temporarily during the process
    • Current node workloads keep running; they're only impacted if you try to manage them (e.g. via kubectl) while the components are down
  • Upgrade the worker nodes
    • Can be done all at once - Results in downtime
    • Can be done iteratively - Minimal downtime by draining nodes as they get upgraded one after another
    • Could also add new nodes with the most recent software versions and decommission the old ones
    • This proves especially convenient when on a cloud provider, where new nodes are easy to provision
  • Upgrading via Kubeadm:
  • kubeadm upgrade plan
    • Lists latest versions available
    • Components that must be upgraded manually
    • Command to upgrade kubeadm
  • Note: kubeadm itself must be upgraded first: apt-get upgrade -y kubeadm=1.12.0-00 (the version string is major.minor.patch-build)
  • Check upgrade success based on CLI output and kubectl get nodes
  • If kubelet is running on the master node, it must be upgraded next, then restart the service:
  • apt-get upgrade -y kubelet=1.12.0-00
  • systemctl restart kubelet
  • Upgrading the worker nodes:
  • Use the drain command to evict the current workloads onto other nodes, then upgrade the following for each node (SSH into each one):
    • Kubeadm
    • Kubelet
    • Node config: kubeadm upgrade node config --kubelet-version v1.12.0 (matching the control plane version)
  • Restart the service: systemctl restart kubelet
  • Make sure to uncordon each node after each upgrade! (A consolidated command sketch follows.)
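
A consolidated sketch of the sequence above, assuming a kubeadm cluster on a Debian/Ubuntu-based OS; the version 1.12.0-00 and node name node-1 are illustrative:

# On the master:
apt-get upgrade -y kubeadm=1.12.0-00
kubeadm upgrade plan
kubeadm upgrade apply v1.12.0
apt-get upgrade -y kubelet=1.12.0-00    # if kubelet runs on the master
systemctl restart kubelet

# For each worker (drain/uncordon run from the master):
kubectl drain node-1 --ignore-daemonsets
ssh node-1
apt-get upgrade -y kubeadm=1.12.0-00
apt-get upgrade -y kubelet=1.12.0-00
kubeadm upgrade node config --kubelet-version v1.12.0
systemctl restart kubelet
exit
kubectl uncordon node-1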

6.4 - Backup and Restore Methods

  • It's good practice to save resource configuration definition files
  • Kube-api server can be used to query all resources to get yaml files for each
  • E.g. kubectl get all --all-namespaces -o yaml > filename.yaml
  • The etcd cluster stores information about the state of the cluster, e.g. which nodes are part of it and which applications are running on them
  • When configuring the etcd, you can configure the data directory for the etcd data store via the --data-dir flag
  • You can take a snapshot of the etcd database using the etcdctl utility
  • To restore the cluster from the backup (a command sketch follows this list):
  • service kube-apiserver stop
  • etcdctl snapshot restore snapshot.db (with options such as --data-dir for the new location)
  • New data store directory created
  • The etcd service file must then be reconfigured for the new cluster token and data directory
  • Reload the daemon and restart the service
  • Backup candidates:
  • Kube-API Server query - Generally the more common method
  • ETCD Server
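
A minimal sketch of that restore sequence, assuming etcd runs as a systemd service; the paths are illustrative:

service kube-apiserver stop
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir /var/lib/etcd-from-backup
# Edit the etcd service file: point --data-dir at the new directory
# and set a new --initial-cluster-token, then:
systemctl daemon-reload
service etcd restart
service kube-apiserver start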

Disaster Recovery with ETCD in Kubernetes

Assuming ETCDCTL is installed, use it to take a snapshot. Make sure to specify the authentication flags, which ARE MANDATORY and whose values can all be found by examining the etcd pod:

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save PATH/TO/BACKUP/BACKUP.db
  • Suppose disaster happens:
  • Restore the snapshot to a new folder:
ETCDCTL_API=3 etcdctl --data-dir /var/lib/etcd-from-backup \
snapshot restore PATH/TO/BACKUP/BACKUP.db
  • Update the etcd pod's volume hostPath and the mount path for etcd-data to /var/lib/etcd-from-backup as appropriate, by editing the YAML file at /etc/kubernetes/manifests/etcd.yaml (an illustrative excerpt follows)
  • The etcd pod should automatically restart once this update is done, bringing back the pods stored in the backup along with it. (Use watch "docker ps | grep etcd" to track)
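
An illustrative excerpt of that edit in /etc/kubernetes/manifests/etcd.yaml (exact fields and values depend on your cluster):

  volumes:
  - hostPath:
      path: /var/lib/etcd-from-backup    # was /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data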

Working with ETCDCTL

  • For backup and restore purposes, make sure to set the ETCDCTL API to 3: export ETCDCTL_API=3
  • For taking a snapshot of the etcd cluster: etcdctl snapshot save -h and keep a note of the mandatory global options.
  • For a TLS-Enabled ETCD Database, the following are mandatory:
  • --cacert
  • --cert
  • --endpoints=[IP:PORT]
  • --key
  • Use the snapshot restore option to restore from a backup: etcdctl snapshot restore -h
  • Note options available and apply as appropriate
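
A quick way to sanity-check a snapshot after saving it (the path is illustrative):

ETCDCTL_API=3 etcdctl snapshot status /opt/BACKUP.db -w table
# Prints the snapshot's hash, revision, total number of keys and size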