
3.0 - Scheduling

3.1 - Manual Scheduling

  • When a pod is made available for scheduling, the scheduler looks at the pod's definition for the value of the nodeName field
  • By default, the nodeName value isn't set and is added automatically during scheduling
  • The scheduler looks at all pods currently on the system and checks whether the nodeName field has been set; any that do not have it set are candidates for scheduling.
  • The scheduler identifies the best candidate node using its algorithm and schedules the pod onto that node by adding the node's name to the nodeName field.
  • Setting the nodeName field value binds the pod to the node.
  • If there is no scheduler to monitor and schedule the pods, they will remain in a pending state.
  • Pods can be manually assigned to nodes if a scheduler isn't present.
  • This is done by manually setting the pod's nodeName value in the definition file.
  • This can only be done when the pod is first created; it cannot be changed for a pre-existing pod.
  • To configure, add the field nodeName: <nodename> as a child of the pod's spec (see the example pod definition at the end of this section).
  • Alternatively, you can assign a node by creating a Binding object definition file and sending a POST request to the pod's binding API, mimicking the scheduler's actions:
apiVersion: v1
kind: Binding
metadata:
  name: nginx
target:
  apiVersion: v1
  kind: Node
  name: node02
  • Once the binding definition file is written, a POST request can be sent to the pod's binding API, with the data set to the binding object in JSON format, similar to:
curl --header "Content-Type: application/json" --request POST \
  --data '{"apiVersion": "v1", "kind": "Binding", ...}' \
  http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding
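  • For reference, a minimal pod definition with the nodeName field set manually might look like the following sketch (the pod and node names here are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeName: node02   # binds the pod to node02 at creation time, bypassing the scheduler
  containers:
  - name: nginx
    image: nginx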

3.2 - Labels and Selectors

  • Built-In Kubernetes features used to help distinguish objects of similar nature from one another by grouping them
  • Labels are added under the metadata section, where any number of labels can be added to the Kubernetes object in key-value format
  • To filter objects with labels, use the kubectl get command and add the flag --selector followed by the key-value pair in the format key=value
kubectl get <object> --selector key=value
  • Selectors are used to link objects to one another; for example, when writing a replica set definition file, use the selector in the spec to specify the labels the replica set should look for in the pods it manages (see the snippet after this list).
  • The same can be applied to services to help identify what pods or deployments it is exposing.
  • Annotations - Used to record data associated with the object for integration purposes e.g. version number, build name etc
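  • As a sketch of the selector-to-label link described above (the names and label values are assumptions), a replica set ties its selector to the labels in its pod template:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend   # must match the labels in the pod template below
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: nginx
        image: nginx

  • The managed pods can then be filtered with: kubectl get pods --selector app=frontend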

3.3 - Taints and Tolerations

  • Used to set restrictions regarding what pods can be scheduled on a node.
  • Consider a cluster of 3 nodes with 4 pods preparing for launch:
  • The scheduler will place the pods across all nodes equally if no restriction applies

  • Suppose now only 1 node has resources available to run a particular application:

  • A taint can be applied to the node in question, preventing any unwanted pods from being scheduled on it.
  • Tolerations then need to be applied to the pod(s) that should run on Node 1.

  • Pods can only run on a node if their tolerations match the taint applied to the node.

  • Taints and tolerations allow the scheduler to allocate pods to required nodes, such that all resources are used and allocated accordingly.

  • Note: By default, no tolerations are applied to pods.

Taints - Node

  • To apply a taint: kubectl taint nodes <nodename> key=value:<taint-effect> (concrete example after this list)
  • The key-value pair defined could match labels defined for resources, e.g. app=frontend
  • The taint effect determines what happens to pods that are intolerant of the taint; one of three effects can be specified:
  • NoSchedule - Pods won't be scheduled.
  • PreferNoSchedule - Try to avoid scheduling if possible.
  • NoExecute - New pods won't be scheduled, and any pre-existing pods intolerant to the taint are stopped and evicted.
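  • As a concrete sketch of the taint command above (the node name and key-value pair are assumptions):

kubectl taint nodes node01 app=blue:NoSchedule
# verify the taint was applied
kubectl describe node node01 | grep Taint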

Tolerations - Pod

  • To apply a toleration to a pod, edit its definition file
  • In the spec section, add something similar to the following:
tolerations:
- key: app
  operator: "Equal"
  value: "blue"
  effect: "NoSchedule"
  • Be sure to apply the same values used when applying the taint to the node.
  • All values added need to be enclosed in double quotes.

Taint - NoExecute

  • Suppose Node1 is to be used for a particular application:
  • Apply a taint to node 1 with the app name and add a toleration to the pod running the app.
  • Setting the taint effect to NoExecute causes existing pods on the node that are intolerant to be stopped and evicted.

  • Taints and tolerations are only used to restrict pod access to nodes.

  • As there are no restrictions / taints applied to the other nodes, there's a chance the app could still be placed on a different node.
  • If wanting the pod to go to a particular node, one can utilise node affinity.

  • Note: A taint is automatically applied to the master node, such that no pods can be scheduled to it.

  • View it via kubectl describe node kubemaster | grep Taint
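  • The exact taint key depends on the cluster version; on kubeadm clusters the output typically looks something like:

kubectl describe node kubemaster | grep Taint
# Taints:             node-role.kubernetes.io/control-plane:NoSchedule   (older versions use node-role.kubernetes.io/master)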

3.4 - Node Selectors

  • Consider a 3-node cluster, with 1 node having a larger resource configuration:
  • In this scenario, one would like the task/process requiring more resources to go to the larger node.
  • To solve this, limitations can be placed on pods so they run only on particular nodes
  • This can be done via the nodeSelector property in the definition file:
nodeSelector:
  size: Large
  • NodeSelectors require the node to be labelled: kubectl label nodes <node name> <label key>=<key value>

  • When the pod is created, it will be assigned to the labelled node, so long as the node's resources allow it (full example below).
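  • Putting the above together, a minimal sketch might look like this (the node name, label value and image are assumptions):

kubectl label nodes node01 size=Large

apiVersion: v1
kind: Pod
metadata:
  name: data-processor
spec:
  containers:
  - name: data-processor
    image: data-processor
  nodeSelector:
    size: Large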

Limitations of NodeSelectors

  • NodeSelectors are beneficial for simple allocation tasks, but if more complex allocation is needed, e.g. "go to either one of two nodes", Node Affinity is recommended.

3.5 - Node Affinity

  • Node affinity looks to ensure that pods are hosted on the desired nodes
  • Can ensure high-resource consumption jobs are allocated to high-resource nodes

  • Node affinity allows more complex capabilities regarding pod-node limitation.

  • To specify node affinity, add a new field in the spec section of the pod definition file:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size
          operator: In
          values:
          - Large
  • Note: For the example above, the NotIn operator could also be used to avoid particular nodes.
  • Note: If a pod just needs to go to any node with a particular label, regardless of its value, use the Exists operator -> no values are required in this case (see the snippets after this list).

  • Additional operators are available, with further details provided in the documentation.

  • In the event that no node satisfies the affinity rules (e.g. due to a label mismatch), the resulting action depends on the node affinity type set.
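  • As a sketch of the NotIn operator mentioned above (the key and value are assumptions):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size
          operator: NotIn   # avoid nodes labelled size=Small
          values:
          - Small

  • And with Exists, the values list is omitted entirely:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size
          operator: Exists   # matches any node carrying the size label, whatever its value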

Node Affinity Types

  • Defines the scheduler's behaviour regarding Node Affinity and pod lifecycle stages

  • 2 main types available:

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution

  • Other types, such as requiredDuringSchedulingRequiredDuringExecution, are planned for future releases

  • Considering the 2 available types, the names break down into the 2 stages of the pod lifecycle:

  • DuringScheduling -> The pod has been created for the first time and has not yet been placed on a node
  • DuringExecution -> The pod is already running on a node

  • If the node isn't available according to the NodeAffinity, the resultant action is dependent upon the NodeAffinity type:

  • Required:

  • Pod must be placed on a node that satisfies the node affinity criteria
  • If no node satisfies the criteria, the pod won't be scheduled
  • Generally used when the node placement is crucial

  • Preferred:

  • Used if the pod placement is less important than the need for running the task
  • If a matching node is not found, the scheduler ignores the node affinity rules
  • The pod is placed on any available node (a preferred example with a weight is sketched at the end of this section)

  • Suppose a pod has been running and a change is made to the Node Affinity:

  • The response is determined by the prefix of DuringExecution:
    • Ignored:
    • Pods continue to run
    • Any changes in Node Affinity will have no effect once scheduled.
    • Required:
    • When applied, any running pods that don't meet the node affinity requirements are evicted.
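  • As a sketch of the preferred type (the weight, key and value are assumptions); note that each preferred rule carries a weight between 1 and 100:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: size
          operator: In
          values:
          - Large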

3.6 - Taints and Tolerations vs Node Affinity

  • Consider a 5-node cluster setup:
  • Blue Node: Runs the blue pod
  • Red Node: Runs the red pod
  • Green Node: runs the green pod
  • Node 1: To run the grey pod
  • Node 2: To run the grey pod

  • Apply a taint to each of the coloured nodes so that each accepts only its respective pod

  • Tolerations are then applied to the pods

  • A taint would also need to be applied to Node 1 and Node 2, as the coloured pods can still be allocated to nodes where they're not wanted.

  • To overcome, use Node Affinity:

  • Label nodes with respective colours
  • Pods end up on the correct nodes via the node affinity selectors.

  • However, node affinity alone doesn't prevent the other (unwanted) pods from being allocated to the coloured nodes.

  • A combination of taints and tolerations, and node affinity must be used.

  • Apply taints and tolerations to prevent unwanted pods from being placed on the coloured nodes
  • Use node affinity to prevent the coloured pods from being placed on the incorrect nodes.

3.7 - Resource Limits

  • Consider a 3-node setup, each node having a set amount of resources available, i.e.:
  • CPU
  • Memory
  • Disk Space

  • The Kubernetes scheduler is responsible for allocating pods to nodes

  • To do so, it takes into account the node's current resource allocation and the resources requested by the pod.
  • If no node has sufficient resources available, the scheduler holds the pod back, leaving it in a pending state
  • Kubernetes automatically assumes a pod or container within a pod will require at least:
  • 0.5 CPU Units
  • 256Mi Memory

  • If the pod or container requires more resources than allocated above, one can configure the pod definition file's spec, in particular, add the following under the containers list:

resources:
  requests:
    memory: "1Gi"
    cpu: 1

Resource - CPU

  • Can be set as low as 1m (one millicpu, i.e. 0.001 CPU) and as high as required / supported by the host system.
  • 1 CPU = 1 AWS vCPU = 1 GCP Core = 1 Azure Core = 1 Hyperthread

Resources - Memory

  • Memory can be allocated using any of the following suffixes, depending on the given purpose and the system's capabilities:
Memory Metric   Shorthand Notation   Equivalency
Gigabyte        G                    1000M
Megabyte        M                    1000K
Kilobyte        K                    1000 bytes
Gibibyte        Gi                   1024Mi
Mebibyte        Mi                   1024Ki
Kibibyte        Ki                   1024 bytes
  • By default, a Docker container has no limit on the resources it can consume
  • When running on a Kubernetes node, however, the container is limited to a maximum of 1 vCPU by default - if the limits need changing, update the pod definition file:
resources:
  requests:
    ....
  limits:
    memory: <value and unit>
    cpu: <number>
  • Requests and limits are set per container within the pod (a combined example follows)
  • If the CPU limit is exceeded, CPU usage is "throttled" on the node
  • If the memory limit is exceeded repeatedly, the pod is terminated.
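  • As a combined sketch of requests and limits under the containers list (the values are assumptions, not recommendations):

resources:
  requests:
    memory: "1Gi"
    cpu: 1
  limits:
    memory: "2Gi"
    cpu: 2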

Default Resource Requirements and Limits

  • By default, Kubernetes assumes containers request 0.5 units of CPU and 256Mi of memory
  • These defaults can be configured per namespace within the Kubernetes cluster by setting a LimitRange, which can be produced via a yaml definition file similar to:
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range # keep memory and CPU limit ranges separate
spec:
  limits:
  - default:
      memory: 512Mi
      cpu: 1
    defaultRequest:
      memory: 256Mi
      cpu: 0.5
    type: Container
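  • To create and inspect the LimitRange above, one might run (the file name is an assumption):

kubectl create -f limit-range.yaml
kubectl describe limitrange mem-limit-range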

Pod / Deployment Editing

  • When editing an existing pod, only the following aspects can be edited in the spec:
  • Image (for containers and initContainers)
  • activeDeadlineSeconds
  • Tolerations
  • Aspects such as environment variables, service accounts and resource limits cannot be edited easily, but there are ways to do it:
  • Editing the specification:
    ■ Run kubectl edit pod <podname> and edit the appropriate features
    ■ When saving the changes, if the feature cannot be edited for an existing pod, the changes will be denied
    ■ A copy of the definition file with the changes will be saved to a temporary location, which can be used to recreate the pod with the changes once the current version is deleted
  • Extracting and editing the yaml definition file (command sequence sketched at the end of this section):
    ■ Run kubectl get pod <podname> -o yaml > <filename.yaml>
    ■ Make the desired changes to the yaml file and delete the current version of the pod
    ■ Create the pod again with the file
  • When editing the deployment, any aspect of its underlying pods can be edited as the pod template is a child of the deployment spec
  • When changes are made, the deployment will automatically delete and create new pods to apply the updates as appropriate
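  • As a sketch of the extract-edit-recreate workflow described above (the pod and file names are assumptions):

kubectl get pod my-app-pod -o yaml > my-app-pod.yaml
# edit my-app-pod.yaml with the desired changes, then:
kubectl delete pod my-app-pod
kubectl create -f my-app-pod.yaml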

3.8 - DaemonSets

  • DaemonSets are similar in nature to ReplicaSets; they help deploy multiple instances of a pod
  • Daemonsets run only one instance of the pod per node
  • Whenever a new node is added, the pod is automatically added to the node and vice versa for when the node is removed
  • Use cases of Daemonsets include monitoring and logging agents
  • Removes the need for manually deploying one of these pods to any new nodes within the cluster
  • Kubernetes components such as kube-proxy can be deployed as a DaemonSet, as one pod is required per node
    ■ Networking solutions can similarly be deployed as a DaemonSet
  • Daemonsets can be deployed via a definition file, it's similar in structure to that of a Replicaset, with the only difference being the Kind
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-daemon
spec:
  selector:
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
    spec:
      containers:
      - name: monitoring-agent
        image: monitoring-agent
  • To view daemonsets, use the kubectl get daemonsets command
  • You can view the details of the DaemonSet with the kubectl describe command, i.e. kubectl describe daemonset <daemonset name> (a create-and-inspect sequence is sketched at the end of this section)
  • Prior to Kubernetes v1.12, a DaemonSet worked by manually setting the nodeName for each pod to be allocated, thus bypassing the scheduler
  • From Kubernetes v1.12 onwards, the DaemonSet uses the default scheduler and the node affinity rules discussed previously to place one pod on each node
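  • To create and inspect the DaemonSet defined above, one might run (the file name is an assumption):

kubectl create -f daemon-set-definition.yaml
kubectl get daemonsets
kubectl describe daemonset monitoring-daemon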

3.9 - Static Pods

  • Kubelet relies on the kube-apiserver for instructions on what pods to load on its respective node
  • The instructions come from decisions made by the kube-scheduler, which are stored in the etcd datastore
  • Considerations must be made if any of the kube-apiserver, scheduler or etcd server are not present; as a worst-case scenario, suppose none of them are available
  • Kubelet is capable of managing a node independently to an extent
  • The only thing the kubelet knows how to do is create pods; however, in this scenario there's no API server to feed it instructions based on yaml definition files
  • To work around this, you can configure the kubelet to read the pod definition files from a directory on the server designated to store information about pods
  • Once configured, the Kubelet will periodically check the directory for any new files, where it reads the information and creates pods based on the information provided
  • In addition to creating the pods, the kubelet would take actions to ensure the pod remains running, i.e.:
  • if a pod crashes, kubelet will attempt to restart it
  • if any changes are made to any files within the directory, the kubelet will recreate the pod to apply the changes
  • Pods created in this manner, without the intervention of the API server or any other aspects of a kubernetes cluster, are Static Pods
  • Note: Only pods could be created in this manner, objects such as Deployments and Replicasets cannot be created in this manner
  • To configure the designated folder the kubelet looks in for pod definition files, add the following option to the kubelet service file kubelet.service; the directory can be any directory on the system: --pod-manifest-path=/etc/kubernetes/manifests
  • Alternatively, you can create a config yaml file specifying the path, i.e. staticPodPath: /etc/kubernetes/manifests, and reference it by adding --config=/path/to/yaml/file to the service file (sketch at the end of this section)
  • Note: this is the approach kubeadm uses
  • Once static pods are created, they can be viewed by docker ps (can't use kubectl due to no api server)
  • It should be noted that even if the api server is present, both static pods and traditional pods can be created
  • The api server is made aware of the static pods because when the kubelet is part of a cluster and creates static pods, it creates a mirror object in the kube api server
  • You can read details of the pod but cannot make changes via the kubectl edit command, only via the actual manifest
  • Note: the name of the pod is automatically appended with the name of the node it's assigned to
  • Because static pods are independent of the Kubernetes control plane, they can be used to deploy the control plane components themselves as pods on a node
  • Install kubelet on all the master nodes
  • Create pod definition files that use docker images of the various control plane components (api server, controller, etcd server etc)
  • Place the definition files in the chosen manifests folder
  • The pods will be deployed via the kubelet functionality
  • Note: By doing this, you don't have to download the associated binaries, configure services or worry about services crashing
  • In the event that any of these pods crash, they will automatically be restarted by the kubelet, since they are static pods
  • Note: To delete a static pod, you have to delete the yaml file associated with it from the path configured
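  • As a sketch of the kubeadm-style config approach described above (the kubelet config path is an assumption; only the relevant field is shown):

# /var/lib/kubelet/config.yaml (kubelet config file referenced by --config)
staticPodPath: /etc/kubernetes/manifests

# any pod manifest dropped into that directory becomes a static pod, e.g.
# /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx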

Static Pods vs Daemonsets

Static Pods                                               DaemonSets
Created by the kubelet                                    Created via the kube-apiserver (DaemonSet controller)
Used to deploy control plane components as static pods    Used to deploy monitoring and logging agents on nodes
Ignored by the kube-scheduler                             Ignored by the kube-scheduler

3.10 - Multiple Schedulers

  • The default scheduler has its own algorithm that takes into account variables such as taints and tolerations and node affinity to distribute pods across nodes
  • In the event that advanced conditions must be met for scheduling, such as placing particular components on specific nodes after performing a task, the default scheduler falls down
  • To get around this, Kubernetes allows you to write your own scheduling algorithm to be deployed as the new default scheduler or an additional scheduler
  • Via this, the default scheduler still runs for all usual purposes, but for the particular task, the new scheduler takes over
  • You can have as many schedulers as you like for a cluster
  • When creating a pod or deployment, you can specify Kubernetes to use a particular scheduler
  • When downloading the binary for kube-scheduler, there is an option in the kube-scheduler.service file that can be configured: --scheduler-name
  • Scheduler name is set to default-scheduler if not specified
  • To deploy an additional scheduler, you can use the same kube-scheduler binary or use one built by yourself
  • In either case, the two schedulers will run as their own services
  • It goes without saying that the two schedulers should have separate names for differentiation purposes
  • If a cluster has been created via kubeadm, schedulers are deployed via yaml definition files, which you can copy to create additional schedulers
  • Note: customise the scheduler name with the --scheduler-name flag
  • Note: The leader-elect option should be used when you have multiple copies of the scheduler running on different master nodes
  • Usually observed in a high-availability setup where there are multiple master nodes running the kube-scheduler process
  • If multiple copies of the same scheduler are running on different nodes, only one can be active at a time
  • The leader-elect option helps choose a leading scheduler for activities; to get multiple schedulers working, do the following:
    ■ If you don't have multiple master nodes running, set leader-elect to false
    ■ If you do have multiple masters, set an additional parameter to specify a lock object name
  • This differentiates the new custom scheduler from the default during the leader election process
  • The custom scheduler can then be created using the definition file and deployed to the kube-system namespace
  • From here, pods can be created and configured to be scheduled by a particular scheduler by adding the schedulerName field to their definition file (example at the end of this section)
  • Note: Any pods created in this manner will remain in a pending state if the custom scheduler wasn't configured correctly
  • To confirm the correct scheduler picked the pod up, use kubectl get events
  • To view the logs associated with the scheduler, run: kubectl logs <scheduler name> --namespace=kube-system
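  • As a sketch of pointing a pod at a custom scheduler (the scheduler and pod names are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  schedulerName: my-custom-scheduler   # must match the name the custom scheduler was started with
  containers:
  - name: nginx
    image: nginx

  • Then confirm which scheduler bound the pod via kubectl get events -o wide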

3.11 - Configuring Scheduler Profiles

  • Schedulers can be configured manually or set up via kubeadm
  • Additional schedulers can be created via yaml files, which can then be configured with a name and, for high-availability setups, a leader-election setting to identify the "leader" among the schedulers (a minimal sketch follows)
  • Advanced options are available, but are outside the scope of the course
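  • For reference, a minimal sketch of a scheduler configuration file with a named profile and leader election (the scheduler name is an assumption; the apiVersion may be a beta version on older clusters):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-custom-scheduler
leaderElection:
  leaderElect: false   # set to true only when multiple copies run across master nodes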