
3.0 - Scheduling

3.1 - Manual Scheduling

  • When a pod is made available for scheduling, the scheduler looks at the pod's definition for the value of the nodeName field
  • By default, the nodeName value isn't set and is added automatically during scheduling
  • The scheduler looks at all pods currently on the system and checks whether the nodeName field has been set; any that do not have it set are candidates for scheduling.
  • The scheduler identifies the best candidate node using its algorithm and schedules the pod onto that node by adding the node's name to the nodeName field.
  • Setting the nodeName field value binds the pod to the node.
  • If there is no scheduler to monitor and schedule the pods, they will remain in a pending state.
  • Pods can be manually assigned to nodes if a scheduler isn't present.
  • This is done by manually setting the pod's nodeName value in the definition file.
  • This can only be done when the pod is first created; it cannot be changed for a pre-existing pod.
  • To configure, add the field nodeName: <nodename> as a child of the pod's spec (see the example pod definition at the end of this section).
  • Alternatively, you can assign a node by creating a Binding object definition file and sending a POST request to the pod's binding API, mimicking the scheduler's actions:
apiVersion: v1
kind: Binding
metadata:
  name: nginx
target:
  apiVersion: v1
  kind: Node
  name: node02
  • Once the binding definition file is written, a POST request can be sent to the pod's binding API, with the data set to the binding object in JSON format, similar to:
curl --header "Content-Type: application/json" --request POST \
  --data '{"apiVersion": "v1", "kind": "Binding", ...}' \
  http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding
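  • For reference, a minimal pod definition with the nodeName field set manually might look like the following sketch (the pod and node names here are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeName: node02   # binds the pod to node02 at creation time, bypassing the scheduler
  containers:
  - name: nginx
    image: nginx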

3.2 - Labels and Selectors

  • Built-In Kubernetes features used to help distinguish objects of similar nature from one another by grouping them
  • Labels are added under the metadata section, where any number of labels can be added to the Kubernetes object in key-value format
  • To filter objects with labels, use the kubectl get command and add the flag --selector followed by the key-value pair in the format key=value
kubectl get <object> --selector key=value
  • Selectors are used to link objects to one another; for example, when writing a replica set definition file, use the selector in the spec to specify the labels the replica set should look for in the pods it manages (see the snippet after this list).
  • The same can be applied to services to help identify what pods or deployments it is exposing.
  • Annotations - Used to record data associated with the object for integration purposes e.g. version number, build name etc
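  • As a sketch of the selector-to-label link described above (the names and label values are assumptions), a replica set ties its selector to the labels in its pod template:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend   # must match the labels in the pod template below
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: nginx
        image: nginx

  • The managed pods can then be filtered with: kubectl get pods --selector app=frontend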

3.3 - Taints and Tolerations

  • Used to set restrictions regarding what pods can be scheduled on a node.
  • Consider a cluster of 3 nodes with 4 pods preparing for launch:
  • The scheduler will place the pods across all nodes equally if no restriction applies

  • Suppose now only 1 node has resources available to run a particular application:

  • A taint can be applied to the node in question, preventing any unwanted pods from being scheduled on it.
  • Tolerations then need to be applied to the pod(s) that should run on Node 1.

  • Pods can only run on a node if their tolerations match the taint applied to the node.

  • Taints and tolerations allow the scheduler to allocate pods to required nodes, such that all resources are used and allocated accordingly.

  • Note: By default, no tolerations are applied to pods.

Taints - Node

  • To apply a taint: kubectl taint nodes <nodename> key=value:<taint-effect> (concrete example after this list)
  • The key-value pair defined could match labels defined for resources, e.g. app=frontend
  • The taint effect determines what happens to pods that are intolerant of the taint; one of three effects can be specified:
  • NoSchedule - Pods won't be scheduled.
  • PreferNoSchedule - Try to avoid scheduling if possible.
  • NoExecute - New pods won't be scheduled, and any pre-existing pods intolerant to the taint are stopped and evicted.
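  • As a concrete sketch of the taint command above (the node name and key-value pair are assumptions):

kubectl taint nodes node01 app=blue:NoSchedule
# verify the taint was applied
kubectl describe node node01 | grep Taint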

Tolerations - Pod

  • To apply a toleration to a pod, edit its definition file
  • In the spec section, add something similar to the following:
tolerations:
- key: app
  operator: "Equal"
  value: "blue"
  effect: "NoSchedule"
  • Be sure to apply the same values used when applying the taint to the node.
  • All values added need to be enclosed in double quotes.

Taint - NoExecute

  • Suppose Node1 is to be used for a particular application:
  • Apply a taint to node 1 with the app name and add a toleration to the pod running the app.
  • Setting the taint effect to NoExecute causes existing pods on the node that are intolerant to be stopped and evicted.

  • Taints and tolerations are only used to restrict pod access to nodes.

  • As there are no restrictions / taints applied to the other nodes, there's a chance the app could still be placed on a different node.
  • If wanting the pod to go to a particular node, one can utilise node affinity.

  • Note: A taint is automatically applied to the master node, such that no pods can be scheduled to it.

  • View it via kubectl describe node kubemaster | grep Taint
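  • The exact taint key depends on the cluster version; on kubeadm clusters the output typically looks something like:

kubectl describe node kubemaster | grep Taint
# Taints:             node-role.kubernetes.io/control-plane:NoSchedule   (older versions use node-role.kubernetes.io/master)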

3.4 - Node Selectors

  • Consider a 3-node cluster, with 1 node having a larger resource configuration:
  • In this scenario, one would like the task/process requiring more resources to go to the larger node.
  • To solve this, limitations can be placed on pods so they run only on particular nodes
  • This can be done via the nodeSelector property in the definition file:
nodeSelector:
  size: Large
  • NodeSelectors require the node to be labelled: kubectl label nodes <node name> <label key>=<key value>

  • When the pod is created, it will be assigned to the labelled node, so long as the node's resources allow it (full example below).
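  • Putting the above together, a minimal sketch might look like this (the node name, label value and image are assumptions):

kubectl label nodes node01 size=Large

apiVersion: v1
kind: Pod
metadata:
  name: data-processor
spec:
  containers:
  - name: data-processor
    image: data-processor
  nodeSelector:
    size: Large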

Limitations of NodeSelectors

  • NodeSelectors are beneficial for simple allocation tasks, but if more complex allocation is needed, e.g. "go to either one of two nodes", Node Affinity is recommended.

3.5 - Node Affinity

  • Node affinity looks to ensure that pods are hosted on the desired nodes
  • Can ensure high-resource consumption jobs are allocated to high-resource nodes

  • Node affinity allows more complex capabilities regarding pod-node limitation.

  • To specify node affinity, add a new field in the spec section of the pod definition file:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size
          operator: In
          values:
          - Large
  • Note: For the example above, the NotIn operator could also be used to avoid particular nodes.
  • Note: If a pod just needs to go to any node with a particular label, regardless of its value, use the Exists operator -> no values are required in this case (see the snippets after this list).

  • Additional operators are available, with further details provided in the documentation.

  • In the event that no node satisfies the affinity rules (e.g. due to a label mismatch), the resulting action depends on the node affinity type set.
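  • As a sketch of the NotIn operator mentioned above (the key and value are assumptions):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size
          operator: NotIn   # avoid nodes labelled size=Small
          values:
          - Small

  • And with Exists, the values list is omitted entirely:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size
          operator: Exists   # matches any node carrying the size label, whatever its value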

Node Affinity Types

  • Defines the scheduler's behaviour regarding Node Affinity and pod lifecycle stages

  • 2 main types available:

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution

  • Other types, such as requiredDuringSchedulingRequiredDuringExecution, are planned for future releases

  • Considering the 2 available types, the names break down into the 2 stages of the pod lifecycle:

  • DuringScheduling -> The pod has been created for the first time and has not yet been placed on a node
  • DuringExecution -> The pod is already running on a node

  • If the node isn't available according to the NodeAffinity, the resultant action is dependent upon the NodeAffinity type:

  • Required:

  • Pod must be placed on a node that satisfies the node affinity criteria
  • If no node satisfies the criteria, the pod won't be scheduled
  • Generally used when the node placement is crucial

  • Preferred:

  • Used if the pod placement is less important than the need for running the task
  • If a matching node is not found, the scheduler ignores the node affinity rules
  • The pod is placed on any available node (a preferred example with a weight is sketched at the end of this section)

  • Suppose a pod has been running and a change is made to the Node Affinity:

  • The response is determined by the prefix of DuringExecution:
    • Ignored:
    • Pods continue to run
    • Any changes in Node Affinity will have no effect once scheduled.
    • Required:
    • When applied, any running pods that don't meet the node affinity requirements are evicted.
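  • As a sketch of the preferred type (the weight, key and value are assumptions); note that each preferred rule carries a weight between 1 and 100:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: size
          operator: In
          values:
          - Large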

3.6 - Taints and Tolerations vs Node Affinity

  • Consider a 5-node cluster setup:
  • Blue Node: Runs the blue pod
  • Red Node: Runs the red pod
  • Green Node: runs the green pod
  • Node 1: To run the grey pod
  • Node 2: To run the grey pod

  • Apply a taint to each of the coloured nodes so that each accepts only its respective pod

  • Tolerations are then applied to the pods

  • A taint would also need to be applied to Node 1 and Node 2, as the coloured pods can still be allocated to nodes where they're not wanted.

  • To overcome, use Node Affinity:

  • Label nodes with respective colours
  • Pods end up on the correct nodes via the node affinity selectors.

  • However, node affinity alone doesn't prevent the other (unwanted) pods from being allocated to the coloured nodes.

  • A combination of taints and tolerations, and node affinity must be used.

  • Apply taints and tolerations to prevent unwanted pods from being placed on the coloured nodes
  • Use node affinity to prevent the coloured pods from being placed on the incorrect nodes.

3.7 - Resource Limits

  • Consider a 3-node setup, each node having a set amount of resources available, i.e.:
  • CPU
  • Memory
  • Disk Space

  • The Kubernetes scheduler is responsible for allocating pods to nodes

  • To do so, it takes into account the node's current resource allocation and the resources requested by the pod.
  • If no node has sufficient resources available, the scheduler holds the pod back, leaving it in a pending state
  • Kubernetes automatically assumes a pod or container within a pod will require at least:
  • 0.5 CPU Units
  • 256Mi Memory

  • If the pod or container requires more resources than allocated above, one can configure the pod definition file's spec, in particular, add the following under the containers list:

resources:
  requests:
    memory: "1Gi"
    cpu: 1

Resource - CPU

  • Can be set as low as 1m (one millicpu, i.e. 0.001 CPU) and as high as required / supported by the host system.
  • 1 CPU = 1 AWS vCPU = 1 GCP Core = 1 Azure Core = 1 Hyperthread

Resources - Memory

  • Memory can be allocated using any of the following suffixes, depending on the given purpose and the system's capabilities:
Memory Metric   Shorthand Notation   Equivalency
Gigabyte        G                    1000M
Megabyte        M                    1000K
Kilobyte        K                    1000 bytes
Gibibyte        Gi                   1024Mi
Mebibyte        Mi                   1024Ki
Kibibyte        Ki                   1024 bytes
  • By default, a Docker container has no limit on the resources it can consume
  • When running on a Kubernetes node, however, the container is limited to a maximum of 1 vCPU by default - if the limits need changing, update the pod definition file:
resources:
  requests:
    ....
  limits:
    memory: <value and unit>
    cpu: <number>
  • Requests and limits are set per container within the pod (a combined example follows)
  • If the CPU limit is exceeded, CPU usage is "throttled" on the node
  • If the memory limit is exceeded repeatedly, the pod is terminated.
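  • As a combined sketch of requests and limits under the containers list (the values are assumptions, not recommendations):

resources:
  requests:
    memory: "1Gi"
    cpu: 1
  limits:
    memory: "2Gi"
    cpu: 2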

Default Resource Requirements and Limits

  • By default, Kubernetes assumes containers request 0.5 units of CPU and 256Mi of memory
  • These defaults can be configured per namespace within the Kubernetes cluster by setting a LimitRange, which can be produced via a yaml definition file similar to:
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range # keep memory and CPU limit ranges separate
spec:
  limits:
  - default:
      memory: 512Mi
      cpu: 1
    defaultRequest:
      memory: 256Mi
      cpu: 0.5
    type: Container
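  • To create and inspect the LimitRange above, one might run (the file name is an assumption):

kubectl create -f limit-range.yaml
kubectl describe limitrange mem-limit-range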

Pod / Deployment Editing

  • When editing an existing pod, only the following aspects can be edited in the spec:
  • Image (for containers and initContainers)
  • activeDeadlineSeconds
  • Tolerations
  • Aspects such as environment variables, service accounts and resource limits cannot be edited easily, but there are ways to do it:
  • Editing the specification:
    ■ Run kubectl edit pod <podname> and edit the appropriate features
    ■ When saving the changes, if the feature cannot be edited for an existing pod, the changes will be denied
    ■ A copy of the definition file with the changes will be saved to a temporary location, which can be used to recreate the pod with the changes once the current version is deleted
  • Extracting and editing the yaml definition file (command sequence sketched at the end of this section):
    ■ Run kubectl get pod <podname> -o yaml > <filename.yaml>
    ■ Make the desired changes to the yaml file and delete the current version of the pod
    ■ Create the pod again with the file
  • When editing the deployment, any aspect of its underlying pods can be edited as the pod template is a child of the deployment spec
  • When changes are made, the deployment will automatically delete and create new pods to apply the updates as appropriate
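  • As a sketch of the extract-edit-recreate workflow described above (the pod and file names are assumptions):

kubectl get pod my-app-pod -o yaml > my-app-pod.yaml
# edit my-app-pod.yaml with the desired changes, then:
kubectl delete pod my-app-pod
kubectl create -f my-app-pod.yaml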

3.8 - DaemonSets

  • DaemonSets are similar in nature to ReplicaSets; they help deploy multiple instances of a pod
  • Daemonsets run only one instance of the pod per node
  • Whenever a new node is added, the pod is automatically added to the node and vice versa for when the node is removed
  • Use cases of Daemonsets include monitoring and logging agents
  • Removes the need for manually deploying one of these pods to any new nodes within the cluster
  • Kubernetes components such as kube-proxy can be deployed as a DaemonSet, as one pod is required per node
    ■ Networking solutions can similarly be deployed as a DaemonSet
  • Daemonsets can be deployed via a definition file, it's similar in structure to that of a Replicaset, with the only difference being the Kind
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-daemon
spec:
  selector:
    matchLabels:
      app: monitoring-agent
  template:
    metadata:
      labels:
        app: monitoring-agent
    spec:
      containers:
      - name: monitoring-agent
        image: monitoring-agent
  • To view daemonsets, use the kubectl get daemonsets command
  • You can view the details of the DaemonSet with the kubectl describe command, i.e. kubectl describe daemonset <daemonset name> (a create-and-inspect sequence is sketched at the end of this section)
  • Prior to Kubernetes v1.12, a DaemonSet worked by manually setting the nodeName for each pod to be allocated, thus bypassing the scheduler
  • From Kubernetes v1.12 onwards, the DaemonSet uses the default scheduler and the node affinity rules discussed previously to place one pod on each node
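  • To create and inspect the DaemonSet defined above, one might run (the file name is an assumption):

kubectl create -f daemon-set-definition.yaml
kubectl get daemonsets
kubectl describe daemonset monitoring-daemon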

3.9 - Static Pods

  • Kubelet relies on the kube-apiserver for instructions on what pods to load on its respective node
  • The instructions come from decisions made by the kube-scheduler, which are stored in the etcd datastore
  • Considerations must be made if any of the kube-apiserver, scheduler or etcd server are not present; as a worst-case scenario, suppose none of them are available
  • Kubelet is capable of managing a node independently to an extent
  • The only thing the kubelet knows how to do is create pods; however, in this scenario there's no API server to feed it instructions based on yaml definition files
  • To work around this, you can configure the kubelet to read the pod definition files from a directory on the server designated to store information about pods
  • Once configured, the Kubelet will periodically check the directory for any new files, where it reads the information and creates pods based on the information provided
  • In addition to creating the pods, the kubelet would take actions to ensure the pod remains running, i.e.:
  • if a pod crashes, kubelet will attempt to restart it
  • if any changes are made to any files within the directory, the kubelet will recreate the pod to apply the changes
  • Pods created in this manner, without the intervention of the API server or any other aspects of a kubernetes cluster, are Static Pods
  • Note: Only pods could be created in this manner, objects such as Deployments and Replicasets cannot be created in this manner
  • To configure the designated folder the kubelet looks in for pod definition files, add the following option to the kubelet service file kubelet.service; the directory can be any directory on the system: --pod-manifest-path=/etc/kubernetes/manifests
  • Alternatively, you can create a config yaml file specifying the path, i.e. staticPodPath: /etc/kubernetes/manifests, and reference it by adding --config=/path/to/yaml/file to the service file (sketch at the end of this section)
  • Note: this is the approach kubeadm uses
  • Once static pods are created, they can be viewed by docker ps (can't use kubectl due to no api server)
  • It should be noted that even if the api server is present, both static pods and traditional pods can be created
  • The api server is made aware of the static pods because when the kubelet is part of a cluster and creates static pods, it creates a mirror object in the kube api server
  • You can read details of the pod but cannot make changes via the kubectl edit command, only via the actual manifest
  • Note: the name of the pod is automatically appended with the name of the node it's assigned to
  • Because static pods are independent of the Kubernetes control plane, they can be used to deploy the control plane components themselves as pods on a node
  • Install kubelet on all the master nodes
  • Create pod definition files that use docker images of the various control plane components (api server, controller, etcd server etc)
  • Place the definition files in the chosen manifests folder
  • The pods will be deployed via the kubelet functionality
  • Note: By doing this, you don't have to download the associated binaries, configure services or worry about services crashing
  • In the event that any of these pods crash, they will automatically be restarted by the kubelet, since they are static pods
  • Note: To delete a static pod, you have to delete the yaml file associated with it from the path configured
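  • As a sketch of the kubeadm-style config approach described above (the kubelet config path is an assumption; only the relevant field is shown):

# /var/lib/kubelet/config.yaml (kubelet config file referenced by --config)
staticPodPath: /etc/kubernetes/manifests

# any pod manifest dropped into that directory becomes a static pod, e.g.
# /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx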

Static Pods vs Daemonsets

Static Pods                                               DaemonSets
Created by the kubelet                                    Created via the kube-apiserver (DaemonSet controller)
Used to deploy control plane components as static pods    Used to deploy monitoring and logging agents on nodes
Ignored by the kube-scheduler                             Ignored by the kube-scheduler

3.10 - Multiple Schedulers

  • The default scheduler has its own algorithm that takes into account variables such as taints and tolerations and node affinity to distribute pods across nodes
  • In the event that advanced conditions must be met for scheduling, such as placing particular components on specific nodes after performing a task, the default scheduler falls down
  • To get around this, Kubernetes allows you to write your own scheduling algorithm to be deployed as the new default scheduler or an additional scheduler
  • Via this, the default scheduler still runs for all usual purposes, but for the particular task, the new scheduler takes over
  • You can have as many schedulers as you like for a cluster
  • When creating a pod or deployment, you can specify Kubernetes to use a particular scheduler
  • When downloading the binary for kube-scheduler, there is an option in the kube-scheduler.service file that can be configured: --scheduler-name
  • Scheduler name is set to default-scheduler if not specified
  • To deploy an additional scheduler, you can use the same kube-scheduler binary or use one built by yourself
  • In either case, the two schedulers will run as their own services
  • It goes without saying that the two schedulers should have separate names for differentiation purposes
  • If a cluster has been created via kubeadm, schedulers are deployed via yaml definition files, which you can copy to create additional schedulers
  • Note: customise the scheduler name with the --scheduler-name flag
  • Note: The leader-elect option should be used when you have multiple copies of the scheduler running on different master nodes
  • Usually observed in a high-availability setup where there are multiple master nodes running the kube-scheduler process
  • If multiple copies of the same scheduler are running on different nodes, only one can be active at a time
  • The leader-elect option helps choose a leading scheduler for activities; to get multiple schedulers working, do the following:
    ■ If you don't have multiple master nodes running, set leader-elect to false
    ■ If you do have multiple masters, set an additional parameter to specify a lock object name
  • This differentiates the new custom scheduler from the default during the leader election process
  • The custom scheduler can then be created using the definition file and deployed to the kube-system namespace
  • From here, pods can be created and configured to be scheduled by a particular scheduler by adding the schedulerName field to their definition file (example at the end of this section)
  • Note: Any pods created in this manner will remain in a pending state if the custom scheduler wasn't configured correctly
  • To confirm the correct scheduler picked the pod up, use kubectl get events
  • To view the logs associated with the scheduler, run: kubectl logs <scheduler name> --namespace=kube-system
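  • As a sketch of pointing a pod at a custom scheduler (the scheduler and pod names are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  schedulerName: my-custom-scheduler   # must match the name the custom scheduler was started with
  containers:
  - name: nginx
    image: nginx

  • Then confirm which scheduler bound the pod via kubectl get events -o wide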

3.11 - Configuring Scheduler Profiles

  • Schedulers can be configured manually or set up via kubeadm
  • Additional schedulers can be created via yaml files, which can then be configured with a name and, for high-availability setups, a leader-election setting to identify the "leader" among the schedulers (a minimal sketch follows)
  • Advanced options are available, but are outside the scope of the course
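  • For reference, a minimal sketch of a scheduler configuration file with a named profile and leader election (the scheduler name is an assumption; the apiVersion may be a beta version on older clusters):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-custom-scheduler
leaderElection:
  leaderElect: false   # set to true only when multiple copies run across master nodes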