3.0 - Configuration¶
3.1 - Prerequisites: Commands and Arguments in Docker¶
- Note: This is not a requirement for the CKAD curriculum.
- Consider a simple scenario - run a container from the Ubuntu image:
  `docker run ubuntu`
- This runs an instance of the Ubuntu image, which exits immediately - as can be seen upon execution of `docker ps -a`.
- This occurs because containers aren't designed to host an OS, but instead to run a specific task or process.
  - Example: hosting a web server or database.
  - So long as that process stays active, so does the container.
- If the process stops or crashes, the container exits.
- A Dockerfile with `CMD ["bash"]` defined doesn't keep the container alive, as `bash` is a shell rather than a long-running process.
  - When the container runs, it runs Ubuntu and launches `bash`.
  - In general, Docker doesn't attach a terminal to a container when it's run.
  - `bash` cannot find a terminal, so the process finishes and the container exits.
- To solve this, one can append a command to `docker run`, e.g. `docker run ubuntu sleep 5`
- Similarly, in a Dockerfile: `CMD <command> <param1>`, or in JSON format: `CMD ["command", "parameter"]`
- To build the new image: `docker build -t <image name> .`
- To run it: `docker run <image name>`
- To use the command with a parameter value subject to change, change `CMD` to `ENTRYPOINT`, i.e.: `ENTRYPOINT ["command"]`
- Any parameters specified on the CLI will automatically be appended to the entrypoint command.
- If using an entrypoint and a command parameter isn't specified, an error is likely to occur - a default value should therefore be provided.
- Therefore, `ENTRYPOINT` and `CMD` should be used together. Example:

  ```dockerfile
  ENTRYPOINT ["command"]
  CMD ["parameter"]
  ```

- With this configuration, if no additional parameter is provided on the CLI, the `CMD` parameter is used as the default.
- Any parameter specified on the CLI will override the `CMD` parameter.
- To override the entrypoint itself: `docker run --entrypoint <new command> <image name>`
- Note: `ENTRYPOINT` and `CMD` values should be expressed in JSON format.
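A concrete sketch of the pattern (the `ubuntu-sleeper` image name is illustrative): an image that sleeps for 5 seconds by default, or for however long the CLI specifies:

```dockerfile
FROM ubuntu

# Fixed command - CLI arguments are appended to this
ENTRYPOINT ["sleep"]

# Default parameter, used only when no CLI argument is given
CMD ["5"]
```

With this, `docker run ubuntu-sleeper` sleeps for 5 seconds, whilst `docker run ubuntu-sleeper 10` sleeps for 10.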
3.2 - Commands and Arguments in Kubernetes¶
- The Ubuntu sleeper image can be defined in a YAML file for Kubernetes similar to the following:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: <pod name>
  spec:
    containers:
    - name: <container name>
      image: <image name>:<image tag>
      command: ["<command>"]
      args: ["<arg1>"]
  ```
- To append anything to the `docker run` command, add the `args` attribute to the container spec.
- The pod can then be created through standard means, such as `kubectl create -f <filename>.yaml`
- The Dockerfile's `ENTRYPOINT` is overridden by the `command` attribute, and its `CMD` by the `args` attribute. A concrete sketch follows below.
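A minimal instance, assuming the hypothetical `ubuntu-sleeper` image from section 3.1:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleeper-pod
spec:
  containers:
  - name: ubuntu-sleeper
    image: ubuntu-sleeper   # hypothetical image from section 3.1
    command: ["sleep"]      # overrides the image's ENTRYPOINT
    args: ["10"]            # overrides the image's CMD
```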
- Note: You cannot edit the specification of a pre-existing pod aside from:
  - `spec.containers[*].image`
  - `spec.initContainers[*].image`
  - `spec.activeDeadlineSeconds`
  - `spec.tolerations`
- Environment variables, in general, cannot be edited, nor can service accounts and resource limits.
- If editing of these parameters is required, 2 methods are advised:
  - Method 1:
    - `kubectl edit pod <pod name>` and edit the desired properties.
    - As outlined above, certain properties cannot be edited whilst a pod is "live" - if this is attempted, the change is rejected and a copy of the edited YAML is saved to a temporary definition file.
    - Delete the original pod: `kubectl delete pod <pod name>`
    - Recreate the pod from the temporary definition file: `kubectl create -f <temp filename>.yaml`
  - Method 2:
    - Extract the pod's YAML via `kubectl get pod <pod name> -o yaml > <filename>.yaml`
    - Open the extracted YAML and edit as appropriate, e.g. `vim <filename>.yaml`
    - Delete the original pod and recreate it, as in the last 2 steps of method 1.
- Note: If part of a deployment, change any instance of `pod` in the commands above to `deployment`.
  - Any changes to deployments are automatically applied to the pods within.
3.5 - Environment Variables¶
- For a given definition file, one can set environment variables via the `env:` field in the container spec.
- Each environment variable is an array entry, with a name and value associated:

  ```yaml
  env:
  - name: <ENV NAME>
    value: <VALUE>
  ```
- Environment variables may be set in 1 of 3 ways (primarily):
  - Key-value pairs (above)
  - ConfigMaps
  - Secrets
- The latter two are implemented in a similar manner to the following respective examples:

  ```yaml
  env:
  - name: <ENV NAME>
    valueFrom:
      configMapKeyRef:
  ```

  ```yaml
  env:
  - name: <ENV NAME>
    valueFrom:
      secretKeyRef:
  ```
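To put the plain key-value form in context, a minimal pod sketch (the pod and image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp-color    # hypothetical name
spec:
  containers:
  - name: simple-webapp-color
    image: simple-webapp-color # hypothetical image
    env:
    - name: APP_COLOR
      value: pink
```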
3.6 - ConfigMaps¶
- When there are multiple pod definition files, it becomes difficult to manage environment data.
- This information can be removed from the definition files and managed centrally via ConfigMaps.
  - Used to pass configuration data as key-value pairs in Kubernetes.
- When a pod is created, one can inject the ConfigMap into the pod.
  - The key-value pairs are then available as environment variables for the application within the pod.
- Configuring ConfigMaps involves 2 phases:
  - Create the ConfigMap.
  - Inject it into the pod.
- Creation is achieved through standard means, imperatively:
  `kubectl create configmap <configmap name>`
- Or declaratively, if a YAML file exists:
  `kubectl create -f <filename>.yaml`
- Via the first method, one can pass in key-value pairs directly:
  `kubectl create configmap <configmap name> --from-literal=<key>=<value> --from-literal=<key2>=<value2>`
- Multiple uses of the `--from-literal=<key>=<value>` flag allow multiple variables to be added.
  - Note: This becomes difficult to manage when too many config items are present.
- One can also create ConfigMaps from a file, e.g.
  `kubectl create configmap <configmap name> --from-file=/path/to/file`
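For example, the `app-config` map defined declaratively in the next subsection could equally be created imperatively:

```bash
kubectl create configmap app-config \
  --from-literal=APP_COLOR=blue \
  --from-literal=APP_MODE=prod
```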
Creating a ConfigMap via Declaration¶
- Create a definition file:

  ```yaml
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: app-config
  data:
    APP_COLOR: blue
    APP_MODE: prod
  ```
- To create from the above: `kubectl create -f <filename>.yaml`
- One can create as many ConfigMaps as required - just ensure they are named appropriately.
- View ConfigMaps via `kubectl get configmaps`
- Get detailed information on a ConfigMap via `kubectl describe configmap <configmap name>`
- Configuring a pod with a ConfigMap:
  - In the pod definition file, under the container spec, add the `envFrom:` list property.
  - Each item in the resultant list corresponds to a ConfigMap.
- Example usage:

  ```yaml
  envFrom:
  - configMapRef:
      name: app-config
  ```

- Apply both the ConfigMap and pod definitions with `kubectl create -f <filename>.yaml`
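Putting it together - a minimal pod sketch consuming the `app-config` ConfigMap from above (pod and image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp-color    # hypothetical name
spec:
  containers:
  - name: simple-webapp-color
    image: simple-webapp-color # hypothetical image
    envFrom:
    - configMapRef:
        name: app-config       # the ConfigMap created above
```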
Summary¶
- ConfigMaps can be used to inject environment variables into pods.
- One could also inject the data as a file or via a volume.
env¶

```yaml
envFrom:
- configMapRef:
    name: <configmap name>
```

single env¶

```yaml
env:
- name: <env name>
  valueFrom:
    configMapKeyRef:
      name: <configmap name>
      key: <configmap key name>
```

Volumes¶

```yaml
volumes:
- name: app-config-volume
  configMap:
    name: app-config
```
3.8 - Secrets¶
- Consider a simple Python server:
  - The hostname, username, and password are hardcoded - bad practice and a high security risk.
- Based on the previous discussion, it would seem better to store this data in a ConfigMap - the problem, though, is that ConfigMap data is stored in plaintext.
  - Not applicable for sensitive info like passwords.
- Variables like usernames and passwords are better stored as secrets in Kubernetes.
- These are similar to ConfigMaps, but the values are stored in an encoded format.
- Analogous to ConfigMaps, there are 2 steps:
  - Create the secret.
  - Inject it into a pod.
- Secret creation is achieved either imperatively or declaratively:
  - Declarative: Use a YAML definition file to "declare" the desired configuration.
  - Imperative: Use the `kubectl create secret` command and let Kubernetes figure out the desired configuration.
Imperative Secret Creation¶
- `kubectl create secret generic <secret name> --from-literal=<key>=<value>`
- As with ConfigMaps, data can be specified from the CLI in key-value pairs via the `--from-literal` flag, used multiple times.
- Example:
  `kubectl create secret generic app-secret --from-literal=DB_HOST=mysql --from-literal=DB_USER=root --from-literal=DB_PASSWORD=password`
- For larger amounts of secret data, the values can be imported from a file via the `--from-file` flag.
- Example:
  `kubectl create secret generic app-secret --from-file=app-secret.properties`
Declarative Secret Creation¶
- Using a YAML definition file:

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: app-secret
  data:
    DB_HOST: mysql
    DB_USER: root
    DB_PASSWORD: password
  ```
- As discussed, secrets should not be stored in plaintext string format - when using the `data:` field, Kubernetes expects the values to be Base64-encoded.
- To convert: `echo -n '<secret plaintext value>' | base64`
- Create the secret via `kubectl create -f ...` as normal.
- Secrets can be viewed via: `kubectl get secrets`
- Detailed information is viewed via: `kubectl describe secrets <secret name>`
- To view a secret in more detail, including its values: `kubectl get secret <secret name> -o yaml`
- To decode a secret value: `echo -n '<secret base64 value>' | base64 --decode`
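For instance, encoding the example values above (outputs shown as comments):

```bash
echo -n 'mysql' | base64     # bXlzcWw=
echo -n 'root' | base64      # cm9vdA==
echo -n 'password' | base64  # cGFzc3dvcmQ=

# And decoding again:
echo -n 'bXlzcWw=' | base64 --decode  # mysql
```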
Secrets in Pods¶
- With both a pod and secret YAML file, the secret data can be injected as environment variables:

  ```yaml
  spec:
    containers:
    - envFrom:
      - secretRef:
          name: <secret name>
  ```
- When `kubectl create -f ...` is run, the secret data is available as environment variables in the pod.
- As before, one can inject secrets into pods via environment variables (above) OR as a single environment variable (below):

  ```yaml
  spec:
    containers:
    - env:
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef:
            name: app-secret
            key: DB_PASSWORD
  ```
Secrets in Volumes¶
- Secrets can also be added as volumes attached to pods:

  ```yaml
  volumes:
  - name: app-secret-volume
    secret:
      secretName: app-secret
  ```

- If mounting the secret as a volume, each attribute in the secret is created as a file, with the value as its content.
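For instance, inside the container (a sketch - the mount path is hypothetical):

```bash
ls /opt/app-secret-volumes
# DB_HOST  DB_PASSWORD  DB_USER

cat /opt/app-secret-volumes/DB_PASSWORD
# password
```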
Additional Notes¶
- Secrets are encoded in base64 format -> this can easily be decoded, as they're not encrypted!
- It's thought that secrets are a safer option, but this is only in the sense that "it's better than plaintext".
- It's the practices regarding secret storage that make them safer, including:
  - Not checking-in secret object definition files to source code repositories.
  - Enabling encryption-at-rest for secrets.
- Kubernetes takes some actions to ensure safe handling of secrets:
  - A secret is only sent to a node if a pod on said node requires it.
  - Kubelet stores the secret in temporary file storage (tmpfs), so it's not persisted to disk.
  - Once a pod is deleted, any local copies of secrets used by that pod are deleted.
- For further improved safety regarding secrets, one could also use tools such as Helm Secrets and HashiCorp Vault.
Encrypting Secrets at Rest¶
- Additional guidance can be found in the Kubernetes documentation.
- Encryption at rest is determined by configuring the `kube-apiserver` and the etcd server.
Secret Storage in ETCD¶
- Creating a sample secret in Kubernetes:
  `kubectl create secret generic secret1 -n default --from-literal=mykey=mydata`
- The secret can be read from `etcd` via the `etcdctl` utility:

  ```bash
  ETCDCTL_API=3 etcdctl \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key \
     get /registry/secrets/default/secret1 | hexdump -C
  ```

- Without encryption at rest, the above command returns the secret value in plaintext format - a huge security risk.
Enabling Encryption at Rest¶
- The `kube-apiserver` process accepts the flag `--encryption-provider-config` to determine how API data is encrypted in etcd.
- To check if it's enabled:
  `ps aux | grep kube-api | grep "encryption-provider-config"`
  OR examine the `kube-apiserver.yaml` manifest for the same flag.
- If not enabled, one can define an `EncryptionConfiguration` YAML manifest to attach to this flag.
- An example (not suitable for production) from the documentation follows:
```yaml
---
##
## CAUTION: this is an example configuration.
## Do not use this for your own cluster!
##
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
      - pandas.awesome.bears.example # a custom resource API
    providers:
      # This configuration does not provide data confidentiality. The first
      # configured provider is specifying the "identity" mechanism, which
      # stores resources as plain text.
      #
      - identity: {} # plain text, in other words NO encryption
      - aesgcm:
          keys:
            - name: key1
              secret: c2VjcmV0IGlzIHNlY3VyZQ==
            - name: key2
              secret: dGhpcyBpcyBwYXNzd29yZA==
      - aescbc:
          keys:
            - name: key1
              secret: c2VjcmV0IGlzIHNlY3VyZQ==
            - name: key2
              secret: dGhpcyBpcyBwYXNzd29yZA==
      - secretbox:
          keys:
            - name: key1
              secret: YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXoxMjM0NTY=
  - resources:
      - events
    providers:
      - identity: {} # do not encrypt Events even though *.* is specified below
  - resources:
      - '*.apps' # wildcard match requires Kubernetes 1.27 or later
    providers:
      - aescbc:
          keys:
            - name: key2
              secret: c2VjcmV0IGlzIHNlY3VyZSwgb3IgaXMgaXQ/Cg==
  - resources:
      - '*.*' # wildcard match requires Kubernetes 1.27 or later
    providers:
      - aescbc:
          keys:
            - name: key3
              secret: c2VjcmV0IGlzIHNlY3VyZSwgSSB0aGluaw==
```
- As per usual, the first 2 lines determine the `apiVersion` and `kind`. From there, the general format is to list, in arrays, the `resources` and the `providers` to be used in association with encryption and decryption.
- Under `providers`, the first provider in the list is used to encrypt resources on write; all listed providers are tried in order when decrypting on read.
- There are multiple providers available, such as `identity` (no encryption), `kms`, `secretbox`, etc.
- Creating an example encryption configuration file, which leverages the `aescbc` encryption provider:
```yaml
---
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
      - pandas.awesome.bears.example
    providers:
      - aescbc:
          keys:
            - name: key1
              # See the following text for more details about the secret value
              secret: <BASE 64 ENCODED SECRET>
      - identity: {} # this fallback allows reading unencrypted secrets;
                     # for example, during initial migration
```
- The secret for the `aescbc` provider can be a random 32-byte key encoded in base64: `head -c 32 /dev/urandom | base64`
- After saving the encryption config file to a given path, it needs to be mounted into the `kube-apiserver` static pod: edit the `kube-apiserver` manifest, add the `--encryption-provider-config=/path/to/config` flag, and mount the config's location as a volume:
```yaml
---
##
## This is a fragment of a manifest for a static Pod.
## Check whether this is correct for your cluster and for your API server.
##
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 10.20.30.40:443
  creationTimestamp: null
  labels:
    app.kubernetes.io/component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    ...
    - --encryption-provider-config=/etc/kubernetes/enc/enc.yaml  # add this line
    volumeMounts:
    ...
    - name: enc                        # add this line
      mountPath: /etc/kubernetes/enc   # add this line
      readOnly: true                   # add this line
    ...
  volumes:
  ...
  - name: enc                          # add this line
    hostPath:                          # add this line
      path: /etc/kubernetes/enc        # add this line
      type: DirectoryOrCreate          # add this line
...
```
- As a static pod is being edited, the API server should restart automatically; if it fails to do so, troubleshoot accordingly.
- These steps should be repeated per control plane node, and for verification, the original steps with `etcdctl` can be repeated.
- Note: To ensure all relevant data is encrypted, including that which is already stored, run the following command as an administrator to rewrite existing secrets with the new config:
  `kubectl get secrets --all-namespaces -o json | kubectl replace -f -`
3.11 - Docker Security¶
- Consider a host with Docker running on it.
  - The host will be running processes such as the Docker daemon.
- Containers aren't completely isolated from their host -> they share the same kernel.
- In general, containers are separated by (Linux) namespaces.
  - All processes run by a container are run on the host, just in their own namespace.
- The container can only see its own processes via `ps aux`.
- On the host, all processes are visible; container processes have differing IDs depending on their namespace.
Security: Users¶
- Docker hosts can have root users as well as non-root users.
- By default, Docker runs processes in containers as the root user.
  - True both within and outside the container.
- To edit the default user for the container, use `docker run` in a similar manner to:
  `docker run --user=<username> <container> <command>`
- To enforce security, one can add a `USER` value to the Dockerfile, e.g. `USER 1000`
  - This automatically defines the user when the container is built and run.
- When running a container that defaults to the root user, Docker takes measures to prevent the root user from taking unnecessary actions via Linux capabilities.
Linux Capabilities¶
- Listed in `/usr/include/linux/capability.h`
- In containers, Docker applies a limited set of capabilities by default.
- To override, add `--cap-add <CAPABILITY NAME>` to the `docker run` command.
- To remove: `--cap-drop <CAPABILITY NAME>` in a similar manner.
- For all capabilities, add `--privileged` -> not recommended!
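Pulling these flags together (a sketch - the user ID and capability names are illustrative):

```bash
# Run as a non-root user (user ID 1000 is arbitrary here)
docker run --user=1000 ubuntu sleep 3600

# Grant an additional capability
docker run --cap-add MAC_ADMIN ubuntu

# Drop a capability
docker run --cap-drop KILL ubuntu

# Grant all capabilities -> not recommended!
docker run --privileged ubuntu
```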
3.12 - Security Contexts¶
- Container security may be configured by adding or specifying users and their associated capabilities in the `docker run` command.
- Similar settings may also be handled via Kubernetes.
- As containers are hosted within pods in Kubernetes, one can configure security at either the pod or container level.
  - If configured at pod level, the settings are automatically applied to the containers within.
  - If configured at container level, these settings override anything defined at pod level.
Security Context¶
- To configure security in the definition file, add the `securityContext:` attribute.
- The user is set by `runAsUser: <user ID>`
- To configure at container level, add the same fields to the container within the containers list.
- To add capabilities, add `capabilities:`, then in a dictionary, add `add: ["<CAPABILITY NAME>", ...]`:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: pod
  spec:
    # securityContext:    # pod-level alternative
    #   runAsUser: 1000
    containers:
    - name: ubuntu
      image: ubuntu
      command: ["sleep", "3600"]
      securityContext:
        runAsUser: 1000
        capabilities:
          add: ["MAC_ADMIN"]
  ```
- Note: Capabilities are only supported at container-level, not pod-level.
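To verify the effect (a sketch - the pod name matches the example above; exact output may vary):

```bash
kubectl exec pod -- id
# uid=1000 gid=0(root) groups=0(root)
```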
3.14 - Service Account¶
- A service account links to security concepts such as authorization, RBAC, etc.
- One of 2 account types in Kubernetes, the other being a user account:
  - User accounts are used by humans, e.g. developer accounts.
  - Service accounts are used by applications to interact with Kubernetes.
- For an app to use/interact with the Kubernetes API, it needs to authenticate via a service account.
- Creation via: `kubectl create serviceaccount <serviceaccount name>`
- View service accounts via `kubectl get serviceaccount`
- When a `serviceaccount` is created, it automatically creates a service token to be used for authentication.
  - The token can be viewed (along with other details) via `kubectl describe serviceaccount <serviceaccount name>`
- The token is stored as a Kubernetes secret by default; it can be viewed via `kubectl describe secret <secret ID>`
- Suppose the app using the service account is itself part of the Kubernetes cluster:
  - One can mount the service token secret as a volume inside the pod.
  - This allows it to be easily accessible by the application.
- The default service account and its corresponding token are automatically mounted as a volume to the pod.
- In the path of the mount, 3 files are stored:
  - `namespace`
  - `token`
  - `ca.crt`
- The above can be viewed for a given pod via `kubectl exec -it <pod name> -- ls /var/run/secrets/kubernetes.io/serviceaccount`
- Note: The default service account is restricted to basic operations.
  - Automatically created and mounted.
- If you wish to switch it, edit the pod definition file to add `serviceAccountName:` under `spec:`, then delete and recreate the pod.
  - Changes like this are automatically applied if editing a deployment instead.
- Note: One can avoid automatic service account token mounting via the addition of `automountServiceAccountToken: false`
- Example:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: my-pod
  spec:
    serviceAccountName: build-robot
    automountServiceAccountToken: false
  ```
Service Accounts Updates - 1.22/1.24¶
- All namespaces have a default service account with its own secret; when a pod is created, this service account is automatically associated with the pod, and the secret is mounted at a given location.
- As a result, a process within the pod can query the Kubernetes API using the mounted token.
- Checking the location where the secret is mounted, three files are found:
  - `ca.crt`
  - `namespace`
  - `token` - the ServiceAccount token.
- The token can be decoded via `jq` (or some other means, e.g. jwt.io) via the following command:
  `jq -R 'split(".") | select(length > 0) | .[0],.[1] | @base64d | fromjson' <<< <token>`
- The output shows that the token has no expiry date defined in the payload section - this poses problems.
v1.22 Notes on Bound Service Account Tokens¶
- Kubernetes already provisions JWTs to workloads, a functionality that is enabled by default and therefore widely deployed - leading to the following problems:
  - These JWTs are not audience-bound -> anyone with the JWT can pretend to be an authenticated user.
  - The model of storing the service account token in a secret delivered to nodes presents a large attack surface on the control plane.
  - The JWTs are not time-bound, so any JWT compromised via the previous 2 points is valid so long as the service account exists.
  - Each JWT requires a Kubernetes secret -> not scalable.
- To overcome this, the TokenRequest API was introduced; this generates tokens which are audience-, time-, and object-bound.
- Since its introduction, when a new pod is created, it no longer relies on the token from the service account's secret; instead, a token with a particular lifetime is generated through the TokenRequest API by the ServiceAccount admission controller.
- This token is then mounted as a projected volume into the pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-<random string>
  volumes:
  - name: kube-api-access-<random string>
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
```
v1.24 Enhancements¶
- This version included changes to reduce the amount of secret-based service account tokens.
- Previously, upon `serviceaccount` creation, a secret token (with no expiry and not audience-bound) was automatically created and mounted into pods upon their creation.
- In `v1.24`, the `serviceaccount` token creation is no longer automatic; instead, the token must be requested separately after serviceaccount creation, i.e.:
  `kubectl create serviceaccount <serviceaccount name>`
  `kubectl create token <serviceaccount name>`
- If this token is decoded via the means defined previously, an expiry date can be seen in the payload.
- The secrets can still be created with no expiry if desired - however, this is REALLY not advised:

  ```yaml
  apiVersion: v1
  kind: Secret
  type: kubernetes.io/service-account-token
  metadata:
    name: <secret name>
    annotations:
      kubernetes.io/service-account.name: <service account name>
  ```
3.17 - Resource Requirements¶
- Consider a 3-node setup; each node has a set amount of resources available, i.e.:
  - CPU
  - Memory
  - Disk space
- The Kubernetes scheduler is responsible for allocating pods to nodes.
  - To do so, it takes into account the node's current resource allocation and the resources requested by the pod.
  - If no resources are available, the scheduler holds the pod back in a pending state.
- Kubernetes assumes by default that a pod or container within a pod requires at least:
  - `0.5` CPU units
  - `256Mi` memory
- If the pod or container requires more resources than allocated above, one can configure the pod definition file's spec - in particular, add the following under the `containers` list:

  ```yaml
  resources:
    requests:
      memory: "1Gi"
      cpu: 1
  ```
Resource - CPU¶
- Can be set from `1m` (1 milliCPU, i.e. 0.001 CPU) to as high as required/supported by the host system.
- 1 CPU = 1 AWS vCPU = 1 GCP core = 1 Azure core = 1 hyperthread
Resources - Memory¶
- Allocate with any of the following suffixes, per the given purpose and the system's capabilities:

| Memory Metric | Shorthand Notation | Equivalency |
|---|---|---|
| Gigabyte | G | 1000M |
| Megabyte | M | 1000K |
| Kilobyte | K | 1000 bytes |
| Gibibyte | Gi | 1024Mi |
| Mebibyte | Mi | 1024Ki |
| Kibibyte | Ki | 1024 bytes |
- Plain Docker containers have no limit to the resources they can consume.
- By default within Kubernetes, a container is limited to a maximum of 1 vCPU on its node - if the limits need changing, update the pod definition file:

  ```yaml
  resources:
    requests:
      memory: <value and unit>
      cpu: <number>
    limits:
      memory: <value and unit>
      cpu: <number>
  ```

- The limits and requests can be set per container within each pod.
- If the CPU limit is reached, CPU usage is "throttled" on the node so it does not go beyond the limit.
- If memory use exceeds the limit repeatedly or for an extended period, the pod is terminated with an `OOM` (Out of Memory) error.
Default Behavior¶
- By default, Kubernetes doesn't set CPU or memory requests or limits - pods are able to use as much of these resources as they like, potentially preventing other pods from functioning.
- Rather than putting blanket limits or requests in place, it's often advised to tailor the requests and limits per workload.
- Alternatively, one could set requests without limits, ensuring that each pod is guaranteed its minimum required resources whilst remaining free to use more when available - though this doesn't prevent throttling imposed by resource quotas.
Limit Ranges¶
- To ensure all pods are created with particular limits by default, one can create a `LimitRange` object - both CPU and memory constraints can be defined by a single LimitRange:

  ```yaml
  apiVersion: v1
  kind: LimitRange
  metadata:
    name: cpu-resource-constraint
  spec:
    limits:
    - default:
        cpu: 500m
      defaultRequest:
        cpu: 500m
      max:
        cpu: "1"
      min:
        cpu: 100m
      type: Container
  ```
- Quotas can be set for a namespace itself by creating a `ResourceQuota` object - these are namespace-scoped:

  ```yaml
  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: my-resource-quota
  spec:
    hard:
      requests.cpu: 4
      requests.memory: 4Gi
      limits.cpu: 10
      limits.memory: 10Gi
  ```
3.19 - Taints and Tolerations¶
- Used to set restrictions regarding which pods can be scheduled on a node.
- Consider a cluster of 3 nodes with 4 pods preparing for launch:
  - The scheduler will place the pods across all nodes equally if no restrictions apply.
- Suppose now only 1 node has the resources available to run a particular application:
  - A taint can be applied to the node in question, preventing any unwanted pods from being scheduled on it.
  - Tolerations then need to be applied to the pod(s) intended to run on node 1.
- Pods can only be scheduled on a tainted node if their tolerations match the taint applied to it.
- Taints and tolerations allow the scheduler to allocate pods to the required nodes, such that all resources are used and allocated accordingly.
- Note: By default, no tolerations are applied to pods.
Taints - Node¶
- To apply a taint: `kubectl taint nodes <nodename> key=value:<taint-effect>`
- The key-value pair defined could match labels defined for resources, e.g. `app=frontend`
- The taint effect determines what happens to pods that are intolerant of the taint; 1 of 3 possibilities can be specified:
  - `NoSchedule` - Pods won't be scheduled.
  - `PreferNoSchedule` - Try to avoid scheduling if possible.
  - `NoExecute` - New pods won't be scheduled, and any pre-existing pods intolerant of the taint are stopped and evicted.
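Example (a sketch - the `app=blue` key-value pair matches the toleration example below):

```bash
# Taint node1 so only pods tolerating app=blue may be scheduled on it
kubectl taint nodes node1 app=blue:NoSchedule

# Remove the same taint later (note the trailing "-")
kubectl taint nodes node1 app=blue:NoSchedule-
```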
Tolerations - Pod¶
- To apply a toleration to a pod, one can look at the definition file.
- In the spec section, add something similar to the following:

  ```yaml
  ...
  spec:
    containers:
    ...
    tolerations:
    - key: "app"
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"
  ```

- Be sure to apply the same values used when applying the taint to the node.
- All values added need to be enclosed in double quotes.
Taint - NoExecute¶
- Suppose Node 1 is to be used for a particular application:
  - Apply a taint to Node 1 with the app name, and add a toleration to the pod running the app.
- Setting the taint effect to `NoExecute` causes existing pods on the node that are intolerant to be stopped and evicted.
- Taints and tolerations are only used to restrict pod access to nodes.
  - As there are no restrictions/taints applied to the other nodes, there's a chance the app could still be placed on a different node.
- If wanting the pod to go to a particular node, one can utilize node affinity.
- Note: A taint is automatically applied to the master node, such that no pods can be scheduled on it.
  - View it via `kubectl describe node kubemaster | grep Taint`
3.20 - Node Selectors¶
- Consider a 3-node cluster, with 1 node having a larger resource configuration:
  - In this scenario, one would like the task/process requiring more resources to go to the larger node.
- To solve this, one can place limitations on pods.
- This can be done via the `nodeSelector` property in the definition file:

  ```yaml
  nodeSelector:
    size: <node label value>
  ```

- NodeSelectors require the node to be labelled:
  `kubectl label nodes <node name> <label key>=<label value>`
- When the pod is created, it should be assigned to the labelled node, so long as the resources allow it.
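A worked sketch (the `size=Large` label and the pod/image names are illustrative):

```bash
# Label the large node
kubectl label nodes node-1 size=Large
```

```yaml
# Pod definition targeting the labelled node
apiVersion: v1
kind: Pod
metadata:
  name: data-processor     # hypothetical name
spec:
  containers:
  - name: data-processor
    image: data-processor  # hypothetical image
  nodeSelector:
    size: Large
```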
Limitations of NodeSelectors¶
- NodeSelectors are beneficial for simple allocation tasks, but if more complex allocation is needed, Node Affinity is recommended, e.g. "go to either 1 of 2 nodes".
3.22 - Node Affinity¶
- Node affinity looks to ensure that pods are hosted on the desired nodes.
  - Can ensure high-resource-consumption jobs are allocated to high-resource nodes.
- Node affinity allows more complex capabilities regarding pod-node limitation.
- To specify, in the spec section of a pod definition file, add a new field:

  ```yaml
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: size
            operator: In
            values:
            - Large
  ```
- Note: For the example above, the `NotIn` operator could also be used to avoid particular nodes.
- Note: If just needing a pod to go to any node with a particular label, regardless of value, use the `Exists` operator -> no values are required in this case (see the sketch after this list).
- Additional operators are available, with further details provided in the documentation.
- In the event that a node cannot be matched due to a label fault, the resulting action depends upon the node affinity type set.
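A minimal sketch of the `Exists` form (the `size` key is carried over from the example above):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size
          operator: Exists  # matches any node with a "size" label, regardless of value
```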
Node Affinity Types¶
- Defines the scheduler's behavior regarding node affinity at the different stages of the pod lifecycle.
- 2 main types are available:
  - `requiredDuringSchedulingIgnoredDuringExecution`
  - `preferredDuringSchedulingIgnoredDuringExecution`
- Other types are planned, such as `requiredDuringSchedulingRequiredDuringExecution`
Considering the 2 available types, can break it down into the 2 stages of a pod lifecycle:
- DuringScheduling -> The pod has been created for the first time and not deployed
-
DuringExecution
-
If the node isn't available according to the NodeAffinity, the resultant action is dependent upon the NodeAffinity type:
-
Required:
- Pod must be placed on a node that satisfies the node affinity criteria
- If no node satisfies the criteria, the pod won't be scheduled
-
Generally used when the node placement is crucial
-
Preferred:
- Used if the pod placement is less important than the need for running the task
- If a matching node not found, the scheduler ignores the NodeAffinity
-
Pod placed on any available node
-
Suppose a pod has been running and a change is made to the Node Affinity:
- The response is determined by the prefix of
DuringExecution
:- Ignored:
- Pods continue to run
- Any changes in Node Affinity will have no effect once scheduled.
- Required:
- When applied, if any current pods that don't meet the NodeAffinity requirements are evicted.
Taints and Tolerations vs Node Affinity¶
- Consider a 5-node setup:
  - Blue node: runs the blue pod.
  - Red node: runs the red pod.
  - Green node: runs the green pod.
  - Node 1: to run a grey pod.
  - Node 2: to run a grey pod.
- Apply a taint to each of the colored nodes to accept their respective pods:
  - Tolerations are then applied to the pods.
  - Taints on the colored nodes alone aren't enough - the colored pods could still be allocated to Node 1 or Node 2, where they're not wanted.
- To overcome this, use node affinity:
  - Label the nodes with their respective colors.
  - Pods end up on the correct nodes via the node affinity rules.
  - There's still a chance that unwanted pods could be allocated, e.g. the grey pods could be scheduled on the colored nodes.
- A combination of taints and tolerations, and node affinity, must therefore be used:
  - Apply taints and tolerations to prevent unwanted pods from being placed on the colored nodes.
  - Use node affinity to prevent the colored pods from being placed on the incorrect nodes.