Labels and taints
Intro¶
Labels, taints, and taint tolerations are mechanisms for choosing the node on which each job pod actually runs. Node selection may or may not matter to you: in many cases the default scheduling algorithm picks a suitable node without any labels or tolerations being specified. In some cases, however, you may want to target a specific type of node, for example to optimize cost or because your computational task benefits from running on a GPU-equipped node.
Getting information about cluster node pools¶
The endpoint GET /api/process/cluster/list provides information about the available clusters and their node pools. A sample response is shown below:
{
  "data": [
    {
      "properties": {
        "name": "aks-1-dev",
        "clustertype": "kubernetes",
        "nodepools": [
          {
            "name": "jobpooldef",
            "labels": [
              "dhi.platform/ispreemptive=false",
              "dhi.platform/vmsize=Standard_D4as_v4"
            ],
            "taints": []
          },
          {
            "name": "jobpoolspot1",
            "labels": [
              "dhi.platform/ispreemptive=true",
              "dhi.platform/vmsize=Standard_L4s"
            ],
            "taints": [
              "dhi.platform/ispreemptive=true:NoSchedule"
            ]
          }
        ]
      }
    }
  ]
}
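As a minimal sketch of how such a response might be consumed, the Python snippet below parses a copy of the sample response and lists the labels and taints of each node pool. The helper name `list_nodepools` is hypothetical, and the snippet works on an in-memory string rather than calling the endpoint, because the base URL and authentication are not covered here.

```python
import json

# Copy of the sample response shown above.
SAMPLE_RESPONSE = """
{
  "data": [
    {
      "properties": {
        "name": "aks-1-dev",
        "clustertype": "kubernetes",
        "nodepools": [
          {
            "name": "jobpooldef",
            "labels": [
              "dhi.platform/ispreemptive=false",
              "dhi.platform/vmsize=Standard_D4as_v4"
            ],
            "taints": []
          },
          {
            "name": "jobpoolspot1",
            "labels": [
              "dhi.platform/ispreemptive=true",
              "dhi.platform/vmsize=Standard_L4s"
            ],
            "taints": ["dhi.platform/ispreemptive=true:NoSchedule"]
          }
        ]
      }
    }
  ]
}
"""


def list_nodepools(response_text):
    """Return (cluster, pool, labels, taints) tuples (hypothetical helper)."""
    rows = []
    for cluster in json.loads(response_text)["data"]:
        props = cluster["properties"]
        for pool in props.get("nodepools", []):
            # Labels arrive as "key=value" strings; split them into a dict.
            labels = dict(label.split("=", 1) for label in pool.get("labels", []))
            rows.append((props["name"], pool["name"], labels, pool.get("taints", [])))
    return rows


for cluster, pool, labels, taints in list_nodepools(SAMPLE_RESPONSE):
    print(cluster, pool, labels, taints)
```

Splitting each label on the first `=` keeps the key and value separate, which makes it easier to look up a pool by, say, `dhi.platform/vmsize`.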
Labels¶
Labels on a node (or node pool) describe its capabilities. The sample listing above shows one cluster with two node pools, each defining two labels. These are standardized, platform-specific labels.
Label | Values | Description |
---|---|---|
dhi.platform/ispreemptive | true, false | Indicates whether the node pool runs on preemptible (spot-based) VMs |
dhi.platform/vmsize | (varies) | Indicates the size or type of the VM in the node pool |
Taints¶
Node taints mark conditions that a job must tolerate before its pods can be scheduled on that node. Currently one taint is in use, dhi.platform/ispreemptive=true (matching the label described above), which marks spot-based VMs. Tolerating this taint means accepting that jobs can be evicted mid-execution or may take longer to start (because the availability of the machines is subject to prioritization). In return, the VM price is lower.
Choosing the right label and toleration combination¶
To control which node a job runs on, the ContainerRuntimeSpec object provides two collections, requiredLabels and toleratedTaints:
- By populating the requiredLabels collection, you indicate that you want the job to run on a node with those labels.
- By populating the toleratedTaints collection, you indicate that the job is allowed to run on nodes with those taints.
- The dhi.platform/type=job label and toleration are added to each job automatically, without the need for user input.
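The combined effect of the two collections can be sketched in Python as follows. This is an illustration under stated assumptions, not the platform's scheduler: it assumes the usual Kubernetes-style semantics (a node qualifies only if it carries every required label and every one of its taints is tolerated), and the function names are hypothetical. The automatically added dhi.platform/type=job label is left out because it does not appear in the sample listing.

```python
# Node pools from the sample cluster listing.
POOLS = [
    {
        "name": "jobpooldef",
        "labels": ["dhi.platform/ispreemptive=false",
                   "dhi.platform/vmsize=Standard_D4as_v4"],
        "taints": [],
    },
    {
        "name": "jobpoolspot1",
        "labels": ["dhi.platform/ispreemptive=true",
                   "dhi.platform/vmsize=Standard_L4s"],
        "taints": ["dhi.platform/ispreemptive=true:NoSchedule"],
    },
]


def taint_key_value(taint):
    """Drop the ':Effect' suffix, e.g. 'k=v:NoSchedule' becomes 'k=v'."""
    return taint.rsplit(":", 1)[0]


def eligible_pools(nodepools, required_labels=(), tolerated_taints=()):
    """Return names of pools a job could run on (assumed semantics)."""
    names = []
    for pool in nodepools:
        # Every required label must be present on the pool.
        has_labels = set(required_labels) <= set(pool["labels"])
        # Every taint on the pool must be tolerated by the job.
        taints_ok = all(taint_key_value(t) in tolerated_taints
                        for t in pool["taints"])
        if has_labels and taints_ok:
            names.append(pool["name"])
    return names


SPOT = "dhi.platform/ispreemptive=true"
print(eligible_pools(POOLS))                                   # no constraints
print(eligible_pools(POOLS, tolerated_taints=[SPOT]))          # allow spot
print(eligible_pools(POOLS, required_labels=[SPOT],
                     tolerated_taints=[SPOT]))                 # require spot
```

Under these assumptions, requiring the spot label without also tolerating the spot taint matches no pool, which is why the label and the toleration are specified together when a spot machine is required.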
Examples:¶
The examples below use the sample cluster listing from above to show some valid combinations of labels and tolerations and their resulting effect.
Allow running on a spot machine¶
{
  "runtime": {
    "type": "ContainerRuntimeSpec",
    "containers": ...,
    "toleratedTaints": [
      "dhi.platform/ispreemptive=true"
    ]
  }
}
Require running on a spot machine¶
{
  "runtime": {
    "type": "ContainerRuntimeSpec",
    "containers": ...,
    "requiredLabels": [
      "dhi.platform/ispreemptive=true"
    ],
    "toleratedTaints": [
      "dhi.platform/ispreemptive=true"
    ]
  }
}
Run on a specific machine size¶
{
  "runtime": {
    "type": "ContainerRuntimeSpec",
    "containers": ...,
    "requiredLabels": [
      "dhi.platform/vmsize=Standard_ND96amsr_v4"
    ]
  }
}
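When such fragments are generated programmatically, a small helper can keep the label and toleration in sync. The following Python sketch is only an illustration: `build_runtime` and `require_spot` are hypothetical names, and the `containers` value is passed through untouched because its structure is not described in this section (an empty list stands in as a placeholder).

```python
import json

SPOT = "dhi.platform/ispreemptive=true"


def build_runtime(containers, required_labels=None, tolerated_taints=None):
    """Build a runtime fragment shaped like the examples above (hypothetical)."""
    runtime = {"type": "ContainerRuntimeSpec", "containers": containers}
    # Only emit the collections that are actually populated.
    if required_labels:
        runtime["requiredLabels"] = list(required_labels)
    if tolerated_taints:
        runtime["toleratedTaints"] = list(tolerated_taints)
    return {"runtime": runtime}


def require_spot(containers):
    """Require a spot machine: set the label and the matching toleration."""
    return build_runtime(containers,
                         required_labels=[SPOT],
                         tolerated_taints=[SPOT])


# Placeholder containers value; replace with the real container definitions.
print(json.dumps(require_spot([]), indent=2))
```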