
Resource requests

Default resource limits

By default, each job runs with predefined limits on CPU cores and memory usage. The baseline is 1 CPU core and 4 GB of memory, although these defaults may vary per environment and over time. Storage is limited to 2 GiB for the temporary folder (/tmp) and 20 GiB for the work folder (/work); the work folder limit can be increased on request (see below). The storage limits are not necessarily "hard" limits in the sense that the folder or filesystem is capped at that size, but if a limit is exceeded, the whole process may be evicted.
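As a quick way to inspect the default storage limits from inside a container, a minimal job body along the lines of the sketch below (using the same busybox image as the examples later on this page) can simply run df against both folders. Depending on how the folders are mounted, df may or may not report the quota exactly; CPU and memory limits are enforced by the runtime and are not visible through df.

{
    "runtime": {
        "type": "ContainerRuntimeSpec",
        "containers": [
            {
                "image": "busybox",
                "command": ["/bin/sh", "-c", "df -h /tmp /work"]
            }
        ]
    }
}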

Custom requests

If your job requires access to more resources than the default values allow, you can specify that in the job request.

Custom requests for containers

For container-based jobs running in Kubernetes, each container object has some optional properties:

- cpuCores: a decimal number indicating the number of CPU cores requested for the container (fractional values such as 0.5 are allowed)
- memoryMB: an integer indicating the amount of memory (in megabytes) requested for the container

There are some hard limits that should not be exceeded when setting these values. Each job can only be allocated a certain number of CPU cores and a certain amount of memory (typically in line with what the underlying nodes can provide). This applies to each container individually, but also to all containers combined (as the whole job pod runs on a single node).

As of 17-May-2022, the hard limit is 43 requested CPU cores and 340000 MB of memory.

A job may only be able to reach the maximum advertised amount of resources if a certain taint is accepted for the job (so that a specific node type can be used). As of 08-02-2022 there is no validation that checks, before the job is executed, whether the requested amount of resources can be satisfied with the given set of labels and taints. It is therefore the responsibility of the user to determine that beforehand.
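As a concrete illustration of the combined limit, the sketch below requests resources for two containers in one job; both the per-container values and their sum must stay within the hard limits (the images, commands and numbers are placeholders):

{
    "runtime": {
        "type": "ContainerRuntimeSpec",
        "containers": [
            {
                "image": "busybox",
                "command": ["/bin/sh", "-c", "ls"],
                "cpuCores": 16,
                "memoryMB": 131072
            },
            {
                "image": "busybox",
                "command": ["/bin/sh", "-c", "ls"],
                "cpuCores": 16,
                "memoryMB": 131072
            }
        ]
    }
}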

Example request that specifies CPU and memory requirements

{
    "runtime": {
        "type": "ContainerRuntimeSpec",
        "containers": [
            {
                "Image": "busybox",
                "Command": ["/bin/sh", "-c", "ls"],
                "cpuCores": 0.5,
                "memoryMB": 1024
            }
        ]
    }
}

Storage request

By filling in the StorageRequest object in the ContainerRuntimeSpec, the user can request a larger quota for the work folder (/work). Currently only one type of storage request is supported: LocalStorageRequestSpec. Using this type of storage request causes the job to be labelled (see link) with a label that assigns the pod to a machine with a larger local disk, onto which the work folder is mounted. A typical machine intended for workloads requiring a large workspace is the Standard_L8s_V2 with a 1.92 TB disk.

The following example of a job body requests 220 GB of storage. The command of the job lists the mounted filesystems so that you can check the capacity of the /work folder, and then allocates a large file there to demonstrate that the space is actually available.

{
    "runtime": {
        "type": "ContainerRuntimeSpec",
        "containers": [
            { "image":"busybox", "command": [ "/bin/sh", "-c", "df -h; fallocate -l 200G /work/largefile.bin; ls -l /work;"] }
        ],
        "storageRequest": {
            "storageSizeGB": 220,
            "type": "LocalStorageRequestSpec"
        }
    }
}

The log from such a job will show the output of df -h and ls -l /work, where you can verify the enlarged capacity of the /work folder and the allocated 200 GB file.

Note: Storage provisioning is complex and its implementation in the Job service may change, so multiple processes running on the same machine may or may not compete for the available disk space. For example, several jobs that each request only a few CPU cores but a large amount of storage could end up scheduled on the same node. If this becomes a problem, it is recommended to size the CPU and/or memory requirements in proportion to the storage requirement, so that the whole node (or an appropriate portion of it) is dedicated to the job, thereby reducing the number of competing jobs on one node. Even though this goes against the idea of abstracting away node features and only specifying the individual resource amounts, it may be necessary until more mature storage provisioning is implemented.
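For example, a sketch of a job body that pairs a large storage request with correspondingly large CPU and memory requests could look like the following; the exact values are illustrative and should be adapted to the node type actually used (e.g. the Standard_L8s_V2 mentioned above):

{
    "runtime": {
        "type": "ContainerRuntimeSpec",
        "containers": [
            {
                "image": "busybox",
                "command": ["/bin/sh", "-c", "df -h /work"],
                "cpuCores": 6,
                "memoryMB": 49152
            }
        ],
        "storageRequest": {
            "storageSizeGB": 220,
            "type": "LocalStorageRequestSpec"
        }
    }
}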