# GPU-based workloads
Cloudfleet Kubernetes Engine (CFKE) supports NVIDIA GPUs for workloads that require GPU acceleration. This guide explains how to provision nodes with NVIDIA GPUs and schedule workloads effectively.
## Adding nodes with NVIDIA GPUs
Depending on the node type, you can add nodes with NVIDIA GPUs to your cluster in two ways:
- **Self-Managed Nodes**: Enable GPU support by adding the `--install-nvidia-drivers` flag to the `cloudfleet clusters add-self-managed-node` CLI command when adding a self-managed node. For more information, refer to the Self-Managed Nodes document.
- **Auto-Provisioned Nodes**: Fleets support automatic provisioning of cloud instances with NVIDIA GPUs. If a workload requires a GPU and existing nodes cannot satisfy this requirement, CFKE will automatically provision a new GPU-equipped node. Refer to the next section to learn about the labels used for automated GPU node provisioning.
In both cases, the NVIDIA drivers and the NVIDIA Container Toolkit are installed and configured automatically to support GPU workloads. Note that use of the NVIDIA drivers is subject to the NVIDIA Driver License Agreement; by using them, you agree to its terms.
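For self-managed nodes, the flow above boils down to a single CLI call. This is a sketch: the flag and subcommand are as documented above, but the cluster name is an illustrative placeholder, and the command may require additional arguments depending on your setup.

```shell
# Join a self-managed node to the cluster; the flag tells CFKE to install
# the NVIDIA drivers and the NVIDIA Container Toolkit automatically.
# "my-cluster" is an illustrative placeholder.
cloudfleet clusters add-self-managed-node my-cluster \
  --install-nvidia-drivers
```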
## Scheduling GPU-based workloads with CFKE
The number of GPUs in a node is exposed as a capacity field on the node object, under the key `nvidia.com/gpu`. You can use this field to schedule workloads on GPU-enabled nodes. Below is an example:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda11.6.0"
      resources:
        limits:
          nvidia.com/gpu: 1
```
In this example, the `nvidia.com/gpu` limit requests one GPU. The Kubernetes scheduler will assign this workload to a node with at least one GPU available. If no GPU-enabled node is available and a Fleet is configured, CFKE will provision a new GPU-equipped node.
### Requesting specific GPU models
While `nvidia.com/gpu` indicates the number of GPUs, it does not specify the GPU model. To schedule workloads based on specific GPU models, use the following node labels:

- `cfke.io/accelerator-manufacturer`: Manufacturer (e.g., `NVIDIA`).
- `cfke.io/accelerator-name`: Model name (e.g., `V100`).
- `cfke.io/accelerator-memory`: Memory size in GB (e.g., `24`).
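Before pinning a workload to a specific model, it can help to see which accelerators your existing nodes expose. Assuming you have `kubectl` access to the cluster, the standard `-L` flag prints a label as an extra column:

```shell
# List nodes with their CFKE accelerator labels as columns.
# Label keys are taken from the list above.
kubectl get nodes \
  -L cfke.io/accelerator-manufacturer \
  -L cfke.io/accelerator-name \
  -L cfke.io/accelerator-memory
```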
To schedule a workload on a node with an NVIDIA V100 GPU:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    cfke.io/accelerator-name: V100
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda11.6.0"
      resources:
        limits:
          nvidia.com/gpu: 1
```
If multiple clouds support the specified GPU (e.g., both AWS and GCP offer the `V100`), CFKE will provision the most cost-effective option. To target a specific cloud provider, use the `cfke.io/provider` label:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    cfke.io/accelerator-name: V100
    cfke.io/provider: GCP
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda11.6.0"
      resources:
        limits:
          nvidia.com/gpu: 1
```
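If you generate manifests like the two above programmatically, the pattern can be captured in a small helper. This is a sketch, not part of CFKE: the function name is hypothetical, but the label keys, resource key, and sample image are the ones documented above.

```python
def gpu_pod_manifest(name, image, gpus=1, accelerator=None, provider=None):
    """Build a Pod manifest that requests NVIDIA GPUs via nvidia.com/gpu,
    optionally pinning the CFKE accelerator model and cloud provider."""
    node_selector = {}
    if accelerator:
        node_selector["cfke.io/accelerator-name"] = accelerator
    if provider:
        node_selector["cfke.io/provider"] = provider
    spec = {
        "restartPolicy": "OnFailure",
        "containers": [{
            "name": name,
            "image": image,
            "resources": {"limits": {"nvidia.com/gpu": gpus}},
        }],
    }
    if node_selector:
        spec["nodeSelector"] = node_selector
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }

# Reproduces the provider-pinned example above as a Python dict.
manifest = gpu_pod_manifest(
    "gpu-test", "nvidia/samples:vectoradd-cuda11.6.0",
    accelerator="V100", provider="GCP",
)
```

Serializing the returned dict to YAML (e.g., with PyYAML) yields a manifest equivalent to the one shown above.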
### Available GPU models and providers
| Provider | GPU Model |
|----------|-----------|
| AWS | K80 |
| AWS | M60 |
| AWS | T4 |
| GCP | T4 |
| GCP | P4 |
| AWS | L4 |
| GCP | L4 |
| AWS | A10 |
| AWS | V100 |
| GCP | V100 |
| GCP | P100 |
| AWS | A100 |
| GCP | A100 |
| AWS | L40S |
| AWS | H100 |
| GCP | H100 |
| AWS | H200 |
Once a suitable node is provisioned, CFKE will update the `nvidia.com/gpu` capacity field on the node object, enabling the workload to be scheduled.
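One way to check the result end to end. This assumes `kubectl` access and that the manifest above is saved as `gpu-test.yaml` (an illustrative filename):

```shell
# Apply the GPU pod and watch until it is scheduled onto the new node.
kubectl apply -f gpu-test.yaml
kubectl get pod gpu-test -o wide --watch

# Once the container has run, inspect the CUDA sample's output.
kubectl logs gpu-test
```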
Please note that your cloud account must have the necessary quota to provision GPU-equipped instances.
For additional details on GPU scheduling, refer to the Kubernetes Documentation.