GPU-based workloads
Cloudfleet Kubernetes Engine (CFKE) supports NVIDIA GPUs for workloads that require GPU acceleration. This guide explains how to provision nodes with NVIDIA GPUs and schedule workloads effectively.
Adding nodes with NVIDIA GPUs
Depending on the node type, you can add nodes with NVIDIA GPUs to your cluster in two ways:
- Self-Managed Nodes: Enable GPU support by adding the `--install-nvidia-drivers` flag to the `cloudfleet clusters add-self-managed-node` CLI command when adding a self-managed node (see the sketch after this list). For more information, refer to the Self-Managed Nodes document.
- Auto-Provisioned Nodes: Fleets support automatic provisioning of cloud instances with NVIDIA GPUs. If a workload requires a GPU and existing nodes cannot satisfy this requirement, CFKE will automatically provision a new GPU-equipped node. Refer to the next section to learn about the labels used for automated GPU node provisioning.
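For a self-managed node, the command looks roughly like the following. This is a minimal sketch: the additional arguments (node name, join token, and so on) depend on your setup and are shown only as a placeholder; the `--install-nvidia-drivers` flag is the only part specific to GPU support.

```bash
# Add a self-managed node with automatic NVIDIA driver installation.
# <your-usual-arguments> stands for whatever arguments you normally pass
# when adding a self-managed node.
cloudfleet clusters add-self-managed-node <your-usual-arguments> --install-nvidia-drivers
```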
In both cases, the NVIDIA drivers and the NVIDIA Container Toolkit are installed and configured automatically to support GPU workloads. Please note that use of the NVIDIA drivers is subject to the NVIDIA Driver License Agreement; by using them, you agree to its terms.
Scheduling GPU-based workloads with CFKE
The number of GPUs in a node is exposed as a capacity entry on the node object under the name `nvidia.com/gpu`. You can request this resource in a pod spec to schedule workloads on GPU-enabled nodes. Below is an example:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda11.6.0"
    resources:
      limits:
        nvidia.com/gpu: 1
```
In this example, the `nvidia.com/gpu` limit requests one GPU. The Kubernetes scheduler will assign this workload to a node with at least one GPU available. If no GPU-enabled node is available and a Fleet is configured, CFKE will provision a new GPU-equipped node.
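To try the example, you can save the manifest and apply it with kubectl; the file name here is only illustrative. The `vectoradd` sample exits after a short CUDA computation, so checking its logs after completion confirms that the GPU was usable.

```bash
# Apply the example manifest (saved locally as gpu-test.yaml),
# then inspect the pod and its logs once it has run to completion.
kubectl apply -f gpu-test.yaml
kubectl get pod gpu-test
kubectl logs gpu-test
```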
Requesting specific GPU models
While nvidia.com/gpu indicates the number of GPUs, it does not specify the GPU model. To schedule workloads based on specific GPU models, use the following labels:
- `cfke.io/accelerator-manufacturer`: Manufacturer (e.g., NVIDIA).
- `cfke.io/accelerator-name`: Model name (e.g., V100).
- `cfke.io/accelerator-memory`: Memory size in GB (e.g., 24).
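If GPU nodes already exist in the cluster, you can inspect these labels with standard kubectl by printing them as extra columns:

```bash
# Show each node's accelerator labels as additional columns.
kubectl get nodes -L cfke.io/accelerator-manufacturer,cfke.io/accelerator-name,cfke.io/accelerator-memory
```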
To schedule a workload on a node with an NVIDIA V100 GPU:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    cfke.io/accelerator-name: V100
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda11.6.0"
    resources:
      limits:
        nvidia.com/gpu: 1
```
If multiple clouds support the specified GPU (e.g., AWS and GCP for V100), CFKE will provision the most cost-effective option. To target a specific cloud provider, use the cfke.io/provider label:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    cfke.io/accelerator-name: V100
    cfke.io/provider: GCP
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda11.6.0"
    resources:
      limits:
        nvidia.com/gpu: 1
```
Available GPU models and providers
| Provider | GPU Model |
|---|---|
| AWS | K80 |
| AWS | M60 |
| AWS | T4 |
| GCP | T4 |
| GCP | P4 |
| AWS | L4 |
| GCP | L4 |
| AWS | A10 |
| AWS | V100 |
| GCP | V100 |
| GCP | P100 |
| AWS | A100 |
| GCP | A100 |
| AWS | L40S |
| AWS | H100 |
| GCP | H100 |
| AWS | H200 |
Once a suitable node is provisioned, CFKE will update the nvidia.com/gpu field on the node object, enabling the workload to be scheduled.
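One way to confirm that the capacity has been registered is to list nodes with the GPU resource as a column; the dots in the resource name are escaped following kubectl's JSONPath conventions:

```bash
# List nodes together with their advertised nvidia.com/gpu capacity.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
```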
Please note that your cloud account must have the necessary quota to provision GPU-equipped instances.
For additional details on GPU scheduling, refer to the Kubernetes Documentation.
GPU sharing with time slicing and MPS
CFKE supports GPU sharing technologies that allow multiple containers to access a single physical GPU, improving GPU utilization and reducing costs. Two primary methods are available: time slicing and Multi-Process Service (MPS).
GPU time slicing leverages NVIDIA’s built-in time-sharing capability, using instruction-level preemption available in GPUs from the Pascal architecture onward to perform context switching between processes. Time slicing provides software-level isolation with address space, performance, and error isolation between containers. When enabled, multiple containers can share a single GPU by each requesting one full GPU (nvidia.com/gpu: 1), while the physical GPU is shared among all containers.
This approach is optimal for bursty or interactive workloads with idle periods, testing and prototyping environments, and workloads that don’t require dedicated GPU resources.
However, time slicing provides no memory limit enforcement between shared workloads, and the rapid context switching may introduce performance overhead.

Multi-Process Service (MPS) is an alternative sharing strategy that allows multiple CUDA processes from different containers to share a single GPU context. Unlike time slicing, which relies on context switching and time-multiplexing, MPS enables concurrent submission of work streams to the GPU through a control daemon that coordinates access.
MPS reduces context switching overhead and provides lower latency compared to time slicing. It is particularly beneficial for inference workloads with small batch sizes, multiple small CUDA processes, and applications that are CUDA-aware. MPS is generally better suited for inference workloads than for training workloads.
As with time slicing, MPS requires each container to request one full GPU (nvidia.com/gpu: 1). Proper configuration is necessary to enable MPS, and it may not be suitable for all workload types.
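As an illustration, the Deployment below runs three replicas that each request `nvidia.com/gpu: 1`. With time slicing or MPS enabled for the cluster, several of these replicas can end up on the same physical GPU. This is a minimal sketch: the name, replica count, and image are placeholders, and how many pods actually share one GPU depends on the sharing configuration applied to your cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: shared-gpu-demo
  template:
    metadata:
      labels:
        app: shared-gpu-demo
    spec:
      containers:
      - name: inference
        image: registry.example.com/inference-server:latest  # placeholder; use your own image
        resources:
          limits:
            # Each replica still requests one full GPU; the physical GPU
            # is shared between replicas by time slicing or MPS.
            nvidia.com/gpu: 1
```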
Both GPU time slicing and MPS are advanced features that require cluster-level configuration and are not enabled by default. To enable GPU sharing for your cluster, contact Cloudfleet support.