GPU-based workloads

Cloudfleet Kubernetes Engine (CFKE) supports NVIDIA GPUs for workloads that require GPU acceleration. This guide explains how to provision nodes with NVIDIA GPUs and schedule workloads effectively.

Adding nodes with NVIDIA GPUs

Depending on the node type, you can add nodes with NVIDIA GPUs to your cluster in two ways:

  • Self-Managed Nodes: Enable GPU support by adding the --install-nvidia-drivers flag to the cloudfleet clusters add-self-managed-node CLI command when registering the node, as sketched in the example after this list. For more information, refer to the Self-Managed Nodes document.

  • Auto-Provisioned Nodes: Fleets support automatic provisioning of cloud instances with NVIDIA GPUs. If a workload requires a GPU and existing nodes cannot satisfy this requirement, CFKE will automatically provision a new GPU-equipped node. Refer to the next section to learn about the labels used for automated GPU node provisioning.

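For example, a self-managed node can be registered with GPU support along these lines (all other arguments are omitted here and depend on your setup; see the Self-Managed Nodes document for the full invocation):

# Register a self-managed node and install the NVIDIA drivers and Container Toolkit
cloudfleet clusters add-self-managed-node --install-nvidia-drivers [other options...]
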
In both cases, the NVIDIA drivers and the NVIDIA Container Toolkit are installed and configured automatically to support GPU workloads. Please note that use of the NVIDIA drivers is subject to the NVIDIA Driver License Agreement; by using them, you agree to its terms.

Scheduling GPU-based workloads with CFKE

The number of GPUs on a node is exposed as the extended resource nvidia.com/gpu in the node object's capacity. You can request this resource to schedule workloads on GPU-enabled nodes. Below is an example:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda11.6.0"
    resources:
      limits:
        nvidia.com/gpu: 1

In this example, the nvidia.com/gpu limit requests one GPU. The Kubernetes scheduler assigns the workload to a node with at least one GPU available. If no GPU-enabled node is available and a Fleet is configured, CFKE automatically provisions a new GPU-equipped node.
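
To try this out, assuming the manifest above is saved as gpu-test.yaml (the filename is arbitrary):

kubectl apply -f gpu-test.yaml
kubectl logs pod/gpu-test   # the CUDA vector-add sample should print "Test PASSED" on success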

Requesting specific GPU models

While nvidia.com/gpu indicates the number of GPUs, it does not specify the GPU model. To schedule workloads based on specific GPU models, use the following node labels (you can inspect them on existing nodes with the command shown after the list):

  • cfke.io/accelerator-manufacturer: Manufacturer (e.g., NVIDIA).
  • cfke.io/accelerator-name: Model name (e.g., V100).
  • cfke.io/accelerator-memory: Memory size in GB (e.g., 24).
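
For example, to list these labels on the nodes currently in your cluster (a sketch; the columns will be empty on nodes without GPUs):

kubectl get nodes -L cfke.io/accelerator-manufacturer,cfke.io/accelerator-name,cfke.io/accelerator-memory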

To schedule a workload on a node with an NVIDIA V100 GPU:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    cfke.io/accelerator-name: V100
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda11.6.0"
    resources:
      limits:
        nvidia.com/gpu: 1

If multiple clouds support the specified GPU (e.g., AWS and GCP for V100), CFKE will provision the most cost-effective option. To target a specific cloud provider, use the cfke.io/provider label:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    cfke.io/accelerator-name: V100
    cfke.io/provider: GCP
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda11.6.0"
    resources:
      limits:
        nvidia.com/gpu: 1
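
Note that nodeSelector only matches exact label values. If you instead want, for example, any GPU with more than a given amount of memory, a standard Kubernetes node affinity expression with the Gt operator can be used. The following is a sketch that assumes the cfke.io/accelerator-memory value is a plain integer number of gigabytes, as indicated above:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # Only schedule on nodes whose GPU memory label is greater than 20 (GB)
          - key: cfke.io/accelerator-memory
            operator: Gt
            values:
            - "20"
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda11.6.0"
    resources:
      limits:
        nvidia.com/gpu: 1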

Available GPU models and providers

  • AWS: K80, M60, T4, L4, A10, V100, A100, L40S, H100, H200
  • GCP: T4, P4, L4, V100, P100, A100, H100

Once a suitable node is provisioned, CFKE will update the nvidia.com/gpu field on the node object, enabling the workload to be scheduled.
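
A quick way to confirm that a node is advertising its GPUs (the node name is a placeholder):

# Capacity and Allocatable should both list nvidia.com/gpu once the drivers are ready
kubectl describe node <node-name> | grep nvidia.com/gpu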

Please note that your cloud account must have the necessary quota to provision GPU-equipped instances.

For additional details on GPU scheduling, refer to the Kubernetes Documentation.