Create a GPU Kubernetes cluster with Lambda Cloud

Lambda is a leading cloud provider that specializes in delivering high-performance AI infrastructure, offering fast and modern NVIDIA GPUs to empower deep learning initiatives. Their cutting-edge hardware and cloud solutions have garnered the trust of industry giants like Microsoft, Intel, and Amazon Research, enabling teams to accelerate their AI breakthroughs.

Lambda has a strong track record and a clear commitment to advancing AI. Their public cloud offering stands out for its excellent hardware rental options. However, it currently does not include built-in container orchestration features like Kubernetes, which have become increasingly important for managing AI/ML workloads.

This is where Cloudfleet comes in. Cloudfleet Kubernetes Engine, a managed Kubernetes solution that runs across different infrastructure platforms, supports running Lambda servers as compute nodes through its self-managed nodes feature.

In this tutorial, we create a Lambda server with a GPU, add it to Cloudfleet Kubernetes Engine, and run a simple Kubernetes Pod that uses the GPU.

Spinning up a Lambda Cloud GPU instance

After signing up, verifying your email and logging into your Lambda Cloud dashboard for the first time, you should be presented with a screen that looks like this:

Lambda cloud dashboard

From this screen, click the “Launch Instance” button. You will then be presented with the available server options:

Lambda cloud new instance wizard

This time, let us select the GH200 (Grace Hopper) instance type, which has an NVIDIA-designed ARM CPU. Cloudfleet supports the ARM architecture, but if your software does not, you should continue with an x86-64-based instance.
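
If you are unsure whether your container images are published for ARM, one quick way to check from a machine with a reasonably recent Docker CLI is to inspect the image manifest and look for an “arm64” entry. A minimal sketch (the image name below is only an example):

docker manifest inspect nginx:latest | grep architecture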

In the next step, we choose the region where our instance will be launched:

Lambda cloud select instance region

At the time of writing, GH200 instances are only available in Washington. Cloudfleet works globally, so it does not matter where the instance is deployed; that is why we are proceeding with Washington.

In the next step, we are asked to attach a filesystem. Filesystems on Lambda are used to persist data across different instances. Since we do not need to persist data across instances, let us click “Don’t attach a filesystem”:

Lambda cloud set up file system

Finally, we are asked to add an existing public SSH key or generate a new one.

Lambda cloud instance SSH key

Let’s assume that we do not have a pre-existing key and create a new one by clicking “Generate a new SSH key” and naming it “cloudfleet.” Creating the SSH key triggers the download of the private key. To use the downloaded private key for SSH access, we need to restrict its permissions. We can do this by running the following command:

chmod 600 cloudfleet.pem
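
Alternatively, if you prefer to bring your own key, a minimal sketch is to generate one locally and paste the contents of the .pub file into Lambda’s SSH key dialog (the file name here is only an example):

ssh-keygen -t ed25519 -f ~/.ssh/cloudfleet -C "cloudfleet"
cat ~/.ssh/cloudfleet.pub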

As the last step, Lambda asks us to agree to the terms and conditions and then starts launching the instance. In a few minutes we will see our instance running:

Lambda cloud running GPU instance

On the same screen we can see the default username and the IP address of the instance.

Just to make sure, let us now try connecting to our new instance via SSH using the information above and our private key:

ssh ubuntu@192.222.59.83 -i cloudfleet.pem

Lambda cloud SSH into running instance

We now have a running GPU instance and can proceed to add it to the Cloudfleet Kubernetes cluster!
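
While still connected over SSH, you can optionally check whether the GPU is already visible to the operating system. If the image ships with the NVIDIA drivers preinstalled, nvidia-smi will list the GH200; if not, the GPU will become usable after Cloudfleet installs the drivers in a later step:

nvidia-smi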

Create a Cloudfleet account and cluster

After setting up the VM, the next step is to create a Kubernetes cluster in Cloudfleet. For this tutorial, we will use a free Kubernetes cluster that you can create in any Cloudfleet account after registering.

Follow these steps to create a Cloudfleet cluster:

  1. Go to https://console.cloudfleet.ai/account/signup

  2. Complete the account registration and sign in to the Cloudfleet console.

  3. In the console, click Create Cluster and follow the setup wizard. You can select a control plane region close to you, but remember that you can add nodes from any location, regardless of the control plane’s region.

  4. Once the cluster is created, click Connect to Cluster and follow the instructions to install the Cloudfleet CLI and configure authentication. At the end of the setup, you’ll see instructions on how to access the cluster with kubectl.

After completing these steps, verify that the cluster is running:

$ kubectl cluster-info
Kubernetes control plane is running at https://c515a6f3-b61a-451f-87c1-cc4600b172e3.northamerica-central-1.cfke.cloudfleet.dev:6443
CoreDNS is running at https://c515a6f3-b61a-451f-87c1-cc4600b172e3.northamerica-central-1.cfke.cloudfleet.dev:6443/api/v1/namespaces/kube-system/services/coredns:udp-53/proxy

Your cluster ID will differ from the example above. You can find the cluster ID on the Cloudfleet console; the first part of the control plane URL is also the cluster ID. (In this case, “c515a6f3-b61a-451f-87c1-cc4600b172e3”.) You will need your cluster ID for the next step, so take note of it.
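
If you prefer the command line, one way to extract the cluster ID from your kubeconfig is a small sketch that simply takes the first label of the control plane hostname:

CLUSTER_ID=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}' | sed -E 's#https://([^.]+)\..*#\1#')
echo "$CLUSTER_ID"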

Next, confirm that no nodes are currently registered:

$ kubectl get nodes
No resources found.

At this stage, the cluster has no nodes because we haven’t added any yet. Let’s add one now.

Add the Lambda Cloud instance to the Cloudfleet Kubernetes cluster

Cloudfleet can manage the lifecycle of nodes automatically for many cloud providers. For platforms like Lambda, its self-managed nodes feature enables seamless integration of regular Linux VMs into Kubernetes clusters.

Now, let’s add the Lambda node to the cluster using the Cloudfleet CLI, which we previously installed and configured.

In our case, the command looks like this:

cloudfleet clusters add-self-managed-node c515a6f3-b61a-451f-87c1-cc4600b172e3 \
  --host 192.222.59.83 \
  --ssh-username ubuntu \
  --ssh-key cloudfleet.pem \
  --region us-east-3 \
  --zone us-east-3 \
  --install-nvidia-drivers

Do not forget to replace the cluster ID, IP address, and SSH key location with your own values.

The region and zone flags are mandatory and specify the node’s region and zone. You can assign any meaningful string to these values, which are then mapped to Kubernetes node labels (`topology.kubernetes.io/region` and `topology.kubernetes.io/zone`). These labels enable workload scheduling based on node topology. In this example we used “us-east-3” because that is Lambda’s name for the Washington region.
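
For example, once the node carries these labels, a workload can be pinned to this region with a nodeSelector. A minimal sketch (the Pod name and image are placeholders, not part of this tutorial’s workload):

apiVersion: v1
kind: Pod
metadata:
  name: region-pinned-example
spec:
  nodeSelector:
    topology.kubernetes.io/region: us-east-3   # matches the --region value passed above
  containers:
    - name: app
      image: nginx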

You can find more information about this command in the Cloudfleet documentation.

When we execute the command, the Cloudfleet CLI installs the required packages and files on the VM. Once complete, it starts the kubelet, which automatically registers the node with the Kubernetes cluster.

In a few seconds, we can list the nodes again to confirm that the new node has joined the cluster:

$ kubectl get nodes
NAME            STATUS   ROLES    AGE   VERSION
192-222-59-83   Ready    <none>   24s   v1.31.3

Let us describe the node to see its details:

$ kubectl describe nodes 192-222-59-83
Name:               192-222-59-83
Roles:              <none>
Labels:             beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/os=linux
                    cfke.io/accelerator-manufacturer=NVIDIA
                    cfke.io/provider=self-managed
                    cfke.io/provider-region=us-east-3
                    cfke.io/provider-zone=us-east-3
                    failure-domain.beta.kubernetes.io/region=us-east-3
                    failure-domain.beta.kubernetes.io/zone=us-east-3
                    internal.cfke.io/can-schedule-system-pods=true
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=192-222-59-83
                    kubernetes.io/os=linux
                    topology.kubernetes.io/region=us-east-3
                    topology.kubernetes.io/zone=us-east-3
....
Capacity:
  cpu:                64
  ephemeral-storage:  4161736552Ki
  hugepages-16Gi:     0
  hugepages-2Mi:      0
  hugepages-512Mi:    0
  memory:             551213760Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                64
  ephemeral-storage:  3835456399973
  hugepages-16Gi:     0
  hugepages-2Mi:      0
  hugepages-512Mi:    0
  memory:             551111360Ki
  nvidia.com/gpu:     1

We see “nvidia.com/gpu: 1” in the capacity section, which shows that the GPU is now available for use by a Pod.
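
To check GPU capacity across all nodes at once, one option is a custom-columns query (a sketch; the escaped dots are needed because the resource name itself contains dots):

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'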

Deploy a Pod using the GPU

Now, let’s put everything together and run a Pod that uses the NVIDIA GPU and CUDA. We can run the following command to deploy an example Pod:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

Pay attention to the “limits” section, where we request one GPU; this instructs Kubernetes to schedule the Pod on a node with a GPU available, which is our Lambda node.
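
If your cluster later mixes GPU and CPU-only nodes from several providers, you could additionally pin the Pod to NVIDIA-equipped nodes using the accelerator label we saw in the node description earlier. A sketch of the extra lines you would add under spec: in the manifest above:

  nodeSelector:
    cfke.io/accelerator-manufacturer: NVIDIA   # label set by Cloudfleet on this node, as shown above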

Let us see if the Pod is successfully scheduled:

$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
gpu-test   1/1     Running   0          5s

Finally, we can view the Pod’s logs to make sure everything worked as expected:

$ kubectl logs gpu-test
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
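
Once you have verified the output, you can remove the test Pod:

kubectl delete pod gpu-test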

Next steps

That’s a wrap!

In this tutorial, we covered the quickest way to set up a Kubernetes cluster on Lambda. We have seen how to create a Lambda instance with a modern GPU, add it to a Cloudfleet-managed Kubernetes cluster, and run a GPU job on it.

With Cloudfleet, you can create a free Kubernetes cluster in minutes without manually managing infrastructure. Cloudfleet allows you to bring all your infrastructure providers together under one single Kubernetes cluster, creating a global “supercomputer” with just a few clicks.

To learn more about Cloudfleet, visit our homepage.