
Exploring GPU Sharing in Kubernetes with NVIDIA KAI Scheduler and Exostellar SDG

By Zhiming Shen

1. Introduction

Recently, NVIDIA open-sourced KAI Scheduler from run.ai (GitHub – NVIDIA/KAI-Scheduler: KAI Scheduler is an open source Kubernetes Native scheduler for AI workloads at large scale), which aims to optimize GPU resource efficiency in Kubernetes through dynamic GPU resource allocation, batch scheduling, quota management, workload prioritization, and more. One of the key features offered by KAI is GPU sharing: allowing workloads to request fractional GPUs and share physical GPUs. By running multiple workloads concurrently on the same physical GPU, ML/AI infrastructure can significantly improve GPU resource efficiency, reduce job waiting times, and achieve faster time-to-market.

We conducted a case study to see how well KAI Scheduler (v0.2.0) supports real-world workloads. In particular, we explored whether it can effectively run two vLLM models on the same GPU. This blog outlines our hands-on experience with KAI Scheduler, highlights key challenges in workload isolation, and compares KAI’s capabilities with Exostellar’s own fractional GPU solution: Software-Defined GPU (SDG).

2. Hands-on with KAI Scheduler v0.2.0

2.1 Creating a GKE cluster with NVIDIA GPU operator

To play with KAI, you first need to create a Kubernetes cluster that meets the prerequisites. Here is how we created a GKE cluster for our evaluation:

export CLUSTER_NAME="kai-test" 

gcloud container clusters create $CLUSTER_NAME \ 
--zone=us-central1-a \ 
--num-nodes=1 \ 
--machine-type=e2-standard-4 \ 
--cluster-version=1.32.2-gke.1182001 \ 
--enable-ip-alias

This brings up a cluster with a node that can be used to run ordinary workloads and DaemonSets. We now need to add a GPU node pool. Note that we disable the default device plugin so that we can install the NVIDIA GPU operator instead.

gcloud container node-pools create gpu-pool \ 
--cluster=$CLUSTER_NAME \ 
--zone=us-central1-a \ 
--accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=disabled \ 
--machine-type=n1-standard-4 \ 
--num-nodes=1 \ 
--node-taints=nvidia.com/gpu=present:NoSchedule \ 
--node-labels="gke-no-default-nvidia-gpu-device-plugin=true"
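Before moving on, you may want to make sure kubectl is pointed at the new cluster and confirm that the GPU node has actually joined. A quick sketch using standard gcloud/kubectl commands (the selector relies on GKE's built-in cloud.google.com/gke-nodepool node label):

# Fetch credentials so that kubectl talks to the new cluster
gcloud container clusters get-credentials $CLUSTER_NAME --zone=us-central1-a

# List the nodes in the GPU node pool; you should see one node in Ready state
kubectl get nodes -l cloud.google.com/gke-nodepool=gpu-pool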

Once the GPU node is ready, the next step is installing NVIDIA GPU operator:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
  namespace: gpu-operator
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
        - system-node-critical
        - system-cluster-critical
EOF

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
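If the nvidia Helm repository is not already configured on your workstation, you would also need to add it before running the install below (this follows NVIDIA's standard GPU operator instructions; adjust if your environment already has the repo under a different name):

# Add the NVIDIA Helm repository that provides the gpu-operator chart
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update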

helm install --wait --generate-name \
    -n gpu-operator \
    nvidia/gpu-operator \
    --version=v25.3.0 \
    --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
    --set toolkit.installDir=/home/kubernetes/bin/nvidia \
    --set cdi.enabled=true \
    --set cdi.default=true \
    --set driver.enabled=false

Once everything is installed, you should see all the GPU operator components in the Running (or Completed) state:

# kubectl get pods -n gpu-operator
NAME                                                               READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-fvbcc                                        1/1     Running     0          6h6m
gpu-operator-1743806983-node-feature-discovery-gc-858cdfd9499ml    1/1     Running     0          31h
gpu-operator-1743806983-node-feature-discovery-master-7b5fd2pbk    1/1     Running     0          31h
gpu-operator-1743806983-node-feature-discovery-worker-5wtht        1/1     Running     0          6h6m
gpu-operator-1743806983-node-feature-discovery-worker-9tpvg        1/1     Running     0          31h
gpu-operator-7bddfb4dc8-hxwqb                                      1/1     Running     0          31h
nvidia-container-toolkit-daemonset-xcsjl                           1/1     Running     0          6h6m
nvidia-cuda-validator-hd99w                                        0/1     Completed   0          6h5m
nvidia-dcgm-exporter-f6kh8                                         1/1     Running     0          6h6m
nvidia-device-plugin-daemonset-lb995                               1/1     Running     0          6h6m
nvidia-operator-validator-2d8ns                                    1/1     Running     0          6h6m
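At this point the device plugin should be advertising the GPU to Kubernetes. A quick, hedged way to double-check that the node now exposes an allocatable nvidia.com/gpu resource:

# Inspect the GPU node and look for nvidia.com/gpu under Capacity/Allocatable
# (it should show 1 for our single T4)
kubectl describe nodes -l cloud.google.com/gke-nodepool=gpu-pool | grep "nvidia.com/gpu"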

2.2 Installing KAI with GPU sharing support

The installation of KAI is straightforward with Helm:

helm repo add nvidia-k8s https://helm.ngc.nvidia.com/nvidia/k8s

helm repo update

helm upgrade -i kai-scheduler nvidia-k8s/kai-scheduler -n kai-scheduler --create-namespace --set "global.registry=nvcr.io/nvidia/k8s" --set "global.gpuSharing=true" --set binder.additionalArgs[0]="--cdi-enabled=true"

Note that the above installation command differs from KAI’s official documentation, because GPU sharing and CDI are not enabled by default.

After the installation, you should see the following pods running in the kai-scheduler namespace:

# kubectl get pods -n kai-scheduler
NAME                         READY   STATUS    RESTARTS   AGE
binder-9c65659cc-lv9h5       1/1     Running   0          30h
podgrouper-8d7b487f-nb8vr    1/1     Running   0          31h
scheduler-7774cdd658-cfsj4   1/1     Running   0          31h

Before you run any workloads, make sure that you have created the required queues:

kubectl apply -f - <<EOF
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default
spec:
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test
spec:
  parentQueue: default
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
EOF
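To confirm that the queues were created, you can list them through the CRD (assuming the resource plural registered by KAI is queues.scheduling.run.ai):

# List the KAI queues; both "default" and "test" should appear
kubectl get queues.scheduling.run.ai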

From here, you can run test workloads by following the quick start guide in the official KAI GitHub repo. For the purposes of this blog, we will mainly focus on GPU sharing.

2.3 Running vLLM with fractional GPUs on KAI

Since we only have one T4 GPU, we evaluated KAI’s fractional GPU support using the TinyLlama-1.1B model with vLLM (TinyLlama/TinyLlama-1.1B-Chat-v1.0 · Hugging Face). The model is around 2GB in size, and our goal is to run two TinyLlama-1.1B vLLM instances on our T4 GPU, each allocated 5GB of GPU memory.

The YAML file for our deployment is as follows:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kai-tinyllama-1-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kai-tinyllama-1
  namespace: default
  labels:
    app: kai-tinyllama-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kai-tinyllama-1
  template:
    metadata:
      labels:
        app: kai-tinyllama-1
        runai/queue: test
      annotations:
        gpu-memory: "5120" 
    spec:
      schedulerName: kai-scheduler
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "present"
        effect: "NoSchedule"
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: kai-tinyllama-1-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: kai-tinyllama-1
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype=half"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm

The most important part, which enables KAI fractional GPU sharing, is the following snippet. As you can see, we specified kai-scheduler as the scheduler for our pod, submitted it to the test queue via the runai/queue label, and configured the GPU memory request through the gpu-memory annotation. For our evaluation, we set it to 5GB:

 template:
    metadata:
      labels:
        app: kai-tinyllama-1
        runai/queue: test
      annotations:
        gpu-memory: "5120" 
    spec:
      schedulerName: kai-scheduler

It will take a while for the pod to download the model and initialize vLLM, but once it is ready, you should see log messages like the following:

# kubectl logs deployment/kai-tinyllama-1
...
INFO 04-05 23:53:32 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000
...
INFO 04-05 23:53:32 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-05 23:53:32 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [35]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

You can now send queries to the model! For example, you can simply get a shell into the pod and then call the server directly:

curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

An example response:

{"id":"chatcmpl-08886b7dba064eae9474d094432a929c","object":"chat.completion","created":1743965960,"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The capital of France is Paris.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":23,"total_tokens":31,"completion_tokens":8,"prompt_tokens_details":null},"prompt_logprobs":null}

2.4 The challenge of fractional GPU isolation and accounting on KAI

So far everything looks great, but there is one catch: KAI does not actually enforce GPU memory isolation. This is stated in the documentation at https://github.com/NVIDIA/KAI-Scheduler/blob/main/docs/gpu-sharing/README.md:

KAI Scheduler does not enforce memory allocation limit or performs memory isolation between processes. In order to make sure the pods share the GPU device nicely it is important that the running processes will allocate GPU memory up to the requested amount and not beyond that.

What this means is that although workloads can request fractional GPUs and KAI can pack multiple fractional GPU workloads onto the same physical GPU, it is the applications’ responsibility to play nicely with each other. This has a very important implication for applications, especially vLLM, because by default vLLM assumes full control of all the GPU memory it can detect and allocates 90% of it up front.

With the workload that we just deployed, if you get a shell into the pod and run nvidia-smi, here is what you see:

[nvidia-smi output from inside the pod: the full 15360MiB of the T4's memory is visible, with roughly 14GB already in use]

Note that although we requested 5GB of GPU memory for the pod, we can actually see all 15360MiB of GPU memory, and our single TinyLlama-1.1B instance is already using around 14GB. This is confirmed by its logs:

INFO 04-05 23:53:09 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.58GiB) x gpu_memory_utilization (0.90) = 13.12GiB
INFO 04-05 23:53:09 [worker.py:267] model weights take 2.05GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 0.31GiB; the rest of the memory reserved for KV Cache is 10.72GiB.

We made a copy of the kai-tinyllama-1 deployment and tried to deploy another TinyLlama-1.1B instance, also requesting 5GB of GPU memory. The pod was indeed scheduled onto the same physical GPU, but vLLM failed with an out-of-memory error:

INFO 04-06 00:25:52 [model_runner.py:1110] Starting to load model TinyLlama/TinyLlama-1.1B-Chat-v1.0...

ERROR 04-06 00:25:53 [engine.py:448] CUDA out of memory. Tried to allocate 44.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 37.62 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 732.33 MiB is allocated by PyTorch, and 27.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

To solve this problem, vLLM applications that share a GPU on KAI need to use an additional flag (--gpu-memory-utilization) to control GPU memory utilization. However, it takes some effort to tune this parameter for different GPUs so that each vLLM instance’s memory usage matches what was requested. For example, in our setup this value needs to be set to 0.34 to keep the overall GPU memory usage under 5GB. Even then, since each application sees the whole GPU, there is no easy way to do accounting properly, and no one can really tell how much GPU resource is being used by each application. The complexity this introduces can defeat the purpose of sharing the GPU in the first place.
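For reference, here is a sketch of what the serve command in the container args would look like with that flag (0.34 is the value that worked for our T4 and would need re-tuning on other GPUs):

# Cap vLLM at ~34% of the detected GPU memory (roughly 5GB on a T4)
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype=half --gpu-memory-utilization=0.34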

In conclusion, fractional GPU support is not just a scheduling problem. Without a proper isolation mechanism, GPU applications can easily interfere with each other, and handling this introduces significant complexity for AI developers. An ideal fractional GPU solution should be able to:

  1. Pack multiple GPU workloads onto the same physical GPU efficiently.
  2. Provide virtualization and isolation support so that each application can run with the same assumption of dedicated GPUs.

3. Exostellar Software-Defined GPU (SDG)

3.1 Introduction to SDG

The mechanism that KAI leverages for GPU sharing is time slicing. It offers flexible GPU sharing but lacks essential memory isolation. Conversely, hardware-based solutions like NVIDIA’s Multi-Instance GPU (MIG) deliver stronger isolation through dedicated fractional GPU partitions, but the requirement to statically partition the GPU with pre-defined profiles inherently restricts flexibility and adaptability.


Exostellar SDG introduces a new approach to GPU virtualization. SDG allows users to define the exact amount of GPU memory and compute resources required through software configurations, thus delivering unmatched flexibility. Compared with traditional fractional GPU solutions, SDG offers the following unique advantages:

  • Flexibility and vendor independence: SDG is compatible with different GPU types and generations, including older NVIDIA GPUs that do not support MIG, AMD GPUs, and other accelerators.
  • Fractionalization of both compute and memory: depending on the underlying hardware capabilities, SDG allows control of both compute and memory, going one step further than memory-only approaches.
  • Isolation and accounting: SDG is essentially a GPU virtualization solution. An application running on SDG gets the illusion of dedicated GPU hardware. We also make sure that accounting is done properly, so that the monitoring functions of third-party tools and libraries consistently see the usage of that specific SDG.
  • Oversubscription and dynamic right-sizing: as the foundation of Exostellar’s AI infrastructure optimization platform, SDG supports advanced optimizations such as oversubscription and dynamic right-sizing, taking overall GPU utilization to the next level.

3.2 Running vLLM with SDG

To run the same TinyLlama-1.1B workload with SDG, we just need to adjust the deployment YAML slightly:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sdg-tinyllama-1-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sdg-tinyllama-1
  namespace: default
  labels:
    app: sdg-tinyllama-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sdg-tinyllama-1
  template:
    metadata:
      labels:
        app: sdg-tinyllama-1
        sdg.exostellar.io/memory: "5GB"
        sdg.exostellar.io/count: "1"
        sdg.exostellar.io/vendor: "nvidia"
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: sdg-tinyllama-1-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: sdg-tinyllama-1
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype=half"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm

The magic happens in the following lines:

 template:
    metadata:
      labels:
        app: sdg-tinyllama-1
        sdg.exostellar.io/memory: "5GB"
        sdg.exostellar.io/count: "1"
        sdg.exostellar.io/vendor: "nvidia"

By specifying the vendor, the number of GPUs, and the GPU memory, users can configure precisely what the application needs. In this example, we told SDG that this workload should see a 5GB virtual GPU device.
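As before, assuming the manifest is saved as sdg-tinyllama-1.yaml, you would apply it and wait for the pod to become ready:

kubectl apply -f sdg-tinyllama-1.yaml
kubectl get pods -l app=sdg-tinyllama-1 -w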

Once this is running on Kubernetes, we can check the logs to see whether vLLM detects the memory size correctly:

# kubectl logs deployment/sdg-tinyllama-1
...
INFO 04-05 19:36:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (5.00GiB) x gpu_memory_utilization (0.90) = 4.50GiB
INFO 04-05 19:36:38 [worker.py:267] model weights take 2.05GiB; non_torch_memory takes 0.01GiB; PyTorch activation peak memory takes 0.31GiB; the rest of the memory reserved for KV Cache is 2.13GiB.

You can see that vLLM detected a 5GB GPU device and allocated 4.5GB of it. If you get a shell into the pod and check nvidia-smi, this is what you see:

[nvidia-smi output from inside the pod: a 5GB virtual GPU is visible, with about 4.5GB in use]

Now we can deploy another TinyLlama-1.1B model with the same parameters. It was scheduled onto the same GPU node, sharing the same T4 GPU:

# kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE                                  NOMINATED NODE   READINESS GATES
sdg-tinyllama-1-55ccf58b54-b75zl   1/1     Running   0          15h   10.44.1.14   gke-sdg-test-gpu-pool-4098d909-79t7   <none>           <none>
sdg-tinyllama-2-96c4bc487-27jc6    1/1     Running   0          15h   10.44.1.15   gke-sdg-test-gpu-pool-4098d909-79t7   <none>           <none>

Checking the logs from our second deployment:

# kubectl logs deployment/sdg-tinyllama-2
...
INFO 04-05 19:39:24 [worker.py:267] the current vLLM instance can use total_gpu_memory (5.00GiB) x gpu_memory_utilization (0.90) = 4.50GiB
INFO 04-05 19:39:24 [worker.py:267] model weights take 2.05GiB; non_torch_memory takes 0.01GiB; PyTorch activation peak memory takes 0.31GiB; the rest of the memory reserved for KV Cache is 2.13GiB.

If you get a shell into the second pod and run nvidia-smi, you can see this:

[nvidia-smi output from inside the second pod: it likewise sees a 5GB virtual GPU with about 4.5GB in use, and none of the first pod's usage]

Note that both pods were each using 4.5GB of GPU memory, but they could not see each other’s usage. They can run under the assumption of a dedicated GPU device, and no additional parameters or code changes were needed.

If you want to see the real GPU usage from the host’s perspective, you can get a shell into the nvidia-device-plugin-daemonset pod and run nvidia-smi there; as shown below, the host sees about 10GB in use out of the 15GB total GPU memory.
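A hedged sketch of one way to open that shell (the device-plugin pod's name, and possibly its label, will differ in your cluster):

# Run nvidia-smi inside the device-plugin pod to see host-level GPU usage
kubectl exec -n gpu-operator -it \
    $(kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset -o name | head -n 1) \
    -- nvidia-smi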

[nvidia-smi output on the host: about 10GB used out of the 15GB total GPU memory]

4. Conclusion

With the rapid growth of ML/AI workloads, efficient GPU utilization has become increasingly important, especially in Kubernetes environments. GPU sharing, where multiple workloads concurrently utilize fractions of a single GPU, promises significant efficiency gains. We evaluated the GPU sharing solution offered by KAI v0.2.0 with a simple use case: running two vLLM models on the same NVIDIA T4 GPU. We demonstrated the importance of proper isolation and virtualization of GPU devices, so that existing workloads can run without complicated tuning and modifications.

With this challenge in mind, we introduced Exostellar SDG, a GPU virtualization solution that offers a better combination of flexibility, isolation, and performance. SDG is vendor-independent, working with different generations of NVIDIA GPUs as well as AMD GPUs and other accelerators. With SDG, we successfully ran two vLLM models on the same T4 GPU without modifying any vLLM parameters.

Early access: https://exostellar.io/gpu-optimizer-early-access/
