By Zhiming Shen
1. Introduction
Recently, NVIDIA open-sourced the KAI Scheduler from run.ai (github.com/NVIDIA/KAI-Scheduler), a Kubernetes-native scheduler for AI workloads at large scale. It aims to optimize GPU resource efficiency in Kubernetes through dynamic GPU resource allocation, batch scheduling, quota management, workload prioritization, and more. One of the key features offered by KAI is GPU sharing: workloads can request fractional GPUs and share physical GPUs. By running multiple workloads concurrently on the same physical GPU, ML/AI infrastructure can significantly improve GPU resource efficiency, reduce job waiting times, and achieve faster time-to-market.
We conducted a case study to see how well KAI Scheduler (v0.2.0) supports real-world workloads. In particular, we explored whether it can effectively run two vLLM models on the same GPU. This blog outlines our hands-on experience with KAI Scheduler, highlights key challenges in workload isolation, and compares KAI's capabilities with Exostellar's own fractional GPU solution: Software-Defined GPU (SDG).
2. Hands-on with KAI Scheduler v0.2.0
2.1 Creating a GKE cluster with NVIDIA GPU operator
To try out KAI, you first need a Kubernetes cluster that meets the prerequisites. Here is how we created a GKE cluster for our evaluation:
export CLUSTER_NAME="kai-test"
gcloud container clusters create $CLUSTER_NAME \
  --zone=us-central1-a \
  --num-nodes=1 \
  --machine-type=e2-standard-4 \
  --cluster-version=1.32.2-gke.1182001 \
  --enable-ip-alias
This brings up a cluster with a node that can run ordinary workloads and daemonsets. We now need to add a GPU node pool. Note that we disable the default GKE device plugin so that we can install the NVIDIA GPU operator ourselves.
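Before adding the GPU node pool, make sure kubectl is pointed at the new cluster. A standard gcloud command for this (adjust the zone if you used a different one):

gcloud container clusters get-credentials $CLUSTER_NAME --zone=us-central1-a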
gcloud container node-pools create gpu-pool \
  --cluster=$CLUSTER_NAME \
  --zone=us-central1-a \
  --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=disabled \
  --machine-type=n1-standard-4 \
  --num-nodes=1 \
  --node-taints=nvidia.com/gpu=present:NoSchedule \
  --node-labels="gke-no-default-nvidia-gpu-device-plugin=true"
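Once the node pool is created, you can confirm that the GPU node has joined the cluster. GKE labels nodes with their node pool name, so a label selector like the following should work:

kubectl get nodes -l cloud.google.com/gke-nodepool=gpu-pool -o wide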
Once the GPU node is ready, the next step is installing NVIDIA GPU operator:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
  namespace: gpu-operator
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
EOF

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

helm install --wait --generate-name \
  -n gpu-operator \
  nvidia/gpu-operator \
  --version=v25.3.0 \
  --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set driver.enabled=false
Once everything is installed, you should be able to see all the GPU operator components in running status:
# kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-fvbcc                                       1/1     Running     0          6h6m
gpu-operator-1743806983-node-feature-discovery-gc-858cdfd9499ml   1/1     Running     0          31h
gpu-operator-1743806983-node-feature-discovery-master-7b5fd2pbk   1/1     Running     0          31h
gpu-operator-1743806983-node-feature-discovery-worker-5wtht       1/1     Running     0          6h6m
gpu-operator-1743806983-node-feature-discovery-worker-9tpvg       1/1     Running     0          31h
gpu-operator-7bddfb4dc8-hxwqb                                     1/1     Running     0          31h
nvidia-container-toolkit-daemonset-xcsjl                          1/1     Running     0          6h6m
nvidia-cuda-validator-hd99w                                       0/1     Completed   0          6h5m
nvidia-dcgm-exporter-f6kh8                                        1/1     Running     0          6h6m
nvidia-device-plugin-daemonset-lb995                              1/1     Running     0          6h6m
nvidia-operator-validator-2d8ns                                   1/1     Running     0          6h6m
2.2 Installing KAI with GPU sharing support
Installing KAI with helm is straightforward:
helm repo add nvidia-k8s https://helm.ngc.nvidia.com/nvidia/k8s
helm repo update
helm upgrade -i kai-scheduler nvidia-k8s/kai-scheduler \
  -n kai-scheduler --create-namespace \
  --set "global.registry=nvcr.io/nvidia/k8s" \
  --set "global.gpuSharing=true" \
  --set binder.additionalArgs[0]="--cdi-enabled=true"
Note that the installation command above differs from KAI's official documentation, because GPU sharing and CDI are not enabled by default.
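If you want to double-check which values were actually applied (for example, that GPU sharing and CDI were turned on), helm can show the user-supplied values:

helm get values kai-scheduler -n kai-scheduler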
After the installation, you should see the following pods running in the kai-scheduler namespace:
# kubectl get pods -n kai-scheduler
NAME                         READY   STATUS    RESTARTS   AGE
binder-9c65659cc-lv9h5       1/1     Running   0          30h
podgrouper-8d7b487f-nb8vr    1/1     Running   0          31h
scheduler-7774cdd658-cfsj4   1/1     Running   0          31h
Before you run any workloads, make sure that you have created the related queues:
kubectl apply -f - <<EOF
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default
spec:
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test
spec:
  parentQueue: default
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
EOF
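Assuming the Queue CRD follows the usual plural naming, you can verify that both queues were created with something like:

kubectl get queues.scheduling.run.ai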
From here, you can run test workloads by following the quick start guide in the official KAI GitHub repo. For the purpose of this blog, we will focus on GPU sharing.
2.3 Running vLLM with fractional GPUs on KAI
Since we only have one T4 GPU, we picked the TinyLlama-1.1B model (TinyLlama/TinyLlama-1.1B-Chat-v1.0 on Hugging Face) to evaluate KAI's fractional GPU support with vLLM. The model is around 2GB in size, and our goal is to run two TinyLlama-1.1B vLLM instances on our T4 GPU, each allocated 5GB of GPU memory.
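The deployment below pulls the model from Hugging Face using a token stored in a Kubernetes Secret named huggingface-secret (key token). If you have not created it yet, here is a minimal sketch, assuming your token is in the HF_TOKEN environment variable:

# HF_TOKEN is assumed to hold your Hugging Face access token
kubectl create secret generic huggingface-secret \
  --from-literal=token="$HF_TOKEN" \
  -n default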
The YAML file for our deployment is as follows:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kai-tinyllama-1-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kai-tinyllama-1
  namespace: default
  labels:
    app: kai-tinyllama-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kai-tinyllama-1
  template:
    metadata:
      labels:
        app: kai-tinyllama-1
        runai/queue: test
      annotations:
        gpu-memory: "5120"
    spec:
      schedulerName: kai-scheduler
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "present"
        effect: "NoSchedule"
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: kai-tinyllama-1-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: kai-tinyllama-1
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype=half"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
The part that enables KAI's fractional GPU support is the following. As you can see, we specified kai-scheduler as the scheduler, submitted the pod to the test queue via the runai/queue label, and requested GPU memory through the gpu-memory annotation, which we set to 5GB (5120MiB) for our evaluation:
  template:
    metadata:
      labels:
        app: kai-tinyllama-1
        runai/queue: test
      annotations:
        gpu-memory: "5120"
    spec:
      schedulerName: kai-scheduler
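As described in KAI's GPU sharing documentation, the request can also be expressed as a fraction of the device instead of an absolute amount of memory. A sketch of that alternative, keeping the rest of the template the same:

      annotations:
        # request roughly a third of the GPU instead of a fixed 5120MiB
        gpu-fraction: "0.33"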
It will take a while for the pod to download the model and initialize vLLM, but once it is ready, you should see the following log messages:
# kubectl logs deployment/kai-tinyllama-1
...
INFO 04-05 23:53:32 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000
...
INFO 04-05 23:53:32 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-05 23:53:32 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [35]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
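If you prefer to block until the rollout completes instead of watching the logs, a standard kubectl command works:

kubectl rollout status deployment/kai-tinyllama-1 --timeout=15m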
You can now send queries to the model! For example, you can get a shell into the pod and call the server directly:
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
An example response:
{"id":"chatcmpl-08886b7dba064eae9474d094432a929c","object":"chat.completion","created":1743965960,"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The capital of France is Paris.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":23,"total_tokens":31,"completion_tokens":8,"prompt_tokens_details":null},"prompt_logprobs":null}
2.4 The challenge of fractional GPU isolation and accounting on KAI
So far everything looks great, but there is one catch: KAI doesn't actually enforce GPU memory isolation. This is stated in the documentation (https://github.com/NVIDIA/KAI-Scheduler/blob/main/docs/gpu-sharing/README.md):
KAI Scheduler does not enforce memory allocation limit or performs memory isolation between processes. In order to make sure the pods share the GPU device nicely it is important that the running processes will allocate GPU memory up to the requested amount and not beyond that.
What this means is that although workloads can request fractional GPUs and KAI can pack multiple fractional-GPU workloads onto the same physical GPU, it is the applications' responsibility to play nicely with each other. This has an important implication for applications, especially vLLM: by default, vLLM assumes full control of all GPU memory it can detect and allocates 90% of it up front.
With the workload we just deployed, if you get a shell into the pod and run nvidia-smi, you will see that although we requested 5GB of GPU memory for the pod, all 15360MiB of the GPU's memory is visible, and our TinyLlama-1.1B model is already using about 14GB of it. This is confirmed by its logs:
INFO 04-05 23:53:09 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.58GiB) x gpu_memory_utilization (0.90) = 13.12GiB
INFO 04-05 23:53:09 [worker.py:267] model weights take 2.05GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 0.31GiB; the rest of the memory reserved for KV Cache is 10.72GiB.
We then made a copy of the kai-tinyllama-1 deployment and tried to deploy a second TinyLlama-1.1B instance, also requesting 5GB of GPU memory. The pod was indeed scheduled onto the same physical GPU, but vLLM failed with an out-of-memory error:
INFO 04-06 00:25:52 [model_runner.py:1110] Starting to load model TinyLlama/TinyLlama-1.1B-Chat-v1.0...
ERROR 04-06 00:25:53 [engine.py:448] CUDA out of memory. Tried to allocate 44.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 37.62 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 732.33 MiB is allocated by PyTorch, and 27.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
To solve this problem, vLLM applications that share a GPU on KAI need to use an additional flag (--gpu-memory-utilization) to limit GPU memory usage. However, it takes effort to tune this parameter for different GPUs so that each vLLM instance's memory usage matches its request. For example, in our setup this value needs to be set to 0.34 to keep the overall GPU memory usage under 5GB. Even then, since each application sees the whole GPU, there is no easy way to do accounting properly, and no one can really tell how much GPU resource each application is using. The complexity this introduces can defeat the purpose of sharing the GPU in the first place.
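For reference, this is roughly what the container args in our deployment would look like with the workaround applied. The 0.34 value is specific to our T4 and our 5GB request, so treat it as an illustration rather than a recommended setting:

        args: [
          "vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype=half --gpu-memory-utilization 0.34"
        ]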
In conclusion, fractional GPU is not just a scheduling problem. Without a proper isolation mechanism, GPU applications can easily interfere with each other, and handling this adds significant complexity for AI developers. An ideal fractional GPU solution should be able to:
- Pack multiple GPU workloads onto the same physical GPU efficiently.
- Provide virtualization and isolation support so that each application can run with the same assumption of dedicated GPUs.
3. Exostellar Software-Defined GPU (SDG)
3.1 Introduction to SDG
The mechanism that KAI leverages for GPU sharing is time slicing. It offers flexible GPU sharing but lacks essential memory isolation. Conversely, hardware-based solutions like NVIDIA's Multi-Instance GPU (MIG) deliver stronger isolation through dedicated fractional GPU partitions, but the requirement to statically partition the GPU into pre-defined profiles inherently restricts flexibility and adaptability.
Exostellar SDG introduces a new approach to GPU virtualization. SDG allows users to define the exact amount of GPU memory and compute resources required through software configuration, delivering unmatched flexibility. Compared with traditional fractional GPU solutions, SDG offers the following unique advantages:
- Flexibility and vendor independence: SDG is compatible with different GPU types and generations, including older NVIDIA GPUs that don't support MIG, AMD GPUs, and other accelerators.
- Fractionalization of both compute and memory: depending on the underlying hardware capabilities, SDG allows control of both compute and memory, going one step further than memory-only approaches.
- Isolation and accounting: SDG is essentially a GPU virtualization solution. An application running on SDG gets the illusion of dedicated GPU hardware. We also make sure that accounting is done properly, so that the monitoring functions of third-party tools and libraries consistently see the usage of the specific SDG.
- Oversubscription and dynamic right-sizing: as the foundation of Exostellar's AI infrastructure optimization platform, SDG supports advanced optimizations such as oversubscription and dynamic right-sizing, bringing overall GPU utilization to the next level.
3.2 Running vLLM with SDG
To run the same TinyLlama-1.1B workload with SDG, we just need to slightly adjust the deployment YAML:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sdg-tinyllama-1-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sdg-tinyllama-1
  namespace: default
  labels:
    app: sdg-tinyllama-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sdg-tinyllama-1
  template:
    metadata:
      labels:
        app: sdg-tinyllama-1
        sdg.exostellar.io/memory: "5GB"
        sdg.exostellar.io/count: "1"
        sdg.exostellar.io/vendor: "nvidia"
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: sdg-tinyllama-1-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: sdg-tinyllama-1
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype=half"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
The magic happens with the following lines:
  template:
    metadata:
      labels:
        app: sdg-tinyllama-1
        sdg.exostellar.io/memory: "5GB"
        sdg.exostellar.io/count: "1"
        sdg.exostellar.io/vendor: "nvidia"
By specifying the vendor, the number of GPUs, and the GPU memory, users can configure precisely what the application needs. In this example, we told SDG that this workload should see a 5GB virtual GPU device.
Once this is running on Kubernetes, we can check the logs to see whether vLLM detected the memory size correctly:
# kubectl logs deployment/sdg-tinyllama-1
...
INFO 04-05 19:36:38 [worker.py:267] the current vLLM instance can use total_gpu_memory (5.00GiB) x gpu_memory_utilization (0.90) = 4.50GiB
INFO 04-05 19:36:38 [worker.py:267] model weights take 2.05GiB; non_torch_memory takes 0.01GiB; PyTorch activation peak memory takes 0.31GiB; the rest of the memory reserved for KV Cache is 2.13GiB.
You can see that vLLM detected a 5GB GPU device and allocated 4.5GB of it. If you open a shell in the pod and run nvidia-smi, you will see the same 5GB virtual GPU device.
Now we can deploy another TinyLlama-1.1B model with the same parameters. It was allocated to the same GPU node, sharing the same T4 GPU:
# kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE                                  NOMINATED NODE   READINESS GATES
sdg-tinyllama-1-55ccf58b54-b75zl   1/1     Running   0          15h   10.44.1.14   gke-sdg-test-gpu-pool-4098d909-79t7   <none>           <none>
sdg-tinyllama-2-96c4bc487-27jc6    1/1     Running   0          15h   10.44.1.15   gke-sdg-test-gpu-pool-4098d909-79t7   <none>           <none>
Checking the logs from our second deployment:
# kubectl logs deployment/sdg-tinyllama-2
...
INFO 04-05 19:39:24 [worker.py:267] the current vLLM instance can use total_gpu_memory (5.00GiB) x gpu_memory_utilization (0.90) = 4.50GiB
INFO 04-05 19:39:24 [worker.py:267] model weights take 2.05GiB; non_torch_memory takes 0.01GiB; PyTorch activation peak memory takes 0.31GiB; the rest of the memory reserved for KV Cache is 2.13GiB.
If you get a shell into the second pod and run nvidia-smi, you will see that it is also using about 4.5GB of GPU memory, but neither pod can see the other's usage. Both run under the assumption of a dedicated GPU device, with no additional parameters or code changes.
If you want to see the real GPU usage from the host's perspective, you can get a shell into the nvidia-device-plugin-daemonset pod and run nvidia-smi there. On our node, the host sees about 10GB in use out of the 15GB total GPU memory.
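A quick way to do this, using the device plugin pod name from the earlier kubectl get pods -n gpu-operator output (the name will differ in your cluster):

kubectl exec -n gpu-operator -it nvidia-device-plugin-daemonset-lb995 -- nvidia-smi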
4. Conclusion
With the rapid growth of ML/AI workloads, efficient GPU utilization has become increasingly important, especially in Kubernetes environments. GPU sharing, where multiple workloads concurrently utilize fractions of a single GPU, promises significant efficiency gains. We evaluated the GPU sharing solution offered by KAI v0.2.0 with a simple use case: running two vLLM models on the same NVIDIA T4 GPU. The experience demonstrated the importance of proper isolation and virtualization of GPU devices, so that existing workloads can run without complicated tuning and modifications.
With this challenge in mind, we introduced Exostellar SDG, a GPU virtualization solution that offers a better combination of flexibility, isolation, and performance. SDG is vendor-independent, working across different generations of NVIDIA GPUs as well as AMD GPUs and other accelerators. With SDG, we successfully ran two vLLM models on the same T4 GPU without any changes to vLLM parameters.
Early access: https://exostellar.io/gpu-optimizer-early-access/