Ing. Vojtěch Nikl authored
notebooks-nvidia-device-plugin
What is this repo?
This repo contains a custom NVIDIA Device Plugin that allows us to deploy Jupyter notebooks from our Kubernetes JupyterHub onto GPUs with active MIGs (MIGs simply divide a GPU into smaller equal units, allowing multiple users to work concurrently on a single GPU with divided resources). The original NVIDIA Device Plugin, for reasons unknown, does not allow the master node to see and work with MIGs on worker nodes.
How to make the MIGs work?
Activate MIG mode on the target GPU and create the instances, for example with these commands (a GPU instance must exist before its compute instance; repeat the creation commands to create multiple MIGs)
nvidia-smi -mig 1
nvidia-smi mig -cgi 1g.12gb
nvidia-smi mig -cci 1g.12gb
Other useful arguments are -lgi and -lci to list GPU/compute instances, and -dci and -dgi to delete them (compute instances must be deleted before their GPU instance)
When the MIGs are correctly set up, install the custom NVIDIA Device Plugin onto the master node (where JupyterHub is installed)
helm install nvidia-device-plugin-custom ./nvidia-device-plugin-custom/ -n kube-system
Check that the master node correctly sees the MIGs
kubectl describe node k8s-staging1-gpu-0
The output should contain a line like the following, where the count is the number of available MIGs (1 or more)
nvidia.com/mig-1g.11gb: 7
Other useful commands to work with the plugin
helm uninstall nvidia-device-plugin-custom -n kube-system
kubectl delete pod -n kube-system nvidia-device-plugin-egi-5xwsm
kubectl logs nvidia-device-plugin-egi-87bkb -n kube-system
How to test and benchmark the MIGs?
Folder test-mig-master
shows how to describe MIGs in a Kubernetes YAML file and how to display the MIGs using nvidia-smi inside a Kubernetes container.
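For illustration, a pod requesting a single MIG slice might look like the sketch below. This is an assumption-laden example, not the folder's actual manifest: the pod name, image, and command are hypothetical; the resource key should match what kubectl describe node reports.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-mig            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Hypothetical image; the folder's own manifest may use a different one.
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi", "-L"]     # lists the MIG device visible inside the container
      resources:
        limits:
          nvidia.com/mig-1g.11gb: 1     # resource key as reported by kubectl describe node
```

The scheduler then places the pod on a node advertising that MIG resource, and the container sees only its own slice.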
Folder benchmark
contains Python code with a timed matrix-multiplication algorithm to test the performance of MIGs and whole GPUs. Adjust the N variable, which sets the size of the matrices. We tested with N=16384: one whole NVIDIA H100 GPU without MIGs completed the benchmark in exactly 10 seconds, and after dividing the GPU into 7 equal MIGs (the highest amount available; no smaller MIGs are possible on the H100), each MIG ran the same benchmark in 75 s (±1 s), regardless of how many MIGs were running the benchmark at once. This shows that MIGs add very little overhead, that the combined performance of all MIGs is almost equal to that of the same GPU without MIGs, and that each user's performance is unaffected by other users working on other MIGs of the same GPU.
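The core idea of such a timed benchmark can be sketched as follows. This is a minimal pure-Python sketch with a small N so it runs anywhere; the actual code in the benchmark folder targets the GPU (e.g. via a CUDA-backed library) and uses N=16384, which this sketch does not attempt to reproduce.

```python
import random
import time

def matmul(a, b):
    """Naive O(N^3) multiplication of two square matrices (lists of lists)."""
    n = len(a)
    # Transpose b so the inner loop walks rows, which is cache-friendlier.
    bt = [[b[k][j] for k in range(n)] for j in range(n)]
    return [[sum(a[i][k] * bt[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Small N so the sketch finishes quickly; the real benchmark used N = 16384.
N = 64
a = [[random.random() for _ in range(N)] for _ in range(N)]
b = [[random.random() for _ in range(N)] for _ in range(N)]

start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start
print(f"N={N}: {elapsed:.4f} s")
```

Running the same timed kernel on a whole GPU and then inside each MIG slice gives the per-slice timings the paragraph above reports.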
Folder test-mig-jupyterlab
shows how to test the MIGs inside a Docker container within the environment of a single JupyterLab notebook.