Status: Complete
Assignee: Waqar
Reporter: Matt Mabis
Impact: Medium
Time remaining: 0m
Created October 8, 2021 at 7:34 PM
Updated July 6, 2022 at 8:58 PM
Resolved October 22, 2021 at 11:36 AM
I have been working with Sonicaj on Discord, who asked me to create this ticket.
I am using a Tesla P4 GPU, which is recognized by the nvidia-smi command. I had to make a lot of tweaks to get it visible in the TrueNAS GUI under Charts/K3s; however, even when setting GPU to 1 it still fails.
From what we gather, Tesla GPUs are supported by K3s but are not recognized natively by the TrueNAS SCALE platform. The Tesla is identified as a 3D controller, not a VGA controller:
kronos# midclt call device.get_gpus | jq
[
  {
    "addr": {
      "pci_slot": "0000:06:00.0",
      "domain": "0000",
      "bus": "06",
      "slot": "00"
    },
    "description": "ASPEED Technology, Inc. ASPEED Graphics Family",
    "devices": [
      {
        "pci_id": "1A03:2000",
        "pci_slot": "0000:06:00.0",
        "vm_pci_slot": "pci_0000_06_00_0"
      }
    ],
    "vendor": null,
    "available_to_host": true
  }
]
kronos# lspci | grep -i 3d
03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
kronos# lspci | grep -i graphics
06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
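To illustrate the suspected cause, here is a minimal Python sketch that picks GPUs out of lspci output by PCI class name. It is not the TrueNAS middleware's implementation of device.get_gpus; the function name find_gpus, the class list, and the extra SATA sample line are assumptions made for illustration. The point is only that a filter matching "VGA compatible controller" alone would skip the Tesla P4, while one that also accepts "3D controller" would find it.

```python
import re

# PCI class names that lspci prints for display-type devices.
# Assumption: a GPU may appear under any of these, not only VGA.
GPU_CLASSES = ("VGA compatible controller", "3D controller", "Display controller")

# Matches lines such as "03:00.0 3D controller: NVIDIA Corporation ..."
LSPCI_LINE = re.compile(r"^(?P<slot>[0-9a-fA-F:.]+)\s+(?P<cls>[^:]+):\s+(?P<desc>.+)$")


def find_gpus(lspci_output, classes=GPU_CLASSES):
    """Return (slot, class, description) for every device whose class is listed."""
    gpus = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and match.group("cls") in classes:
            gpus.append((match.group("slot"), match.group("cls"), match.group("desc")))
    return gpus


# The first two lines are taken from this ticket; the SATA line is a
# hypothetical non-GPU device added to show that it is ignored.
sample = """\
03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
00:1f.2 SATA controller: Example Vendor Example Device (rev 05)
"""

for slot, cls, desc in find_gpus(sample):
    print(slot, cls, desc)

# A VGA-only filter misses the Tesla P4:
vga_only = find_gpus(sample, classes=("VGA compatible controller",))
print(len(vga_only))
```

With the VGA-only class list, only the ASPEED device is returned, which mirrors what device.get_gpus reports above.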
Here are the steps I took to get it recognized by the K3s/TrueNAS SCALE UI. As far as I can tell, it still doesn't work correctly with Plex.
Modify /etc/docker/daemon.json: add the following inside the existing top-level JSON object (between the first and last braces):
"default-runtime": "nvidia",
"runtimes": {
  "nvidia": {
    "path": "/usr/bin/nvidia-container-runtime",
    "runtimeArgs": []
  }
}
Install nvidia-docker2, following https://github.com/NVIDIA/k8s-device-plugin (when asked whether to modify daemon.json, say no):
$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker
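The daemon.json change from the steps above can also be made programmatically rather than by hand. The following is a minimal sketch, assuming the file holds a single top-level JSON object; the function name add_nvidia_runtime and the "data-root" value in the example are hypothetical, and only the NVIDIA runtime keys come from this ticket. It works on text so it can be tried without touching /etc/docker/daemon.json.

```python
import json

# Runtime settings exactly as quoted in this ticket.
NVIDIA_RUNTIME = {
    "path": "/usr/bin/nvidia-container-runtime",
    "runtimeArgs": [],
}


def add_nvidia_runtime(daemon_json_text):
    """Return daemon.json text with the nvidia runtime added and set as default.

    Existing keys, including any other runtimes, are preserved.
    """
    config = json.loads(daemon_json_text)
    config["default-runtime"] = "nvidia"
    config.setdefault("runtimes", {})["nvidia"] = dict(NVIDIA_RUNTIME)
    return json.dumps(config, indent=2)


# Hypothetical existing content, used only to show that other keys survive.
existing = '{"data-root": "/example/docker"}'
print(add_nvidia_runtime(existing))
```

Docker still has to be restarted afterwards for the new default runtime to take effect.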
Modify /etc/nvidia-container-runtime/config.toml: remove the @ from the ldconfig setting so that it reads ldconfig = "/sbin/ldconfig".
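For anyone scripting this step, a small sketch of the ldconfig edit is shown below. It operates on the file's text with a regular expression rather than a TOML parser; the function name strip_ldconfig_at and the sample section header are assumptions, and only the resulting value ldconfig = "/sbin/ldconfig" comes from this ticket.

```python
import re

# Matches an uncommented `ldconfig = "@...` line and captures everything
# before the leading @ so that only the @ itself is dropped.
LDCONFIG_AT = re.compile(r'^([ \t]*ldconfig[ \t]*=[ \t]*")@', re.MULTILINE)


def strip_ldconfig_at(toml_text):
    """Remove the leading @ from the ldconfig path, leaving other lines untouched."""
    return LDCONFIG_AT.sub(r"\1", toml_text)


# Assumed sample content; the real file contains more settings.
sample = '[nvidia-container-cli]\nldconfig = "@/sbin/ldconfig"\n'
print(strip_ldconfig_at(sample))
```

Running the function on text that has already been fixed leaves it unchanged, so it is safe to apply more than once.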
Raise the inotify watch limit (this prevents the "fs watcher: no space left on device" error; the default of 8192 is very low):
sysctl fs.inotify.max_user_watches=1048576
Create the NVIDIA device plugin DaemonSet in k3s:
k3s kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
Validate that the device plugin is working:
k3s kubectl logs -f nvidia-device-plugin-daemonset-zmd4s -n kube-system
If all is good, the output should look like this:
kronos# k3s kubectl logs -f nvidia-device-plugin-daemonset-zmd4s -n kube-system
2021/09/26 04:04:31 Loading NVML
2021/09/26 04:04:31 Starting FS watcher.
2021/09/26 04:04:31 Starting OS watcher.
2021/09/26 04:04:31 Retreiving plugins.
2021/09/26 04:04:31 Starting GRPC server for 'nvidia.com/gpu'
2021/09/26 04:04:31 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/09/26 04:04:31 Registered device plugin for 'nvidia.com/gpu' with Kubelet
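The log check above can be automated with a short sketch like the one below. It simply looks for the final registration message in captured log text; the function name plugin_registered is hypothetical, and the marker string is copied from the log output in this ticket, so it may differ in other plugin versions.

```python
# Message the device plugin logged in this ticket once registration succeeded.
REGISTERED_MARKER = "Registered device plugin for 'nvidia.com/gpu' with Kubelet"


def plugin_registered(log_text):
    """Return True if any log line contains the registration message."""
    return any(REGISTERED_MARKER in line for line in log_text.splitlines())


# Two lines copied from the log output in this ticket.
sample_log = """\
2021/09/26 04:04:31 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/09/26 04:04:31 Registered device plugin for 'nvidia.com/gpu' with Kubelet
"""

print(plugin_registered(sample_log))
```

The log text could be captured by running the k3s kubectl logs command shown earlier without -f and passing its output to the function.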
From what I gather, it should have worked right out of the box, but my guess is that because the device is recognized as a 3D controller, it's being missed.