TrueNAS / NAS-110249

nvidia-device-plugin-daemon CrashLoopBackOff


Details

    • Priority: Low

    Description

      The pod nvidia-device-plugin-daemonset-crrs8 stays in CrashLoopBackOff.
      I have an old NVIDIA GPU in the test machine, so I don't know whether that is the cause here.
      My GPU (as reported by lspci): VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1)
      TrueNAS SCALE version: TrueNAS-SCALE-21.04-MASTER-20210418-092917
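
      Since the container error below says the NVML driver is not loaded, a minimal host-side sanity check might look like the sketch below (generic Linux/NVIDIA commands, not SCALE-specific; they only show whether the driver is present, not why it fails):

      ```
      # Is the NVIDIA kernel module loaded at all?
      lsmod | grep nvidia

      # Which kernel driver is bound to the GPU? (10de is NVIDIA's PCI vendor ID)
      lspci -nnk -d 10de:

      # Does the userspace driver stack see the card?
      nvidia-smi
      ```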


      ```
      truenas# k get pods --all-namespaces
      NAMESPACE      NAME                                          READY   STATUS             RESTARTS   AGE
      ix-traefik     svclb-traefik-624rq                           0/6     Error              60         45h
      kube-system    openebs-zfs-controller-0                      0/5     Error              310        24d
      ix-traefik     svclb-traefik-udp-2bksj                       1/1     Running            11         45h
      kube-system    openebs-zfs-node-x5xk6                        2/2     Running            126        24d
      ix-traefik     traefik-58747b4586-qbdmn                      1/1     Running            11         45h
      kube-system    coredns-854c77959c-pk97r                      1/1     Running            63         24d
      kube-system    nvidia-device-plugin-daemonset-crrs8          0/1     CrashLoopBackOff   7198       24d
      ix-handbrake   handbrake-86d9c85cd7-7hfx2                    0/1     Completed          3          16h
      ix-collabora   collabora-collabora-online-759dbc6c5c-64rn5   1/1     Running            11         45h

      ```

      ```
      truenas# k describe pod nvidia-device-plugin-daemonset-crrs8 -n kube-system
      Name:                 nvidia-device-plugin-daemonset-crrs8
      Namespace:            kube-system
      Priority:             2000001000
      Priority Class Name:  system-node-critical
      Node:                 ix-truenas/10.10.10.230
      Start Time:           Thu, 25 Mar 2021 17:47:25 +0200
      Labels:               controller-revision-hash=5fc7948cb6
                            name=nvidia-device-plugin-ds
                            pod-template-generation=1
      Annotations:          k8s.v1.cni.cncf.io/network-status:
                              [{
                                  "name": "",
                                  "interface": "eth0",
                                  "ips": [
                                      "172.16.0.2"
                                  ],
                                  "mac": "e6:66:1a:3f:76:7e",
                                  "default": true,
                                  "dns": {}
                              }]
                            k8s.v1.cni.cncf.io/networks-status:
                              [{
                                  "name": "",
                                  "interface": "eth0",
                                  "ips": [
                                      "172.16.0.2"
                                  ],
                                  "mac": "e6:66:1a:3f:76:7e",
                                  "default": true,
                                  "dns": {}
                              }]
                            scheduler.alpha.kubernetes.io/critical-pod:
      Status:               Running
      IP:                   172.16.0.2
      IPs:
        IP:           172.16.0.2
      Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
      Containers:
        nvidia-device-plugin-ctr:
          Container ID:   docker://d70d0ea958e5f29f20b4abeae84720d539b514e33d0ddd08eb7f22371a756c37
          Image:          nvidia/k8s-device-plugin:1.0.0-beta6
          Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:00c700122ebc5533e87bf2df193f457d2c2ee37a4a97999466a9a388617cb16b
          Port:           <none>
          Host Port:      <none>
          State:          Waiting
            Reason:       RunContainerError
          Last State:     Terminated
            Reason:       ContainerCannotRun
            Message:      OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
            Exit Code:    128
            Started:      Sun, 18 Apr 2021 19:14:28 +0300
            Finished:     Sun, 18 Apr 2021 19:14:28 +0300
          Ready:          False
          Restart Count:  7198
          Environment:    <none>
          Mounts:
            /var/lib/kubelet/device-plugins from device-plugin (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from default-token-d6cb9 (ro)
      Conditions:
        Type              Status
        Initialized       True
        Ready             False
        ContainersReady   False
        PodScheduled      True
      Volumes:
        device-plugin:
          Type:          HostPath (bare host directory volume)
          Path:          /var/lib/kubelet/device-plugins
          HostPathType:
        default-token-d6cb9:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  default-token-d6cb9
          Optional:    false
      QoS Class:       BestEffort
      Node-Selectors:  <none>
      Tolerations:     CriticalAddonsOnly op=Exists
                       node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                       node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                       node.kubernetes.io/not-ready:NoExecute op=Exists
                       node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                       node.kubernetes.io/unreachable:NoExecute op=Exists
                       node.kubernetes.io/unschedulable:NoSchedule op=Exists
                       nvidia.com/gpu:NoSchedule op=Exists
      Events:
        Type     Reason           Age                      From     Message
        ----     ------           ----                     ----     -------
        Warning  BackOff          7m1s (x1634 over 6h11m)  kubelet  Back-off restarting failed container
        Warning  FailedMount      57s                      kubelet  MountVolume.SetUp failed for volume "default-token-d6cb9" : failed to sync secret cache: timed out waiting for the condition
        Warning  NetworkNotReady  55s (x3 over 58s)        kubelet  network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
        Normal   SandboxChanged   53s                      kubelet  Pod sandbox changed, it will be killed and re-created.
        Normal   AddedInterface   40s                      multus   Add eth0 [172.16.0.2/16]
        Normal   Pulled           23s (x2 over 40s)        kubelet  Container image "nvidia/k8s-device-plugin:1.0.0-beta6" already present on machine
        Normal   Created          9s (x2 over 25s)         kubelet  Created container nvidia-device-plugin-ctr
        Warning  Failed           7s (x2 over 23s)         kubelet  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
      ```
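
      Because the hook fails before the plugin process ever starts, `k logs` on the pod is unlikely to show anything useful; the NVIDIA container tooling itself can be asked what it sees. A hedged follow-up sketch (standard nvidia-container-toolkit commands; output on SCALE may differ), keeping in mind that the Quadro 2000 is a Fermi-generation card, which NVIDIA dropped after the legacy 390.xx driver branch, so the bundled driver may simply never bind to it:

      ```
      # Same NVML path the failing hook uses: should report the driver and GPU if the stack is healthy
      nvidia-container-cli info

      # Which driver version ships with this SCALE build? Fermi needs the legacy 390.xx branch.
      modinfo nvidia | grep ^version
      ```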

       
