Behaves as Intended
Details
Assignee
Triage Team
Reporter
Stavros Kois
Labels
Impact
Low
Components
Fix versions
Affects versions
Priority
Katalon Platform
Created April 18, 2021 at 4:21 PM
Updated July 1, 2022 at 5:25 PM
Resolved April 22, 2021 at 3:48 AM
The pod nvidia-device-plugin-daemonset stays in CrashLoopBackOff.
I have an old NVIDIA GPU in the test machine, so I don't know whether that is the cause.
My GPU: VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1)
TNS version: TrueNAS-SCALE-21.04-MASTER-20210418-092917
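The `nvml error: driver not loaded` message in the container status below usually means the NVIDIA kernel module never loaded on the host itself. A hypothetical host-side sanity check, not part of the original report (the commands are standard, but whether they succeed depends on the machine):

```shell
# 1) Is an nvidia kernel module loaded on the host?
if lsmod 2>/dev/null | grep -q '^nvidia'; then
    echo "nvidia module loaded"
else
    echo "nvidia module NOT loaded"
fi

# 2) Can NVML initialize? nvidia-smi fails in the same way the
#    device plugin's prestart hook does when the driver is missing.
nvidia-smi 2>/dev/null || echo "NVML init failed (matches the pod error)"
```

If the module is not loaded, the device plugin pod cannot work regardless of its own configuration, so that is worth ruling out before debugging Kubernetes.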
```
truenas# k get pods --all-namespaces
NAMESPACE      NAME                                          READY   STATUS             RESTARTS   AGE
ix-traefik     svclb-traefik-624rq                           0/6     Error              60         45h
kube-system    openebs-zfs-controller-0                      0/5     Error              310        24d
ix-traefik     svclb-traefik-udp-2bksj                       1/1     Running            11         45h
kube-system    openebs-zfs-node-x5xk6                        2/2     Running            126        24d
ix-traefik     traefik-58747b4586-qbdmn                      1/1     Running            11         45h
kube-system    coredns-854c77959c-pk97r                      1/1     Running            63         24d
kube-system    nvidia-device-plugin-daemonset-crrs8          0/1     CrashLoopBackOff   7198       24d
ix-handbrake   handbrake-86d9c85cd7-7hfx2                    0/1     Completed          3          16h
ix-collabora   collabora-collabora-online-759dbc6c5c-64rn5   1/1     Running            11         45h
```
```
truenas# k describe pod nvidia-device-plugin-daemonset-crrs8 -n kube-system
Name:                 nvidia-device-plugin-daemonset-crrs8
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ix-truenas/10.10.10.230
Start Time:           Thu, 25 Mar 2021 17:47:25 +0200
Labels:               controller-revision-hash=5fc7948cb6
                      name=nvidia-device-plugin-ds
                      pod-template-generation=1
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "172.16.0.2"
                            ],
                            "mac": "e6:66:1a:3f:76:7e",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "172.16.0.2"
                            ],
                            "mac": "e6:66:1a:3f:76:7e",
                            "default": true,
                            "dns": {}
                        }]
                      scheduler.alpha.kubernetes.io/critical-pod:
Status:               Running
IP:                   172.16.0.2
IPs:
  IP:  172.16.0.2
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://d70d0ea958e5f29f20b4abeae84720d539b514e33d0ddd08eb7f22371a756c37
    Image:          nvidia/k8s-device-plugin:1.0.0-beta6
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:00c700122ebc5533e87bf2df193f457d2c2ee37a4a97999466a9a388617cb16b
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       RunContainerError
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
      Exit Code:    128
      Started:      Sun, 18 Apr 2021 19:14:28 +0300
      Finished:     Sun, 18 Apr 2021 19:14:28 +0300
    Ready:          False
    Restart Count:  7198
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-d6cb9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  default-token-d6cb9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-d6cb9
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason           Age                      From     Message
  ----     ------           ----                     ----     -------
  Warning  BackOff          7m1s (x1634 over 6h11m)  kubelet  Back-off restarting failed container
  Warning  FailedMount      57s                      kubelet  MountVolume.SetUp failed for volume "default-token-d6cb9" : failed to sync secret cache: timed out waiting for the condition
  Warning  NetworkNotReady  55s (x3 over 58s)        kubelet  network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
  Normal   SandboxChanged   53s                      kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   AddedInterface   40s                      multus   Add eth0 [172.16.0.2/16]
  Normal   Pulled           23s (x2 over 40s)        kubelet  Container image "nvidia/k8s-device-plugin:1.0.0-beta6" already present on machine
  Normal   Created          9s (x2 over 25s)         kubelet  Created container nvidia-device-plugin-ctr
  Warning  Failed           7s (x2 over 23s)         kubelet  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
```