TrueNAS / NAS-110249

nvidia-device-plugin-daemon CrashLoopBackOff


Details

    • Priority: Low

    Description

      The pod nvidia-device-plugin-daemonset-crrs8 stays in CrashLoopBackOff.
      I have an old NVIDIA GPU in the test machine, so I don't know whether that is the cause here.
      My GPU (as reported by lspci): VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1)
      TrueNAS SCALE version: TrueNAS-SCALE-21.04-MASTER-20210418-092917
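
      Since the container error below says the NVML driver is not loaded, a minimal host-side sanity check might look like the sketch below (generic Linux/NVIDIA commands, not SCALE-specific; they only show whether the driver is present, not why it fails):

      ```
      # Is the NVIDIA kernel module loaded at all?
      lsmod | grep nvidia

      # Which kernel driver is bound to the GPU? (10de is NVIDIA's PCI vendor ID)
      lspci -nnk -d 10de:

      # Does the userspace driver stack see the card?
      nvidia-smi
      ```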


      ```
      truenas# k get pods --all-namespaces
      NAMESPACE      NAME                                          READY   STATUS             RESTARTS   AGE
      ix-traefik     svclb-traefik-624rq                           0/6     Error              60         45h
      kube-system    openebs-zfs-controller-0                      0/5     Error              310        24d
      ix-traefik     svclb-traefik-udp-2bksj                       1/1     Running            11         45h
      kube-system    openebs-zfs-node-x5xk6                        2/2     Running            126        24d
      ix-traefik     traefik-58747b4586-qbdmn                      1/1     Running            11         45h
      kube-system    coredns-854c77959c-pk97r                      1/1     Running            63         24d
      kube-system    nvidia-device-plugin-daemonset-crrs8          0/1     CrashLoopBackOff   7198       24d
      ix-handbrake   handbrake-86d9c85cd7-7hfx2                    0/1     Completed          3          16h
      ix-collabora   collabora-collabora-online-759dbc6c5c-64rn5   1/1     Running            11         45h

      ```

      ```
      truenas# k describe pod nvidia-device-plugin-daemonset-crrs8 -n kube-system
      Name:                 nvidia-device-plugin-daemonset-crrs8
      Namespace:            kube-system
      Priority:             2000001000
      Priority Class Name:  system-node-critical
      Node:                 ix-truenas/10.10.10.230
      Start Time:           Thu, 25 Mar 2021 17:47:25 +0200
      Labels:               controller-revision-hash=5fc7948cb6
                            name=nvidia-device-plugin-ds
                            pod-template-generation=1
      Annotations:          k8s.v1.cni.cncf.io/network-status:
                              [{
                                  "name": "",
                                  "interface": "eth0",
                                  "ips": [
                                      "172.16.0.2"
                                  ],
                                  "mac": "e6:66:1a:3f:76:7e",
                                  "default": true,
                                  "dns": {}
                              }]
                            k8s.v1.cni.cncf.io/networks-status:
                              [{
                                  "name": "",
                                  "interface": "eth0",
                                  "ips": [
                                      "172.16.0.2"
                                  ],
                                  "mac": "e6:66:1a:3f:76:7e",
                                  "default": true,
                                  "dns": {}
                              }]
                            scheduler.alpha.kubernetes.io/critical-pod:
      Status:               Running
      IP:                   172.16.0.2
      IPs:
        IP:           172.16.0.2
      Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
      Containers:
        nvidia-device-plugin-ctr:
          Container ID:   docker://d70d0ea958e5f29f20b4abeae84720d539b514e33d0ddd08eb7f22371a756c37
          Image:          nvidia/k8s-device-plugin:1.0.0-beta6
          Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:00c700122ebc5533e87bf2df193f457d2c2ee37a4a97999466a9a388617cb16b
          Port:           <none>
          Host Port:      <none>
          State:          Waiting
            Reason:       RunContainerError
          Last State:     Terminated
            Reason:       ContainerCannotRun
            Message:      OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
            Exit Code:    128
            Started:      Sun, 18 Apr 2021 19:14:28 +0300
            Finished:     Sun, 18 Apr 2021 19:14:28 +0300
          Ready:          False
          Restart Count:  7198
          Environment:    <none>
          Mounts:
            /var/lib/kubelet/device-plugins from device-plugin (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from default-token-d6cb9 (ro)
      Conditions:
        Type              Status
        Initialized       True
        Ready             False
        ContainersReady   False
        PodScheduled      True
      Volumes:
        device-plugin:
          Type:          HostPath (bare host directory volume)
          Path:          /var/lib/kubelet/device-plugins
          HostPathType:
        default-token-d6cb9:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  default-token-d6cb9
          Optional:    false
      QoS Class:       BestEffort
      Node-Selectors:  <none>
      Tolerations:     CriticalAddonsOnly op=Exists
                       node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                       node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                       node.kubernetes.io/not-ready:NoExecute op=Exists
                       node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                       node.kubernetes.io/unreachable:NoExecute op=Exists
                       node.kubernetes.io/unschedulable:NoSchedule op=Exists
                       nvidia.com/gpu:NoSchedule op=Exists
      Events:
        Type     Reason           Age                      From     Message
        ----     ------           ----                     ----     -------
        Warning  BackOff          7m1s (x1634 over 6h11m)  kubelet  Back-off restarting failed container
        Warning  FailedMount      57s                      kubelet  MountVolume.SetUp failed for volume "default-token-d6cb9" : failed to sync secret cache: timed out waiting for the condition
        Warning  NetworkNotReady  55s (x3 over 58s)        kubelet  network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
        Normal   SandboxChanged   53s                      kubelet  Pod sandbox changed, it will be killed and re-created.
        Normal   AddedInterface   40s                      multus   Add eth0 [172.16.0.2/16]
        Normal   Pulled           23s (x2 over 40s)        kubelet  Container image "nvidia/k8s-device-plugin:1.0.0-beta6" already present on machine
        Normal   Created          9s (x2 over 25s)         kubelet  Created container nvidia-device-plugin-ctr
        Warning  Failed           7s (x2 over 23s)         kubelet  Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
      ```
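
      Because the hook fails before the plugin process ever starts, `k logs` on the pod is unlikely to show anything useful; the NVIDIA container tooling itself can be asked what it sees. A hedged follow-up sketch (standard nvidia-container-toolkit commands; output on SCALE may differ), keeping in mind that the Quadro 2000 is a Fermi-generation card, which NVIDIA dropped after the legacy 390.xx driver branch, so the bundled driver may simply never bind to it:

      ```
      # Same NVML path the failing hook uses: should report the driver and GPU if the stack is healthy
      nvidia-container-cli info

      # Which driver version ships with this SCALE build? Fermi needs the legacy 390.xx branch.
      modinfo nvidia | grep ^version
      ```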

       
