Project: FreeNAS / TrueNAS, issue NAS-101427

Improve NVMe timeout handling




      • Chassis: Supermicro AS-1113S-WN10RT
      • Mainboard: H11SSW-NT
      • Disk drives: 6x Intel SSDPE2KX010T8

      With FreeNAS 11.2-U3, as soon as there are more than four of these drives in the system, any moderate write load on the drives leads to errors like this:

      Apr 12 13:42:16 freenas01 nvme6: aborting outstanding i/o
      Apr 12 13:42:16 freenas01 nvme6: WRITE sqid:1 cid:117 nsid:1 lba:981825104 len:176
      Apr 12 13:42:16 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:117 cdw0:0
      Apr 12 13:42:49 freenas01 nvme6: resetting controller
      Apr 12 13:42:50 freenas01 nvme6: aborting outstanding i/o
      Apr 12 13:42:50 freenas01 nvme6: WRITE sqid:1 cid:127 nsid:1 lba:984107936 len:96
      Apr 12 13:42:50 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:127 cdw0:0
      Apr 12 13:43:35 freenas01 nvme6: resetting controller

      In a discussion on the freebsd-stable mailing list, we came to suspect that the NVMe driver in FreeBSD 11 misses the completion interrupts the device issues when it finishes a command, and then runs into timeouts.

      This leads to the system becoming unresponsive.

      Tests with plain FreeBSD (without FreeNAS) show that 11-STABLE exhibits the problem while 12-STABLE does not.

      All hardware components have the latest BIOS/firmware as provided by the vendor.

      There have been substantial changes to the NVMe subsystem in FreeBSD >= 12. They initially targeted endianness problems on e.g. Sparc64, but code that specifically deals with missed interrupts was also added to nvme_timeout() in nvme_qpair.c. With a 12-STABLE kernel my system logs a "Missing interrupt" message every half hour or so under synthetic write load, but otherwise runs stably. An 11-STABLE system hangs seconds after I start my "dd" jobs.

      More details can be found in the added links.

      Kind regards,


          Issue Links



              People: Alexander Motin (mav), Patrick M. Hausen (pmh)


