- Chassis: Supermicro AS-1113S-WN10RT
- Mainboard: H11SSW-NT
- Disk drives: 6x Intel SSDPE2KX010T8
With FreeNAS 11.2-U3, as soon as more than 4 of these drives are installed in the system, any moderate write load on the drives leads to errors like this:
Apr 12 13:42:16 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:16 freenas01 nvme6: WRITE sqid:1 cid:117 nsid:1 lba:981825104 len:176
Apr 12 13:42:16 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:117 cdw0:0
Apr 12 13:42:49 freenas01 nvme6: resetting controller
Apr 12 13:42:50 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:50 freenas01 nvme6: WRITE sqid:1 cid:127 nsid:1 lba:984107936 len:96
Apr 12 13:42:50 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:127 cdw0:0
Apr 12 13:43:35 freenas01 nvme6: resetting controller
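To compare how often each controller is affected, messages like the ones above can be tallied per device. The following is only a minimal sketch: the sample text is copied from the log excerpt in this report, while the regular expression and the function name `summarize` are my own assumptions about the syslog layout, not anything provided by FreeBSD or FreeNAS.

```python
import re
from collections import Counter

# Sample lines copied verbatim from the log excerpt in this report.
SAMPLE_LOG = """\
Apr 12 13:42:16 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:16 freenas01 nvme6: WRITE sqid:1 cid:117 nsid:1 lba:981825104 len:176
Apr 12 13:42:16 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:117 cdw0:0
Apr 12 13:42:49 freenas01 nvme6: resetting controller
Apr 12 13:42:50 freenas01 nvme6: aborting outstanding i/o
Apr 12 13:42:50 freenas01 nvme6: WRITE sqid:1 cid:127 nsid:1 lba:984107936 len:96
Apr 12 13:42:50 freenas01 nvme6: ABORTED - BY REQUEST (00/07) sqid:1 cid:127 cdw0:0
Apr 12 13:43:35 freenas01 nvme6: resetting controller
"""

# Assumed layout: "<Mon> <day> <hh:mm:ss> <host> nvmeN: <message>"
LINE_RE = re.compile(r"^\w{3}\s+\d+ [\d:]{8} \S+ (nvme\d+): (.*)$")


def summarize(log_text):
    """Count controller resets and aborted commands per NVMe device."""
    resets = Counter()
    aborted = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an nvme message in the assumed layout
        dev, msg = match.groups()
        if msg == "resetting controller":
            resets[dev] += 1
        elif msg.startswith("ABORTED - BY REQUEST"):
            aborted[dev] += 1
    return resets, aborted


resets, aborted = summarize(SAMPLE_LOG)
for dev in sorted(set(resets) | set(aborted)):
    print(f"{dev}: {resets[dev]} controller resets, {aborted[dev]} aborted commands")
```

Feeding the full contents of the system log to `summarize` instead of the sample would show whether the problem is limited to particular controllers or spread across all of them.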
In a discussion on freebsd-stable we came to suspect that the NVMe driver in FreeBSD 11 misses completion interrupts issued by the device when finishing a task and then runs into timeouts.
This leads to the system becoming unresponsive.
Tests with plain FreeBSD without FreeNAS show that 11-STABLE does exhibit the problem while 12-STABLE doesn't.
All hardware components have the latest BIOS/firmware as provided by the vendor.
There have been substantial changes in the NVMe subsystem in FreeBSD >=12, initially targeting endianness problems on e.g. Sparc64, but code that specifically deals with missed interrupts was also added to nvme_timeout() in nvme_qpair.c. With a 12-STABLE kernel my system logs a "Missing interrupt" message about every half hour under synthetic write load but otherwise runs stably. An 11-STABLE system hangs seconds after I start my "dd" jobs.
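To illustrate why handling missed interrupts in the timeout path makes such a difference, here is a toy model in Python. It is not the FreeBSD driver code: the class `ToyQueuePair` and its methods are hypothetical names, and the logic is only my simplified reading of the idea (check the completion queue before resetting), assuming the device did finish the command and only the interrupt was lost.

```python
class ToyQueuePair:
    """Toy stand-in for an NVMe queue pair (hypothetical, not the FreeBSD structure)."""

    def __init__(self):
        self.outstanding = set()    # command ids submitted but not yet completed
        self.completion_queue = []  # completions posted by the simulated device
        self.log = []

    def submit(self, cid):
        self.outstanding.add(cid)

    def device_completes(self, cid):
        # The device posts a completion; the interrupt that should
        # announce it is assumed to be lost in this model.
        self.completion_queue.append(cid)

    def process_completions(self):
        """Drain the completion queue and return how many entries were handled."""
        handled = 0
        while self.completion_queue:
            self.outstanding.discard(self.completion_queue.pop(0))
            handled += 1
        return handled

    def timeout_reset_only(self, cid):
        # Simplified "no recovery" path: treat the command as hung,
        # abort everything and reset the controller.
        self.log.append("resetting controller")
        self.completion_queue.clear()
        self.outstanding.clear()

    def timeout_poll_first(self, cid):
        # Recovery path: look at the completion queue before giving up.
        # If the command had in fact completed, only the interrupt was missed.
        if self.process_completions() > 0 and cid not in self.outstanding:
            self.log.append("Missing interrupt")
            return
        self.timeout_reset_only(cid)


def run(handler_name):
    qpair = ToyQueuePair()
    qpair.submit(117)
    qpair.device_completes(117)      # completed, but no interrupt arrives
    getattr(qpair, handler_name)(117)  # the command timer fires
    return qpair.log


print(run("timeout_reset_only"))
print(run("timeout_poll_first"))
```

In this model the first handler resets the controller even though the command had already finished, while the second only records the missed interrupt and carries on, which matches the difference in behaviour I see between 11-STABLE and 12-STABLE.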
More details can be found in the added links.