  1. TrueNAS
  2. NAS-106992

Middleware and UI waiting for L2ARC rebuild greatly increases failover time

    Details

    • Impact:
      Low

      Description

      142 HDDs in 71 mirror VDEVs

      4x NVMe L2ARC

      A long-running, read-heavy workload had loaded over 4 TiB of data into L2ARC. Upon failover, the following message appeared on the serial console, and the newly elected active controller did not display the UI. During the L2ARC rebuild, one CPU core per NVMe device was pegged at 100%, so the rebuild appears to be CPU-bound.

      ctl_ha_role_sysctl: CTL_LUNREQ_MODIFY returned 3 'no file argument specified'

      Once the rebooted controller had fully booted, we saw the following message:

      carp: 20@ntb0: MASTER -> BACKUP (more frequent advertisement received)

      After 33 minutes, the UI came up and the import succeeded with no further problems. However, ZFS is ready to service I/O long before the L2ARC rebuild completes, so we believe middleware should not block on the full rebuild, given how large an L2ARC can be.
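
      The requested behavior can be illustrated with a minimal sketch. It is not the TrueNAS middleware code: import_pool, wait_for_l2arc_rebuild, and failover are hypothetical stand-ins, and the sleeps only simulate the durations involved. The idea shown is that failover proceeds as soon as the pool import returns, while the L2ARC rebuild is tracked in a background task.

```python
import asyncio


async def import_pool(pool: str) -> None:
    """Hypothetical stand-in: returns once the pool can service I/O."""
    await asyncio.sleep(0.01)


async def wait_for_l2arc_rebuild(pool: str, seconds: float) -> None:
    """Hypothetical stand-in: returns once the L2ARC rebuild has finished."""
    await asyncio.sleep(seconds)


async def failover(pool: str, rebuild_seconds: float) -> list[str]:
    events: list[str] = []

    # Block only on the import itself, not on the cache rebuild.
    await import_pool(pool)
    events.append("pool imported")

    async def monitor() -> None:
        await wait_for_l2arc_rebuild(pool, rebuild_seconds)
        events.append("l2arc rebuild complete")

    # Track the rebuild in the background instead of awaiting it inline.
    rebuild_task = asyncio.create_task(monitor())

    # Services and the UI come up while the rebuild is still running.
    events.append("services and UI started")

    # Awaited here only so the demonstration exits cleanly.
    await rebuild_task
    return events


if __name__ == "__main__":
    print(asyncio.run(failover("tank", 0.05)))
```

      In this sketch the "services and UI started" event is recorded before the rebuild completes, which is the ordering the report asks for. A real fix would depend on how middleware detects import completion.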

        Attachments

        1. debug-tn11a-20200729124926.tar
          1.91 MB
        2. image (8).png
          41 kB
        3. procstatlog.txt
          8.03 MB
        4. Screen Shot 2020-07-30 at 3.12.41 PM.png
          148 kB
        5. toplog.txt
          59 kB

                People

                Assignee:
                mav Alexander Motin
                Reporter:
                rmckenzie Ryan McKenzie
                Votes:
                0
                Watchers:
                4
