TrueNAS / NAS-107364

Scrub causes system "catatonic", apparently due to extreme CPU starvation


    Details

    • Impact:
      High

      Description

      This is a confirmed issue that has now come up in two or three bug reports, and I've been told it is worth opening a dedicated issue for (NAS-107256, William Grzybowski; NAS-107036, Alexander Motin). It's also covered in a forum request for help.

      I do need to reproduce the issue a bit more, and get a debug file etc. when it happens; I will do so in the next day or two. For example, I want to rule out whether any tunables trigger it. For now I will summarise the data and reports from the four different tickets plus the forum thread where this is confirmed or likely to be the underlying issue.

      I have reattached all relevant files to this issue for convenience. I will also upload a video showing console behaviour as simple commands are typed.

       

      System behaviour (from forum thread):

      Description copied from "Correct tunable to reduce out of control scrub CPU usage (z_rd_int)?":

       

      I've got a 4 core CPU + 256 GB RAM, and scrub on 2 pools is leaving almost no resources at all for user activity. SSH and Web UI sessions can't connect, the shell prompt and console key presses take around 3~12 seconds to appear after keys are pressed, and nice --5 doesn't seem to help. It's not RAM starvation, as top says it has 115 GB free.

      Pausing the scrub fixes it immediately, but I need scrub to be able to work in the background, so it's not really a solution.
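      For reference, the pause/resume workaround described above is just the standard zpool scrub controls; a minimal sketch, assuming a hypothetical pool name "tank" (these commands require a live pool, so treat them as an admin fragment rather than something runnable here):

      ```shell
      # Pause an in-progress scrub; this is what immediately restored responsiveness.
      zpool scrub -p tank

      # Issuing scrub again later resumes from where the paused scrub left off.
      zpool scrub tank

      # Check scrub state/progress.
      zpool status tank
      ```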

      Top says it's related to z_rd_int
      [see screenshot attached to this issue], and there are threads on Google saying that indeed this can max out the CPU. But I can't find a thread that says, specifically on FreeBSD, how to tune it back under control. The only relevant tunables I can find are vdev.scrub_max_active (the number of simultaneous I/Os that scrub can issue, since z_rd_int sounds like scrub I/O) and sync_taskq_batch_pct, which controls the maximum % of CPU the queue can use. They don't seem to be helping.
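      To see the individual z_rd_int taskq threads referred to above, FreeBSD's top can be asked to show kernel threads separately; a small sketch (not from the original report, and only meaningful on a live FreeBSD/TrueNAS system):

      ```shell
      # -S includes system (kernel) processes, -H shows each thread on its own line,
      # making the individual z_rd_int worker threads and their CPU use visible.
      top -SH
      ```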

      So I swapped the 4-core Xeon for an 8-core Xeon, and immediately scrub now wants to consume 100% of 8 cores instead of 100% of 4 cores. Great win!

      The server is otherwise totally idle - no local tasks, no client use - apart from a console session and the scrub. I've tried every tunable I can find that could bring scrub under control without affecting other operations. Right now I'm at:

      vfs.zfs.vdev.scrub_min_active=0 (down from 1, essentially allowing it to do nothing)
      vfs.zfs.vdev.scrub_max_active=1 (down from 2)
      vfs.zfs.scan_vdev_limit=50000 (down from 40M)
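      For context, the tunable values above can also be applied at runtime via sysctl(8) on FreeBSD; a sketch (the values are the reporter's experimental settings, not recommendations, and require a live system):

      ```shell
      # Apply the same three tunables at runtime; on TrueNAS these would normally
      # be added as sysctl-type Tunables in the UI so they persist across reboots.
      sysctl vfs.zfs.vdev.scrub_min_active=0
      sysctl vfs.zfs.vdev.scrub_max_active=1
      sysctl vfs.zfs.scan_vdev_limit=50000
      ```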

      and it's still using 100% of an 8-core Xeon v4 for scrub, to the point that the Web UI, SSH, and even keystrokes in the console fail or are greatly delayed. Probably because the scrub tunables focus on I/O, not the implied CPU hashing/checksumming demands?

      sretalla tentatively suggested that the known firmware issue with LSI 9300 HBAs locking up might be relevant, but there isn't a version of the special xx.12 firmware for my card (9305-24i), and in any case the issue only appears to occur during scrub, not during other disk activity.

      No such problem ever arose with 11.3 on the same hardware with 4 cores.

      Files: screenshot of "top" showing CPU, RAM, and many z_rd_int's

       

      FROM NAS-107138

      Reported that when attempting simple commands that traversed /.zfs/snapshot/ directories, the system experienced extreme slowdown, to the point that I could focus a camera and photograph the system "catatonic" between typing "ls [ENTER]" and any output appearing on the console.

      This issue occurred very soon after another issue involving zfs hold limits being reached (NAS-107049), which Ryan Moeller traced to inadequate nvlist sizing and which is now fixed for 12-RC1. Because of that, and because the symptom happened when traversing a snapshot directory, I initially interpreted the issue as a system stall/lock of some kind caused by waiting on holds, since the scenario involved traversing a snapshot, which of course requires internal holds. But NAS-107138 also makes clear that one or more scrubs were in progress in the background, so I now think the extreme slowness may have been induced by the background scrubs, not by hold issues. Having seen other cases where this happened, the symptoms in NAS-107138 are just too close to other issues where extreme slowness occurred during scrubbing on 12-BETA1/2.

      I ran a dtrace script given to me in NAS-107049, related to holds/nvlist, in the hope it would produce helpful clarification. It probably didn't instrument exactly what was needed, but it was better than nothing.

      Alexander Motin commented on the findings: 

      "I am curious, how many snapshots do you have there?  Can it be that it is just reasonably busy?  Have you looked what CPU and disks are doing during that?  I see the dtrace script logs many spa_deadman() timeouts, that may mean pool I/O being blocked by something or being extremely slow."

      I commented that the system had 20k snaps, now reduced to about 6k. On 11.3 I had around 80k~120k snaps and no similar issues at all (the pool had been snapshotted at 15-minute intervals and I hadn't got round to destroying them for many weeks!). The system is also heavily specced and not busy, and other heavy use on 12-BETA2, such as local replication, hammers the pools but doesn't show any slowdown to the user.
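      For anyone reproducing this, the snapshot counts mentioned can be checked with a one-liner (hypothetical pool name "tank"; requires a live pool):

      ```shell
      # Count all snapshots under a pool; -H drops headers so wc -l is exact.
      zfs list -H -t snapshot -r tank | wc -l
      ```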

      Files:  dtrace, "hung console" screenshot of "ls" hanging

       

      FROM NAS-107256

      The symptom was a complete cluster of middleware/services falling over badly, stalling, or otherwise becoming unresponsive, to the point that I had to reboot. I initially thought no scrub was active, but when prompted I rechecked and found I had paused a scrub 2 hours after the event (I'd been pausing and unpausing scrub whenever I needed the server responsive!), so it was indeed possible that a scrub had been running when the symptoms happened.

      William Grzybowski commented that the pool client-server issue was believed fixed for RC1, but that:

      "My current theory is that these concurrent scrubs are causing middleware to start misbehave and create a cascade effect."

      Files: debug file

       

      FROM NAS-107036

      Little of extra value. Essentially covers identical ground to the forum post and the fixed middleware problem I just mentioned.

      Alexander Motin commented, in regard to the scrub-related catatonic behaviour (described already in the forum post above), that:

      "it looks like worth reproduction/investigation. I don't remember complains like that."

      Files: Web UI screenshot, various logs and such (couldn't produce a debug file)

       

      FROM NAS-107173

      I don't know if this is relevant or not, but it may be worth review. There were several repeated kernel panics at a time when the system was totally idle apart from heavy ongoing scrub activity. Crashdumps were not produced, but I caught console panic tracebacks of two of these panics on my phone camera; I missed the others.

      Alexander Motin initially saw little in the data that was helpful for troubleshooting, as the two available panic screenshots didn't seem to have much in common. He suggested checking hardware as a first step (done). But he then noticed they both involved breakpoints, and indeed dtrace had been in use recently. That could be a red herring, or indeed a distinct issue unrelated to the scrub pathology; I can't tell. But deep kernel or ZFS pathology specifically during heavy scrub seems too coincidental not to at least mention.

      Alexander Motin's comments:

      "Unfortunately the only common piece between the two panics I see is callout_reset_sbt_on() function call, called from two very different places, one by the nvme(4) driver, and another from ZFS.  If those have common source, then it is not seen here.  I need some more input.  And make sure it is not a memory or otherwise hardware issue."

      "Unfortunately cases when dump is not saved are pretty often [...] Hmm. Just now I've noticed that these panics reported as: "breakpoint instruction fault while in kernel mode".  I don't think I saw that before.  I guess it may be related to DTrace. You you see any dtrace process running on your system when it happen?"

      My reply:

      Ryan Moeller and I used [dtrace] on NAS-107049, a bug where ZFS was running out of holds in 12-BETA1 far sooner than it ever should, causing really nasty hangs and failures in routine operations like send/recv - running out of holds in replications of about 1000 (1K) snaps. Ryan traced it to a too-low value for the size of nvlists passed in for ioctls. We used two dtrace lines in that issue:

      dtrace -n 'zfs-dbgmsg { print(stringof(arg0)); } set-error { stack(); }' > debug.txt

      [A second dtrace script mentioned in this issue was provided but hadn't yet been used at the time.]

      The dtrace code runs endlessly in a loop. The instruction was to run it, hit CTRL-C after the issue happens, and send the log. So if dtrace cleans up breakpoints on exit, how could any residual breakpoints from dtrace have existed some time later to cause a panic? I am confused by that. Maybe giving the exact dtrace used, above, can help work out whether it's the possible cause (similar code area/execution path?). And why would a panic also occur after the reboot that followed the first panic, when no dtrace was re-run after the reboot? Questions! Hence why I asked what else a breakpoint could be.

      In any case I've upgraded to 12-BETA2, so presumably any residue in the code has been flushed out?

      Files: Screenshots (photos) of the 2 panic dumps and of the lack of crashdumps/textdumps.

      Attachments

      People

      Assignee: mav Alexander Motin
      Reporter: Stilez
      Votes: 0
      Watchers: 8