TrueNAS / NAS-107364

Scrub causes system "catatonic", apparently due to extreme CPU starvation


    Details

    • Impact:
      High

      Description

      This is a confirmed issue that has now come up in two or three bug reports, and I've been told it is worth opening a dedicated issue for (NAS-107256, William Grzybowski; NAS-107036, Alexander Motin). It's also covered in a forum request for help.

      I do need to reproduce the issue a bit more, and get a debug file etc. when it happens; I will do so in the next day or two. For example, I want to rule out whether any tunables trigger it. For now I will summarise the data and reports from the four different tickets plus the forum thread where this is confirmed or likely to be the underlying issue.

      I have reattached all relevant files to this issue for convenience. I will also upload a video showing console behaviour as simple commands are typed.

       

      System behaviour (from forum thread):

      Description copied from "Correct tunable to reduce out of control scrub CPU usage (z_rd_int)?":

       

      I've got a 4 core CPU + 256 GB RAM, and scrub on 2 pools is leaving almost no resources at all for user activity. SSH and Web UI sessions can't connect, the shell prompt and console key presses take around 3~12 seconds to appear after keys are pressed, and nice --5 doesn't seem to help. It's not RAM starvation, as top says it has 115 GB free.

      Pausing the scrub fixes it immediately, but I need scrub to be able to work in the background, so it's not really a solution.
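      For reference, the pause/resume workaround described above is just the standard zpool scrub controls; a minimal sketch, assuming a hypothetical pool name "tank" (these commands require a live pool, so treat them as an admin fragment rather than something runnable here):

      ```shell
      # Pause an in-progress scrub; this is what immediately restored responsiveness.
      zpool scrub -p tank

      # Issuing scrub again later resumes from where the paused scrub left off.
      zpool scrub tank

      # Check scrub state/progress.
      zpool status tank
      ```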

      Top says it's related to z_rd_int
      [see screenshot attached to this issue], and there are threads on Google saying that indeed this can max out the CPU. But I can't find a thread that says, specifically on FreeBSD, how to tune it back under control. The only relevant tunables I can find are vdev.scrub_max_active (the number of simultaneous I/Os that scrub can issue, since z_rd_int sounds like scrub I/O) and sync_taskq_batch_pct, which controls the maximum % of CPU the queue can use. They don't seem to be helping.
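      To see the individual z_rd_int taskq threads referred to above, FreeBSD's top can be asked to show kernel threads separately; a small sketch (not from the original report, and only meaningful on a live FreeBSD/TrueNAS system):

      ```shell
      # -S includes system (kernel) processes, -H shows each thread on its own line,
      # making the individual z_rd_int worker threads and their CPU use visible.
      top -SH
      ```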

      So I swapped the 4-core Xeon for an 8-core Xeon, and immediately scrub now wants to consume 100% of 8 cores instead of 100% of 4 cores. Great win!

      The server is otherwise totally idle - no local tasks, no client use - apart from a console session and the scrub. I've tried every tunable I can find that could bring scrub under control without affecting other operations. Right now I'm at:

      vfs.zfs.vdev.scrub_min_active=0 (down from 1, essentially allowing it to do nothing)
      vfs.zfs.vdev.scrub_max_active=1 (down from 2)
      vfs.zfs.scan_vdev_limit=50000 (down from 40M)
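      For context, the tunable values above can also be applied at runtime via sysctl(8) on FreeBSD; a sketch (the values are the reporter's experimental settings, not recommendations, and require a live system):

      ```shell
      # Apply the same three tunables at runtime; on TrueNAS these would normally
      # be added as sysctl-type Tunables in the UI so they persist across reboots.
      sysctl vfs.zfs.vdev.scrub_min_active=0
      sysctl vfs.zfs.vdev.scrub_max_active=1
      sysctl vfs.zfs.scan_vdev_limit=50000
      ```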

      and it's still using 100% of an 8-core Xeon v4 for scrub, to the point that the Web UI, SSH, and even keystrokes in the console fail or are greatly delayed. Probably because the scrub tunables focus on I/O, not the implied CPU hashing/checksumming demands?

      sretalla tentatively suggested that the known firmware issue with LSI 9300 HBAs locking up might be relevant, but there isn't a version of the special xx.12 firmware for my card (9305-24i), and in any case the issue only appears to occur during scrub, not during other disk activity.

      No such problem ever arose with 11.3 on the same hardware with 4 cores.

      Files: screenshot of "top" showing CPU, RAM, and many z_rd_int's

       

      FROM NAS-107138

      Reported that when attempting simple commands that traversed /.zfs/snapshot/ directories, the system experienced extreme slowdown, to the point that I could focus a camera and photograph the system "catatonic" between typing "ls [ENTER]" and any output appearing on the console.

      This issue occurred very soon after another issue involving zfs hold limits being reached (NAS-107049), which Ryan Moeller traced to inadequate nvlist sizing and which is now fixed for 12-RC1. Because of that, and because the symptom happened when traversing a snapshot directory, I initially interpreted the issue as a system stall/lock of some kind caused by waiting on holds, since the scenario involved traversing a snapshot, which of course requires internal holds. But NAS-107138 also makes clear that one or more scrubs were in progress in the background, so I now think the extreme slowness may have been induced by the background scrubs, not by hold issues. Having seen other cases where this happened, the symptoms in NAS-107138 are just too close to other issues where extreme slowness occurred during scrubbing on 12-BETA1/2.

      I ran a dtrace script given to me in NAS-107049, related to holds/nvlist, in the hope it would produce helpful clarification. It probably didn't instrument exactly what was needed, but it was better than nothing.

      Alexander Motin commented on the findings: 

      "I am curious, how many snapshots do you have there?  Can it be that it is just reasonably busy?  Have you looked what CPU and disks are doing during that?  I see the dtrace script logs many spa_deadman() timeouts, that may mean pool I/O being blocked by something or being extremely slow."

      I commented that the system had 20k snaps, now reduced to about 6k. On 11.3 I had around 80k~120k snaps and no similar issues at all (the pool had been snapshotted at 15-minute intervals and I hadn't got round to destroying them for many weeks!). The system is also heavily specced and not busy, and other heavy use on 12-BETA2, such as local replication, hammers the pools but doesn't show any slowdown to the user.
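      For anyone reproducing this, the snapshot counts mentioned can be checked with a one-liner (hypothetical pool name "tank"; requires a live pool):

      ```shell
      # Count all snapshots under a pool; -H drops headers so wc -l is exact.
      zfs list -H -t snapshot -r tank | wc -l
      ```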

      Files:  dtrace, "hung console" screenshot of "ls" hanging

       

      FROM NAS-107256

      The symptom was a complete cluster of middleware/services falling over badly, stalling, or otherwise becoming unresponsive, to the point that I had to reboot. I initially thought no scrub was active, but when prompted I rechecked and found I had paused a scrub 2 hours after the event (I'd been pausing and unpausing scrub whenever I needed the server responsive!), so it was indeed possible that a scrub had been running when the symptoms happened.

      William Grzybowski commented that the pool client-server issue was believed fixed for RC1, but that:

      "My current theory is that these concurrent scrubs are causing middleware to start misbehave and create a cascade effect."

      Files: debug file

       

      FROM NAS-107036

      Little of extra value. Essentially covers identical ground to the forum post and the fixed middleware problem I just mentioned.

      Alexander Motin commented, in regard to the scrub-related catatonic behaviour (described already in the forum post above), that:

      "it looks like worth reproduction/investigation. I don't remember complains like that."

      Files: Web UI screenshot, various logs and such (couldn't produce a debug file)

       

      FROM NAS-107173

      I don't know if this is relevant or not, but it may be worth review. There were several repeated kernel panics at a time when the system was totally idle apart from heavy ongoing scrub activity. Crashdumps were not produced, but I caught console panic tracebacks of two of these panics on my phone camera; I missed the others.

      Alexander Motin initially saw little in the data that was helpful for troubleshooting, as the two available panic screenshots didn't seem to have much in common. He suggested checking hardware as a first step (done). But he then noticed they both involved breakpoints, and indeed dtrace had been in use recently. That could be a red herring, or indeed a distinct issue unrelated to the scrub pathology; I can't tell. But deep kernel or ZFS pathology specifically during heavy scrub seems too coincidental not to at least mention.

      Alexander Motin's comments:

      "Unfortunately the only common piece between the two panics I see is callout_reset_sbt_on() function call, called from two very different places, one by the nvme(4) driver, and another from ZFS.  If those have common source, then it is not seen here.  I need some more input.  And make sure it is not a memory or otherwise hardware issue."

      "Unfortunately cases when dump is not saved are pretty often [...] Hmm. Just now I've noticed that these panics reported as: "breakpoint instruction fault while in kernel mode".  I don't think I saw that before.  I guess it may be related to DTrace. You you see any dtrace process running on your system when it happen?"

      My reply:

      Ryan Moeller and I used [dtrace] on NAS-107049, a bug where ZFS was running out of holds in 12-BETA1 far sooner than it ever should, causing really nasty hangs and failures in routine operations like send/recv - running out of holds in replications of about 1000 (1K) snaps. Ryan traced it to a too-low value for the size of nvlists passed in for ioctls. We used two dtrace lines in that issue:

      dtrace -n 'zfs-dbgmsg { print(stringof(arg0)); } set-error { stack(); }' > debug.txt

      [A second dtrace script mentioned in this issue was provided but hadn't yet been used at the time.]

      The dtrace code runs endlessly in a loop. The instruction was to run it, hit CTRL-C after the issue happens, and send the log. So if dtrace cleans up breakpoints on exit, how could any residual breakpoints from dtrace have existed some time later to cause a panic? I am confused by that. Maybe giving the exact dtrace used, above, can help work out whether it's the possible cause (similar code area/execution path?). And why would a panic also occur after the reboot that followed the first panic, when no dtrace was re-run after the reboot? Questions! Hence why I asked what else a breakpoint could be.

      In any case I've upgraded to 12-BETA2, so presumably any residue in the code has been flushed out?

      Files: Screenshots (photos) of the 2 panic dumps and of the lack of crashdumps/textdumps.

      Attachments

      People

      Assignee: mav Alexander Motin
      Reporter: Stilez
      Votes: 0
      Watchers: 8