Middleware and UI waiting for L2ARC rebuild greatly increases failover time

Description

142 HDDs in 71 mirror VDEVs

4x NVMe L2ARC

A long-running, read-heavy workload had loaded over 4 TiB of data into L2ARC. During the L2ARC rebuild after failover, one CPU core per NVMe device was pegged at 100%, so the rebuild appears to be CPU bound. Upon failover, the newly elected active controller would not display the UI, and the following message appeared on the serial console:

ctl_ha_role_sysctl: CTL_LUNREQ_MODIFY returned 3 'no file argument specified'

Once the rebooted controller fully booted, we saw the following message:

carp: 20@ntb0: MASTER -> BACKUP (more frequent advertisement received)

In the end, after 33 minutes, the UI came up, the import succeeded, and all was well. However, ZFS is ready to service I/O long before the L2ARC rebuild is complete, so we believe middleware should not block on the complete rebuild, given how large L2ARC can potentially be.
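
As a rough illustration of the requested behavior (this is not the actual middleware code), the sketch below waits only for a "pool is usable" signal and lets the rebuild finish in the background. The names `import_pool` and `failover` are hypothetical, and the rebuild is simulated with a sleep.

```python
import threading
import time


def import_pool(ready: threading.Event, rebuild_done: threading.Event,
                rebuild_seconds: float) -> None:
    """Hypothetical stand-in for a pool import followed by an L2ARC rebuild."""
    # Assumption: the pool can service I/O as soon as the import itself is done.
    ready.set()
    # The L2ARC rebuild continues afterwards; simulated here with a sleep.
    time.sleep(rebuild_seconds)
    rebuild_done.set()


def failover(rebuild_seconds: float = 0.5) -> dict:
    """Bring the UI up once the pool is usable, not once the rebuild is done."""
    ready = threading.Event()
    rebuild_done = threading.Event()
    worker = threading.Thread(
        target=import_pool,
        args=(ready, rebuild_done, rebuild_seconds),
        daemon=True,
    )
    worker.start()

    # Block only until the pool is usable.
    ready.wait()
    ui_started_before_rebuild = not rebuild_done.is_set()

    # Joined here only so the sketch can report the final state.
    worker.join()
    return {
        "ui_started_before_rebuild": ui_started_before_rebuild,
        "rebuild_done": rebuild_done.is_set(),
    }
```

With this shape, the time to bring up the UI depends on the import alone, regardless of how long the rebuild takes.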

Problem/Justification

None

Impact

None

Activity

Alexander Motin October 27, 2020 at 2:27 AM

This should fix the issue: https://github.com/openzfs/zfs/pull/11116 .

Ryan McKenzie October 21, 2020 at 8:34 PM

Need to see if this issue is somehow related to or similar to in that the userland threads are being starved for memory resources...

Alexander Motin September 10, 2020 at 5:17 PM

I've reopened the ticket while we investigate further. Persistent L2ARC is disabled for now.

Alexander Motin August 21, 2020 at 7:11 PM

Let's count it as done unless more issues are found.

Alexander Motin August 21, 2020 at 5:59 PM

I've found several issues related to large L2ARC and its rebuild: https://github.com/openzfs/zfs/pull/10765 . I think those could cause the ill effects just by blocking normal ARC operation. I am not sure we really need a throttle there. The patch should be in the next internal/nightly build, so it would be good to retest and check whether we still see any problems.

Complete

Impact

Low

Created July 29, 2020 at 4:51 PM
Updated July 1, 2022 at 4:51 PM
Resolved November 12, 2020 at 2:35 PM