Locking issue in LAGG driver, while using VLANs with e1000, may result in Kernel Panic while booting

Description

Hello, one of my systems has a lagg interface with four gigabit Ethernet cards and VLANs on top of it. After a reboot, the system never came back online: it crashed during boot, just after importing the pool. This probably reduced my life expectancy, since for some hours I thought I had lost my entire pool.

But it was in fact a race condition within the lagg driver.

According to this FreeBSD Bugzilla report, there is in fact an issue in the implementation regarding the epoch section of lagg interfaces: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240609

I've been scratching my head the entire Christmas trying to find at least a workaround, since it was working without any issue on FreeNAS 11.2-RELEASE.

I finally had the idea to just boot the system with 11.2-RELEASE so the interfaces would come up and, if this was in fact a race condition, the network cards and the switch would be in a state that would not trigger the race condition in the if_lagg.c file. And in fact this idea worked. The system is now online on 12.0-U1.

But I know that if I shut down the system and remove its power, the issue will come back and I'll be unable to boot the system.
I even took the time to validate this against 12.0-RELEASE, and it is in fact affected.

After reading the TrueNAS OS code located here: https://github.com/freenas/os/blob/freenas/12-stable/sys/net/if_lagg.c

I can confirm that an epoch tracker is used by the driver right here:

#define LAGG_RLOCK() struct epoch_tracker lagg_et; epoch_enter_preempt(net_epoch_preempt, &lagg_et)
#define LAGG_RUNLOCK() epoch_exit_preempt(net_epoch_preempt, &lagg_et)
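
As a rough illustration of how this macro pair behaves, here is a minimal user-space sketch. It is not the kernel epoch(9) implementation: `struct epoch`, `struct epoch_tracker`, `mock_epoch`, the two stub functions and `demo_read_section()` are hypothetical stand-ins written only for this example, and only the two `#define` lines are taken from the ticket. The point it shows is that LAGG_RLOCK() expands to a local variable declaration plus a call, so the matching LAGG_RUNLOCK() has to appear in the same scope.

```c
/* User-space mock for illustration only; NOT the kernel epoch(9) code. */
#include <assert.h>

/* Hypothetical stand-ins for the kernel types and globals. */
struct epoch { int depth; };		/* number of open read sections */
struct epoch_tracker { int active; };	/* per-section bookkeeping */

static struct epoch mock_epoch;
static struct epoch *net_epoch_preempt = &mock_epoch;

/* Stub: mark the tracker active and count one more open section. */
static void
epoch_enter_preempt(struct epoch *e, struct epoch_tracker *et)
{
	et->active = 1;
	e->depth++;
}

/* Stub: an exit must pair with an earlier enter on the same tracker. */
static void
epoch_exit_preempt(struct epoch *e, struct epoch_tracker *et)
{
	assert(et->active);
	et->active = 0;
	e->depth--;
}

/* The two macros quoted above, unchanged. */
#define LAGG_RLOCK() struct epoch_tracker lagg_et; epoch_enter_preempt(net_epoch_preempt, &lagg_et)
#define LAGG_RUNLOCK() epoch_exit_preempt(net_epoch_preempt, &lagg_et)

/* Hypothetical reader: returns the depth observed inside the section. */
static int
demo_read_section(void)
{
	int seen;

	LAGG_RLOCK();		/* declares lagg_et in this scope, then enters */
	seen = net_epoch_preempt->depth;
	LAGG_RUNLOCK();		/* refers to lagg_et, so same scope is required */
	return (seen);
}
```

Calling `demo_read_section()` from a small `main()` is enough to exercise the pair. Because the tracker lives on the caller's stack, any code path that leaves the function between the two macros would skip the exit call in this mock.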

This if_lagg.c file seems extremely different from what we have on FreeBSD HEAD according to the Bugzilla report I mentioned in the first place, which is strange.

Since LAGG and VLAN are basic things in enterprise, I'm marking this bug as CRITICAL.

I'm attaching the crash artifact as an image and a debug output of the system if needed.

Problem/Justification

None

Impact

None

Activity

Vinícius Ferrão March 9, 2021 at 11:58 PM

Hi, I should have explicitly mentioned that I was only able to reproduce the issue on the debug kernel too. But with the living hell I was going through, I was panicking.

Looking forward to the 12.0-U3 release.

Alexander Motin March 9, 2021 at 10:43 PM

I've reproduced the issue with a debug kernel, merged the mentioned commit, and then confirmed that it solved the problem. I think we haven't seen many reports like this because not many enterprise customers using VLANs over lagg over 1Gb/s Intel NICs are running a debug kernel, while the non-debug kernel didn't panic on this, at least for me.

Kris Moore January 12, 2021 at 8:08 PM

We typically set the priority after we get a chance to really dig into the code / problem and understand the impact. I've reviewed it now and raised the priority.

Vinícius Ferrão December 28, 2020 at 7:56 PM
Edited

Thanks for confirming this. Let me ask how iX handles this. I see that you marked a fix for 12.0-U2. Will TrueNAS have the mentioned patch even if upstream FreeBSD didn't apply it? The second question is just a curiosity: how is priority defined for issues? I see that by default it's marked as low on Jira, but it hasn't changed since you screened the issue. In my understanding this issue should be almost blocking, since it surely breaks enterprise systems.

Thank you.

Alexander Motin December 28, 2020 at 2:36 PM

FreeBSD head got more epoch(9)-related changes that are not merged to 12, which probably explains the difference in the code you see. But on a brief look it seems like the patch https://svnweb.freebsd.org/changeset/base/368448 mentioned in the above ticket should be viable for 12 too if applied manually.

Complete

Details

Impact

Critical

Created December 26, 2020 at 6:11 PM
Updated July 1, 2022 at 4:59 PM
Resolved March 9, 2021 at 10:43 PM