Hello, one of my systems has a lagg interface built from four gigabit Ethernet cards, with VLANs on top of it. After a reboot, the system never came back online: it crashed during boot, just after importing the pool. That probably reduced my life expectancy, since for some hours I thought I had lost my entire pool.
But it was in fact a race condition within the lagg driver.
According to this FreeBSD Bugzilla report, there is an issue in the implementation regarding the epoch handling of lagg interfaces: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240609
I've been scratching my head the entire Christmas trying to find at least a workaround, since this setup worked without any issue on FreeNAS 11.2-RELEASE.
I finally had the idea to boot the system with 11.2-RELEASE first, so that the interfaces would come up and, if this really was a race condition, the network cards and the switch would be left in a state that does not trigger the race in if_lagg.c. This idea worked. The system is now online on 12.0-U1.
But I know that if I shut the system down and remove its power, the issue will come back and I'll be unable to boot the system.
I also took the time to validate this against 12.0-RELEASE, and it is affected as well.
After reading the TrueNAS OS code located here: https://github.com/freenas/os/blob/freenas/12-stable/sys/net/if_lagg.c
I can confirm that an epoch tracker is used by the driver right here:
#define LAGG_RLOCK() struct epoch_tracker lagg_et; epoch_enter_preempt(net_epoch_preempt, &lagg_et)
#define LAGG_RUNLOCK() epoch_exit_preempt(net_epoch_preempt, &lagg_et)
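To make clear what these two macros do, here is a minimal userland sketch. It does not reproduce the crash and is not the kernel code: the struct definitions, the depth counter and the bodies of epoch_enter_preempt() and epoch_exit_preempt() are stand-ins I made up so the macros compile outside the kernel, and lagg_read_section_demo() is a hypothetical caller. The only point is that LAGG_RLOCK() declares a local tracker named lagg_et, so the matching LAGG_RUNLOCK() has to sit in the same scope.

```c
#include <stddef.h>

/* Userland stand-ins (assumptions): the real types and functions
 * are provided by the FreeBSD kernel, not by this file. */
struct epoch_tracker { int entered; };
struct epoch { int depth; };          /* number of active read sections */
typedef struct epoch *epoch_t;

static struct epoch mock_epoch;
epoch_t net_epoch_preempt = &mock_epoch;

/* Mock: mark the tracker as entered and count the section. */
void epoch_enter_preempt(epoch_t e, struct epoch_tracker *et)
{
    et->entered = 1;
    e->depth++;
}

/* Mock: mark the tracker as left and uncount the section. */
void epoch_exit_preempt(epoch_t e, struct epoch_tracker *et)
{
    et->entered = 0;
    e->depth--;
}

/* The two macros as quoted from if_lagg.c. */
#define LAGG_RLOCK() struct epoch_tracker lagg_et; epoch_enter_preempt(net_epoch_preempt, &lagg_et)
#define LAGG_RUNLOCK() epoch_exit_preempt(net_epoch_preempt, &lagg_et)

/* Hypothetical read path: returns the depth seen inside the section. */
int lagg_read_section_demo(void)
{
    int depth_inside;

    LAGG_RLOCK();                     /* declares lagg_et in this scope */
    depth_inside = net_epoch_preempt->depth;
    LAGG_RUNLOCK();                   /* must use the same lagg_et */
    return depth_inside;
}
```

With these mocks, a call to lagg_read_section_demo() sees one active section while inside the LAGG_RLOCK()/LAGG_RUNLOCK() pair and leaves none behind afterwards.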
This if_lagg.c seems extremely different from what is in FreeBSD HEAD according to the Bugzilla report I mentioned at the start, which is strange.
Since LAGG and VLAN are basic features in enterprise environments, I'm marking this bug as CRITICAL.
I'm attaching the crash artifact as an image and a debug output of the system if needed.