Boot Issues with TrueNAS-SCALE-22.02.0.1
Description
Problem/Justification
Impact
Attachments
- 08 Jul 2022, 07:53 PM
- 08 Jul 2022, 07:48 PM
- 08 Jul 2022, 04:15 PM
- 08 Jul 2022, 04:15 PM
- 07 Jul 2022, 02:49 PM
Activity
Matthew Bettencourt December 15, 2022 at 6:05 PM
I retested enabling serial console output however I still ran into the same issue. With serial console output enabled the system would not boot.
Matthew Bettencourt December 13, 2022 at 7:52 PM
It looks like the first Bluefin 22.12.0 was released today. I will work on updating my system and re-testing the boot issue and report back.
Umer Saleem August 23, 2022 at 6:20 PM
Fixed under NAS-114289.
Matthew Bettencourt July 10, 2022 at 7:00 PM
Hello @Umer Saleem ,
I have tired booting the system with a variety of BIOS and TN-Scale serial configurations:
BIOS Serial Output | TN-Scale Console | Result |
---|---|---|
Disabled | Enabled (115200 baud rate) | Same result, won’t boot |
Enabled (115200 baud rate) | Enabled (115200 baud rate) | Same result, won’t boot |
Enabled (115200 baud rate) | Disable | System boots |
Disabled | Disabled | System boots |
I have not tried lowering the baud rate to 9600, I can attempt that when I get another chance.
Thanks,
Matthew
Umer Saleem July 10, 2022 at 2:54 PM(edited)
Hi @Matthew Bettencourt, thank you for sharing the serial output and debug files. I have been going through them. I think this has something to do with BIOS settings, particularly how BIOS is initializing and using the serial port.
In the serial output, we see the following message:
Press <ESC> to view diagnostic messagesPress <F1> to enter setup, <F12> Boot Menu, <F10> Network Boot
This makes me think that BIOS has also enabled the serial console, which could be problematic for us if Linux tries to set up a serial console a bit differently from how BIOS has done it.
If we go a bit further in the serial output, we see that it took around 190 seconds to enable the serial console after it has already enabled the console on tty0
:
[ 0.105446] random: crng init done
[ 0.111746] Console: colour VGA+ 80x25
[ 0.134217] printk: console [tty1] enabled
[ 190.481599] printk: console [ttyS0] enabled
[ 191.120846] ACPI: Core revision 20200925
[ 191.652358] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[ 193.070951] APIC: Switch to symmetric I/O mode setup
After enabling the serial console, the overall system performance has been really slow and at times it gets stuck at some place. In the first backtrace that we get, we can see that system was indeed trying to write something to the console:
[ 757.452607] Call Trace:
[ 757.452608] vsnprintf+0x6e/0x4f0
[ 757.452608] sprintf+0x56/0x70
[ 757.452608] info_print_prefix+0x7b/0xd0
[ 757.452609] record_print_text+0x58/0x150
[ 757.452609] console_unlock+0x177/0x530
[ 757.452609] vprintk_emit+0x208/0x250
[ 757.452610] printk+0x58/0x6f
[ 757.452610] _base_make_ioc_operational.cold+0x6d7/0xd3f [mpt3sas]
[ 757.452610] ? mpt3sas_base_attach.cold+0x13f0/0x15ec [mpt3sas]
[ 757.452611] ? __kmalloc+0x111/0x250
[ 757.452611] mpt3sas_base_attach.cold+0x158f/0x15ec [mpt3sas]
[ 757.452611] _scsih_probe+0x66a/0x820 [mpt3sas]
[ 757.452612] local_pci_probe+0x42/0x80
[ 757.452612] ? _cond_resched+0x16/0x40
[ 757.452612] pci_device_probe+0xfd/0x1b0
[ 757.452613] really_probe+0x222/0x480
[ 757.452613] driver_probe_device+0xe1/0x150
[ 757.452613] device_driver_attach+0xa1/0xb0
[ 757.452614] __driver_attach+0x8a/0x150
[ 757.452614] ? device_driver_attach+0xb0/0xb0
[ 757.452614] ? device_driver_attach+0xb0/0xb0
[ 757.452615] bus_for_each_dev+0x78/0xc0
[ 757.452615] bus_add_driver+0x12b/0x1e0
[ 757.452615] driver_register+0x8b/0xe0
[ 757.452615] ? 0xffffffffc0562000
[ 757.452616] _mpt3sas_init+0x1ac/0x1000 [mpt3sas]
[ 757.452616] do_one_initcall+0x44/0x1d0
[ 757.452616] ? do_init_module+0x23/0x240
[ 757.452617] ? kmem_cache_alloc_trace+0xf5/0x200
[ 757.452617] do_init_module+0x4c/0x240
[ 757.452617] __do_sys_finit_module+0xb1/0x110
[ 757.452618] do_syscall_64+0x33/0x80
[ 757.452618] entry_SYSCALL_64_after_hwframe+0x44/0xa9
After this point, we keep on getting the core dumps periodically, with the same message Sending NMI from CPU X to CPUs Y
, which I think are symptoms of the same problem.
We can try a couple of things here to help clarify the problem further:
Keep the serial console enabled from BIOS, remove the
console=ttyS0,115200
parameter from GRUB with serial cable connected andloglevel=8
parameter enabled and try to boot TN-SCALE. If we get any output on serial, this would confirm if BIOS is actually using the serial console for itself.Try to disable the serial console set up by BIOS. There should be some setting in the BIOS that allows to disable the serial console from BIOS, or something that prohibits BIOS from using the serial port. Try to boot TN-SCALE after disabling serial console from BIOS with
console=ttyS0,115200
andloglevel=8
parameters enabled.Try changing the baudrate to
9600
. Update theconsole=ttyS0,115200
parameter toconsole=ttyS0,9600
withloglevel=8
parameter and try to boot TN-SCALE with serial console from BIOS enabled. In the installer image for TN-SCALE, we use theconsole=ttyS0,9600
parameter in the default boot option. If we don't see the problem with9600
baudrate, this might be a clue.The second boot option in installer image is with
115200
baud rate. We should try that boot option as well to see if the problem occurs there as well.
I am having two boot issues with a fresh install of Truenas Scale on my Lenovo TD340.
System info:
Lenovo ThinkServer TD340 bios: A3TSF5A
MoBo: L32TT2 (Lenovo)
CPU: Intel Xeon E5-2420 v2
Memory: 48G DDR3 1333 ECC
Boot Drive: Crucial CT256M4SSD2
NIC: AOC-STGN-I2S (Intel 82599)
1. Booting with SR-IOV enabled: When I had SR-IOV enabled in the BIOS the Truenas system would never fully boot. It booted to GRUB and then started the Truenas bootup process however it would go into what appeared to some sort of loop in the bootup process. I left the bootup process running for about an hour and never fully booted It constantly showed the same error messages:
May 1 19:18:33 truenas kernel: Sending NMI from CPU 6 to CPUs 2:
May 1 19:18:33 truenas kernel: NMI backtrace for cpu 2
May 1 19:18:33 truenas kernel: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.10.93+truenas #1
May 1 19:18:33 truenas kernel: Hardware name: LENOVO ThinkServer TD340 /ThinkServer TD340, BIOS A3TSF5A 09/11/2020
May 1 19:18:33 truenas kernel: RIP: 0010:io_serial_in+0x14/0x20
May 1 19:18:33 truenas kernel: Code: 00 00 d3 e6 48 63 f6 48 03 77 10 8b 06 c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f b6 8f b9 00 00 00 8b 57 08 d3 e6 01 f2 ec <0f> b6 c0 c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f b6 8f b9 00
May 1 19:18:33 truenas kernel: RSP: 0000:ffffb467400239d8 EFLAGS: 00000002
May 1 19:18:33 truenas kernel: RAX: ffffffffb7be5700 RBX: ffffffffb95f5220 RCX: 0000000000000000
May 1 19:18:33 truenas kernel: RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffffb95f5220
May 1 19:18:33 truenas kernel: RBP: 0000000000001eec R08: 0000000000000002 R09: 0000000000000882
May 1 19:18:33 truenas kernel: R10: 000000000000075d R11: 207375625f696370 R12: 0000000000000020
May 1 19:18:33 truenas kernel: R13: ffffffffb94ea3ee R14: 0000000000000001 R15: 0000000000000000
May 1 19:18:33 truenas kernel: FS: 0000000000000000(0000) GS:ffff9fbcffa80000(0000) knlGS:0000000000000000
May 1 19:18:33 truenas kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 1 19:18:33 truenas kernel: CR2: 0000000000000000 CR3: 000000072520a001 CR4: 00000000001706e0
May 1 19:18:33 truenas kernel: Call Trace:
May 1 19:18:33 truenas kernel: wait_for_xmitr+0x40/0xb0
May 1 19:18:33 truenas kernel: serial8250_console_putchar+0x18/0x30
May 1 19:18:33 truenas kernel: ? wait_for_xmitr+0xb0/0xb0
May 1 19:18:33 truenas kernel: uart_console_write+0x43/0x50
May 1 19:18:33 truenas kernel: serial8250_console_write+0x300/0x380
May 1 19:18:33 truenas kernel: ? vt_console_print+0x2bb/0x3f0
May 1 19:18:33 truenas kernel: console_unlock+0x3c6/0x530
May 1 19:18:33 truenas kernel: vprintk_emit+0x208/0x250
May 1 19:18:33 truenas kernel: dev_vprintk_emit+0x12c/0x150
May 1 19:18:33 truenas kernel: dev_printk_emit+0x4e/0x65
May 1 19:18:33 truenas kernel: ? __dev_printk+0x2d/0x69
May 1 19:18:33 truenas kernel: _dev_info+0x6c/0x83
May 1 19:18:33 truenas kernel: pci_bus_dump_resources.cold+0x14/0x19
May 1 19:18:33 truenas kernel: pci_bus_dump_resources+0x5b/0x70
May 1 19:18:33 truenas kernel: pci_assign_unassigned_root_bus_resources+0xfd/0x1c0
May 1 19:18:33 truenas kernel: pci_assign_unassigned_resources+0x1f/0x7c
May 1 19:18:33 truenas kernel: pcibios_assign_resources+0x1b/0xcd
May 1 19:18:33 truenas kernel: ? xsk_init+0xbe/0xbe
May 1 19:18:33 truenas kernel: do_one_initcall+0x44/0x1d0
May 1 19:18:33 truenas kernel: kernel_init_freeable+0x21e/0x280
May 1 19:18:33 truenas kernel: ? rest_init+0xb4/0xb4
May 1 19:18:33 truenas kernel: kernel_init+0xa/0x10c
May 1 19:18:33 truenas kernel: ret_from_fork+0x22/0x30
May 1 19:18:33 truenas kernel: pci_bus 0000:0c: resource 10 [mem 0x80000000-0xfbffffff window]
May 1 19:18:33 truenas kernel: rcu: rcu_sched kthread starved for 706 jiffies! g-1119 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=1
2. Long boot up times with SR-IOV disabled: The other issue I have am having is what I think would be considered long bootup times. If I time the bootup process from the moment GRUB starts to when the web gui is available the boot times is usually in the 20-25 minute range.
If these two issues need to be slit into seperate JIRAs let me know.
I also tested Truenas Bluefin 22.12-master-20220429-150401 version and had the same results.
I did some additional testing boot environments for a sanity check and they all seem to be working as expected.
1. Debian 11.3 with SR-IOV Enabled: It was just a base system so not many packages had to load, but at least I know linux can boot correctly with SR-IOV enabled.
2. Truenas Core 12.0-U8 SR-IOV Enabled: System booted successfully, and was well under 2 minutes until the system was fully booted.
I will attach debug files from both Truenas installs I did 22.02.0.1 and 22.15. Currently the system is not in production so I can preform additional tests if required.