System freezing/not restarting after kernel bugs

Description

My system just randomly got a page fault somewhere in kernel land. The only thing I had running were some virtual machines (mostly idle) and a resilver on one of my zpools.

I don't know if there is anything that can be done about the kernel bug, but I'd like to see truenas automatically restart when this happens. I'm not always on-site and dont have ipmi on this system so if I'm not around, the system would stay offline until someone presses the reset button.

This is what I could find in kern.log

Nov 4 22:20:45 truenas kernel: BUG: unable to handle page fault for address: 000000000002a5c0
Nov 4 22:20:45 truenas kernel: #PF: supervisor read access in kernel mode
Nov 4 22:20:45 truenas kernel: #PF: error_code(0x0000) - not-present page
Nov 4 22:20:45 truenas kernel: PGD 0 P4D 0
Nov 4 22:20:45 truenas kernel: Oops: 0000 #1 SMP NOPTI
Nov 4 22:20:45 truenas kernel: CPU: 2 PID: 2925621 Comm: threadpool_ws Tainted: P OE 5.10.70+truenas #1
Nov 4 22:20:45 truenas kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A38/B450M PRO-VDH MAX (MS-7A38), BIOS B.40 11/07/2019
Nov 4 22:20:45 truenas kernel: RIP: 0010:page_remove_rmap+0x7c/0x4c0
Nov 4 22:20:45 truenas kernel: Code: ff 83 e2 01 48 8b 55 00 48 0f 44 c5 48 c1 ea 36 48 8b 40 38 48 8b 3c d5 a0 e4 66 8b 48 85 c0 0f 84 22 03 00 00 0f 1f 44 00 00 <48> 63 97 c0 a5 02 00 4c 8b 84 d0 00 0b 00 00 49 3b b8 80 00 00 00
Nov 4 22:20:45 truenas kernel: RSP: 0018:ffffadfd96bf7b20 EFLAGS: 00010286
Nov 4 22:20:45 truenas kernel: RAX: ffff890cc25f5000 RBX: 0000000000000000 RCX: 0000000000000001
Nov 4 22:20:45 truenas kernel: RDX: 00000000000003c0 RSI: 0000000000000000 RDI: 0000000000000000
Nov 4 22:20:45 truenas kernel: RBP: fffff3b18458f100 R08: fffff3b18458f100 R09: ffffffffffffffff
Nov 4 22:20:45 truenas kernel: R10: ffffffffffffffe6 R11: 0000000000000001 R12: 00007fae988cb000
Nov 4 22:20:45 truenas kernel: R13: ffffadfd96bf7c80 R14: 00007fae9886a000 R15: 00007fae98869000
Nov 4 22:20:45 truenas kernel: FS: 00007fae5163a700(0000) GS:ffff89130e680000(0000) knlGS:0000000000000000
Nov 4 22:20:45 truenas kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 4 22:20:45 truenas kernel: CR2: 000000000002a5c0 CR3: 000000040a0dc000 CR4: 00000000003506e0
Nov 4 22:20:45 truenas kernel: Call Trace:
Nov 4 22:20:45 truenas kernel: unmap_page_range+0x584/0xc50
Nov 4 22:20:45 truenas kernel: unmap_vmas+0x78/0xf0
Nov 4 22:20:45 truenas kernel: exit_mmap+0xad/0x1a0
Nov 4 22:20:45 truenas kernel: mmput+0x56/0x130
Nov 4 22:20:45 truenas kernel: begin_new_exec+0x383/0x960
Nov 4 22:20:45 truenas kernel: load_elf_binary+0x6e3/0x15d0
Nov 4 22:20:45 truenas kernel: ? tomoyo_find_next_domain+0x268/0x860
Nov 4 22:20:45 truenas kernel: bprm_execve+0x2ed/0x6b0
Nov 4 22:20:45 truenas kernel: do_execveat_common+0x192/0x1c0
Nov 4 22:20:45 truenas kernel: __x64_sys_execve+0x39/0x50
Nov 4 22:20:45 truenas kernel: do_syscall_64+0x33/0x80
Nov 4 22:20:45 truenas kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 4 22:20:45 truenas kernel: RIP: 0033:0x7fae989ee6c7
Nov 4 22:20:45 truenas kernel: Code: Unable to access opcode bytes at RIP 0x7fae989ee69d.
Nov 4 22:20:45 truenas kernel: RSP: 002b:00007fae51638118 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
Nov 4 22:20:45 truenas kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fae989ee6c7
Nov 4 22:20:45 truenas kernel: RDX: 0000000003b4d400 RSI: 00007fae703e2270 RDI: 00007fae52a4a290
Nov 4 22:20:45 truenas kernel: RBP: 00007fae703e2270 R08: 0000000000000003 R09: 00007fae00000000
Nov 4 22:20:45 truenas kernel: R10: fffffffffffff201 R11: 0000000000000246 R12: 0000000000000003
Nov 4 22:20:45 truenas kernel: R13: 00007fae716df070 R14: 0000000000000000 R15: 0000000000000040
Nov 4 22:20:45 truenas kernel: Modules linked in: ip_set(E) nfnetlink(E) ip_vs(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E) tun(E) binfmt_misc(E) ipmi_devintf(E) ipmi_msghandler(E) essiv(E) authenc(E) dm_crypt(E) dm_mod(E) bridge(E) stp(E) llc(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_intel_dspcfg(E) soundwire_intel(E) edac_mce_amd(E) soundwire_generic_allocation(E) kvm_amd(E) snd_soc_core(E) snd_compress(E) soundwire_cadence(E) amdgpu(E) snd_hda_codec(E) kvm(E) snd_hda_core(E) ppdev(E) irqbypass(E) snd_hwdep(E) soundwire_bus(E) gpu_sched(E) rapl(E) ttm(E) snd_pcm(E) drm_kms_helper(E) snd_timer(E) pcspkr(E) cec(E) snd(E) ccp(E) efi_pstore(E) wmi_bmof(E) k10temp(E) sp5100_tco(E) parport_pc(E) watchdog(E) soundcore(E) rng_core(E) i2c_algo_bit(E) parport(E) sg(E) evdev(E) button(E) ntb_netdev(E) ntb_transport(E) ntb(E) ioatdma(E) dca(E) fuse(E) drm(E) configfs(E) efivarfs(E) ip_tables(E)
Nov 4 22:20:45 truenas kernel: x_tables(E) autofs4(E) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) sd_mod(E) t10_pi(E) crc_t10dif(E) crct10dif_generic(E) crct10dif_pclmul(E) crct10dif_common(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) ahci(E) libahci(E) aesni_intel(E) libaes(E) crypto_simd(E) xhci_pci(E) cryptd(E) glue_helper(E) libata(E) xhci_hcd(E) r8169(E) realtek(E) mdio_devres(E) i2c_piix4(E) libphy(E) usbcore(E) scsi_mod(E) gpio_amdpt(E) wmi(E) video(E) gpio_generic(E)
Nov 4 22:20:45 truenas kernel: CR2: 000000000002a5c0
Nov 4 22:20:45 truenas kernel: ---[ end trace c3977ac4f3e252dc ]---
Nov 4 22:20:45 truenas kernel: RIP: 0010:page_remove_rmap+0x7c/0x4c0
Nov 4 22:20:45 truenas kernel: Code: ff 83 e2 01 48 8b 55 00 48 0f 44 c5 48 c1 ea 36 48 8b 40 38 48 8b 3c d5 a0 e4 66 8b 48 85 c0 0f 84 22 03 00 00 0f 1f 44 00 00 <48> 63 97 c0 a5 02 00 4c 8b 84 d0 00 0b 00 00 49 3b b8 80 00 00 00
Nov 4 22:20:45 truenas kernel: RSP: 0018:ffffadfd96bf7b20 EFLAGS: 00010286
Nov 4 22:20:45 truenas kernel: RAX: ffff890cc25f5000 RBX: 0000000000000000 RCX: 0000000000000001
Nov 4 22:20:45 truenas kernel: RDX: 00000000000003c0 RSI: 0000000000000000 RDI: 0000000000000000
Nov 4 22:20:45 truenas kernel: RBP: fffff3b18458f100 R08: fffff3b18458f100 R09: ffffffffffffffff
Nov 4 22:20:45 truenas kernel: R10: ffffffffffffffe6 R11: 0000000000000001 R12: 00007fae988cb000
Nov 4 22:20:45 truenas kernel: R13: ffffadfd96bf7c80 R14: 00007fae9886a000 R15: 00007fae98869000
Nov 4 22:20:45 truenas kernel: FS: 00007fae5163a700(0000) GS:ffff89130e680000(0000) knlGS:0000000000000000
Nov 4 22:20:45 truenas kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 4 22:20:45 truenas kernel: CR2: 00007fae989ee69d CR3: 000000040a0dc000 CR4: 00000000003506e0

On the local monitor I was able to capture this

I had one ssh session running to the server, it printed this before dying

2021 Nov 4 22:20:45 truenas BUG: unable to handle page fault for address: 000000000002a5c0
2021 Nov 4 22:20:45 truenas #PF: supervisor read access in kernel mode
2021 Nov 4 22:20:45 truenas #PF: error_code(0x0000) - not-present page

A debug log was generated right after I restarted the system (it wouldnt restart automatically, it was frozen), I'll attach that.

Problem/Justification

None

Impact

None

Activity

Umer Saleem 
January 11, 2022 at 12:42 PM

By default, Linux kernel will never reboot automatically after a panic. However, there are multiple ways to configure Linux to do so.

Add the following line to /etc/sysctl.conf:

Alternatively, write to /proc/sys/kernel/panic:

Here, N is the number of seconds you would want to wait after a panic before triggering a reboot to collect crash information. The recommended value is 30. The default value is 0, which disables system reboot after panic.

From the crash dump, we can see that this is an oops, which indicates a deviation from correct behavior of Linux Kernel. Usually, kernel oops might not cause a panic, unless kernel is configured to do so.

To enable panic on oops, add the following to /etc/sysctl.conf:

Alternatively, write to /proc/sys/kernel/panic_on_oops:

From the debug information attached, if we take a look at fndebug/Sysctl/dump.txt, it is evident that Linux kernel is not configured to reboot on panic, as kernel.panic and kernel.panic_on_oops variables are set to 0.

TrueNAS-SCALE 22.02-RC.1 with Linux kernel version 5.10.70+truenas was used here.

 

Alexander Motin 
January 10, 2022 at 8:34 PM

It is weird that the same system worked fine on Core, but NMI errors in other ticket obviously tells that something is not right there.  We'll investigate what to do with reboots after panics.

PolloLoco 
November 11, 2021 at 4:52 PM

Okay thats fair. I know that using such hardware isnt the best way to go. I only reported it because those issues came as soon as I installed scale rc1, it worked perfectly on freenas 11 and truenas core, thats why I assumed it had something to do with the new scale release.

But besides that, the reason why I reported this is because I don't think the system should freeze when such a bug occurs. Dumping callstacks and similar to kern.log sure, but then it should instantly reset. Is there any way to make it reset on these kind of bugs? Or is that something ixsystems would have to implement (or upstream linux)?

Alexander Motin 
November 11, 2021 at 4:15 AM

Same as I have told in other ticket, memory/hardware problems may cause all sorts of random errors. Please run some good long memory test.

Complete

Details

Assignee

Reporter

Labels

Impact

Time remaining

0m

Affects versions

Priority

Katalon Platform

Created November 4, 2021 at 9:39 PM
Updated July 1, 2022 at 5:52 PM
Resolved January 14, 2022 at 9:29 AM