Failed IPMI causes watchdogd to hang UI
Activity
Alexander Motin August 3, 2021 at 6:23 PM
I've merged the patches into TN 12.0-U6. They should reach 13.0 via FreeBSD's MFC in some time.
gcs8 July 30, 2021 at 4:09 AM
Sweet, sounds like that will work out better than my idea of hitting processes with a bat till they act right. Thanks for the hard work.
Alexander Motin July 30, 2021 at 3:57 AM
This is what I am talking about:
https://cgit.freebsd.org/src/commit/?id=9d3b47abbba74830661e90206cc0f692b159c432
https://cgit.freebsd.org/src/commit/?id=74f80bc1af2ffd56ec290f610c80e46f768731a0
The first patch fixes watchdog operation after a BMC reset, while the second keeps `sysctl dev.cpu` from getting stuck while the IPMI driver is spinning, waiting for a BMC response. I've tested them by manually resetting the BMC via the IPMI driver itself.
Alexander Motin July 30, 2021 at 2:43 AM
I am still working on the problem. I already have one prospective patch for the IPMI driver that slightly improves error handling, but I found it insufficient, and I think the final solution may end up different. I suspect the middleware becomes unresponsive because its attempts to read temperature from the CPUs are blocked by the IPMI driver. It is a known problem, just with a different kind of CPU blocker. I think I have an idea how to fix it from the coretemp(4) side. I.e. IPMI spinning may still eat one CPU thread, but the WebUI will not suffer so much.
gcs8 July 30, 2021 at 12:15 AM
So, what do you propose as a solution to keep a BMC that goes sideways from crippling a system? Is there some counter that can be incremented, but cleared on reboot, so that after X failures in Y time the system stops trying to query the BMC? I get that this is an edge case, but it takes ~4 hours to a couple of days to get a new motherboard, or longer with the silicon shortages going on. I can't be the only one who thinks the system being crippled during that time is a problem.
I am just asking if there is a non-pain-in-the-ass way to have a self-healing failsafe, and maybe stop the email alert every five minutes when it fails.
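To make the counter idea above concrete, here is a minimal sketch in Python of a failure-window breaker. It is not how watchdogd, the middleware, or the IPMI driver actually behave; the class name `FailureWindowBreaker`, the `poll` helper, and the thresholds are all hypothetical. State is kept only in memory, so it clears on reboot as suggested.

```python
import time
from collections import deque


class FailureWindowBreaker:
    """Stop querying a flaky device after too many failures in a time window.

    Hypothetical illustration only; state is in memory and is lost on reboot.
    """

    def __init__(self, max_failures, window_seconds, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures
        self.tripped = False

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        # Forget failures that are older than the window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.tripped = True  # stays tripped until the process restarts

    def allow_query(self):
        return not self.tripped


def poll(breaker, query):
    """Call query() unless the breaker has tripped.

    Returns None when the query is skipped or raises OSError.
    """
    if not breaker.allow_query():
        return None
    try:
        return query()
    except OSError:
        breaker.record_failure()
        return None
```

A caller would wrap each BMC query in `poll(breaker, ...)`, for example with `FailureWindowBreaker(max_failures=3, window_seconds=300)`. Once the breaker trips, further queries are skipped, which would also be a natural point to send one alert instead of one every five minutes.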
I had the IPMI on my SM MoBo die, and when it did it seems to have rebooted the server; ~00:57:21 is when it came back up.
I noticed that while the IPMI is dead, the middleware freaks out: it fails to load data, fails to load pages in a timely manner, etc.
I was able to get use of the system back by stopping and disabling watchdogd.