System hangs frequently when replication runs
Description
Problem/Justification
Impact
Activity

William Gryzbowski May 14, 2021 at 4:40 PM
Since there was no feedback so far I am closing this.
If it happens again, please let us know and we will reopen it.

Josh Wisely May 7, 2021 at 8:35 PM
U3 allowed me to generate debugs successfully.
With the fixes from https://jira.ixsystems.com/browse/NAS-110234 and https://jira.ixsystems.com/browse/NAS-109705, the system hasn't hung again yet.
When/if it does, I'll generate a debug and upload it here.

William Gryzbowski May 7, 2021 at 7:14 PM
U3.1 is out and should allow you to save the debug.
Can you verify and upload the debug, please?

Josh Wisely April 5, 2021 at 9:17 PM
Sadly, saving a debug from the GUI isn't possible due to this bug: https://jira.ixsystems.com/browse/NAS-109706
If you can provide a command-line way to gather the same info, I'm happy to try to gather it.
Spacing the replication jobs at least 30 minutes apart seems to mitigate the hang. It may be related to whatever is causing this bug: https://jira.ixsystems.com/browse/NAS-109705
I'll try again after U3 is released and I've upgraded to it.

Alexander Motin April 5, 2021 at 7:41 PM
All of your debugs from March were made with `freenas-debug -A`, not with System -> Advanced -> Save Debug in the WebUI, so they include neither logs nor kernel dumps. Thanks for the description of the scenario, but we are too limited on time to spend several days setting up a similar environment and trying to reproduce the problem, which may not even succeed. Please check whether your system is able to dump core or prints anything on the console when the hang happens. I also proposed enabling the debug kernel to try to collect additional information.
Replication is between 2 almost identical TrueNAS boxes (only the RAM amount is different). Replication is set to run every hour. A few times a day, the system hangs when trying to run replication. The system does not reboot and has to be manually power cycled. This started immediately after setting up replication.