The following system core files were found: python3.9.core.

Description

Summary

Found a warning in the Alerts section as follows:

The following system core files were found: python3.9.core. Please create a ticket at https://jira.ixsystems.com/ and attach the relevant core files along with a system debug. Once the core files have been archived and attached to the ticket, they may be removed by running the following command in shell: 'rm /var/db/system/cores/*'.

I believe this is caused by a kernel panic during replication. The replication runs between two local pools (RAIDZ1 -> single-drive stripe), but I haven't been able to catch the dump. I can try to capture a screenshot if that would help - it seems I can reproduce it reliably.

I am attaching the file as indicated above.
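Since an attachment can end up truncated or incomplete, it may help to record a checksum of the archive before uploading so the recipient can compare it. Below is a minimal sketch, assuming the cores live in /var/db/system/cores as the alert states; the function names `archive_cores` and `sha256_of` and the output path are hypothetical, and this is not an official TrueNAS procedure.

```python
import hashlib
import os
import tarfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_cores(core_dir, out_path):
    """Pack every regular file in core_dir into a gzip-compressed tar.

    out_path should live outside core_dir so the archive does not
    try to include itself. Returns the SHA-256 of the finished archive.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        for name in sorted(os.listdir(core_dir)):
            full = os.path.join(core_dir, name)
            if os.path.isfile(full):
                # Store only the file name, not the full directory path.
                tar.add(full, arcname=name)
    return sha256_of(out_path)


if __name__ == "__main__":
    # Hypothetical output location; adjust to somewhere with enough space.
    print(archive_cores("/var/db/system/cores", "/tmp/cores.tar.gz"))
```

Posting the printed digest in the ticket alongside the attachment lets whoever downloads it run `sha256_of` on their copy and spot a damaged upload right away.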

Problem/Justification

None

Impact

None

Activity

Dxun 
December 2, 2021 at 3:30 PM

Thanks for the feedback - I was able to dig out that core tar and have re-attached it in case you'd like to inspect it further.

Caleb 
December 2, 2021 at 1:13 PM

Thanks for the information and the screenshots. They've been very helpful. Unfortunately, the rar archive of the core dump that you provided seems to be corrupt. It has probably been truncated or was never a complete core dump file. Anyway, the other information you provided makes me suspect that this is fixed in 12.0-U7. I just fixed a ZFS bug where it was not handling inconsistent filesystems. (Inconsistent just means that the filesystem is either (1) in the process of being destroyed or (2) in the process of being received.) Please upgrade to that release once it's out and open a fresh ticket if you get any more core dumps.

Dxun 
November 30, 2021 at 2:49 PM

Attached new memtest86 PRO tests - both machines ran 32 passes, all tests (including ECC error injections) are green. From what I understand, that should conclusively prove memory is working nominally as 32 passes guarantee exhausting the whole set of possible value combinations a given memory cell can have.

Dxun 
November 28, 2021 at 3:28 PM

After a full week of troubleshooting and at least 10 failed replications (of which at least 70% left the target pool in an unusable state, and one almost rendered the source pool unusable as well), I was able to replicate the pool to another machine - but not in a single pass. I was able to determine that the replication tasks seem to crash at particular points in the replication, which led me to suspect a few datasets that have some (or all) of the following:

  • nested datasets of different record sizes (parent dataset record size 1 MB, child dataset record size 16 kB)

  • long paths (potentially exceeding 255 characters)

  • "special" characters in filenames (a few examples would include Japanese characters, or the "squared" character, i.e. a superscript "2")

I cannot say definitively that any or all of these caused the crashes - but I did observe that I was able to transfer these datasets (in isolation) through replication tasks (both local and remote) after trimming the path lengths down or removing the "special" characters from filenames (potentially from directory names as well, but I wasn't paying close attention to that while trimming).
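To narrow down which files match the suspected patterns above, a scan of the dataset's mount point can list them before any renaming. This is a minimal sketch under stated assumptions: the 255 threshold is the guess from this ticket and is counted in characters rather than bytes, "special" is approximated as any non-ASCII character in the name, and the function names are hypothetical. It does not check record sizes, and nothing here confirms these patterns are the actual cause.

```python
import os


def classify_path(path, max_len=255):
    """Return the reasons a path looks suspect (empty list if none).

    max_len is the threshold guessed in the ticket, counted in
    characters; "special" is approximated as non-ASCII in the name.
    """
    reasons = []
    if len(path) > max_len:
        reasons.append("long_path")
    if not os.path.basename(path).isascii():
        reasons.append("non_ascii_name")
    return reasons


def find_suspect_paths(root, max_len=255):
    """Walk root and collect (path, reasons) for every suspect entry."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Check directory names as well as file names.
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            reasons = classify_path(full, max_len)
            if reasons:
                hits.append((full, reasons))
    return hits


if __name__ == "__main__":
    # Hypothetical mount point; replace with the dataset to inspect.
    for path, reasons in find_suspect_paths("/mnt/tank/dataset"):
        print(", ".join(reasons), path, sep="\t")
```

The scan only reports; it never renames or deletes anything, so it should be safe to run on the source pool before deciding what to trim.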

After the replication was successful, I ran scrubs on all the pools that participated in this (there is one source pool [RAID-Z2] and two target pools [one striped mirror, and one single-drive]) - no errors found on any.

One additional note - 4 passes of memtest86 PRO (v9.2) completed without errors on each machine. As I write, I am running a full 32 passes on each machine to rule out memory issues, and will update the ticket as these finish.

Dxun 
November 27, 2021 at 3:22 AM

Investigation continues - it seems some snapshots created yesterday consistently trip up the machine and cause it to reboot. I have no idea what makes them special, but I was able to capture a screenshot of a recent kernel panic.

I was able to replicate 95% of the source pool, and these small datasets seem to be causing some kind of issue with TrueNAS. I will finish the transfer by copying the files instead - this instability is not manageable.

Duplicate

Created November 24, 2021 at 5:45 AM
Updated July 6, 2022 at 8:57 PM
Resolved December 2, 2021 at 1:13 PM