  1. TrueNAS
  2. NAS-108980

ZFS data corruption and random reboots during scrubs on TrueNAS12+


    Details

    • Type: Bug
    • Status: Engineering Closed (View Workflow)
    • Priority: Low
    • Resolution: Cannot Reproduce
    • Affects Version/s: 12.0-U1
    • Fix Version/s: N/A
    • Component/s: OS

      Description

      Hi all,

      Since 28 November I’ve been going through my worst hardware troubleshooting nightmare ever. I hope someone here can help me get to the bottom of this.

      The hardware / software
      ---------------------------------------

      The problem occurs with my 1-year-old FreeNAS / TrueNAS server, which has the following hardware:

      OS: FreeNAS11U3, later upgraded to TrueNAS 12U1 (the upgrade to TrueNAS12 happened in early November)
      Case: Fractal Design Define R6 USB-C
      PSU: Fractal Design ION+ 660W Platinum
      Mobo: ASRock Rack X470D4U2-2T
      NIC: Intel X550-AT2 (onboard)
      CPU: AMD Ryzen 5 3600
      RAM: 32GB DDR4 ECC (2x Kingston KSM26ED8/16ME), which was replaced by 64GB DDR4 ECC (2x Samsung M391A4G43MB1-CTD) at the beginning of November
      HBA: Dell H310 ( = LSI SAS 9211-8i), which was replaced by an LSI SAS 9207-8i at the beginning of 2021 as part of the troubleshooting
      HDDs: 8x WD Ultrastar DC HC510 10TB (RAID-Z2)
      Boot disk: Intel Postville X25-M 160GB
      SLOG: Intel Optane 900P 280GB (added at the beginning of October)
      This server ran for almost a year without issues, until 28 November, and was properly burned in using Memtest86, Prime95, solnet-array-test, badblocks, etc…

      The (horror) issue
      -----------------------------

      I have monthly scrubs scheduled on my server on the 28th of the month. On 28 November I got 6 reboots during the scrub run, and it produced 8 CKSUM errors on 5 different HDDs. Before this, I had never had issues with scrubs…
      I’ve thoroughly checked the logs and found nothing regarding the reboots or the data corruption.
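
      For anyone who wants to compare per-disk numbers across scrub runs, here is a minimal sketch of how the non-zero CKSUM counts could be pulled out of `zpool status` output. The function name `cksum_errors` and the pool name "tank" are made up for illustration, and the filter assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout.

```shell
# Hypothetical helper (not part of ZFS): reads `zpool status` text on stdin
# and prints "<device> <cksum>" for every row whose CKSUM column is non-zero.
cksum_errors() {
    awk '
        # Device rows have the columns NAME STATE READ WRITE CKSUM.
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ && $5 != "0" {
            print $1, $5
        }
    '
}

# Usage on a live system ("tank" is a placeholder pool name):
#   zpool status tank | cksum_errors
```

      Because the function only reads stdin, it can also be run against saved status output from earlier scrubs.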

      Recent hardware / software changes
      -----------------------------------------------------------

      As mentioned above, I replaced the RAM and upgraded from FreeNAS11 to TrueNAS12 in the same month that the errors started occurring. One month before that, I also added an Intel Optane as SLOG (but the scrub on 28 October, after this change, ran without problems).
      So those are of course “likely suspects”…
      A downgrade from TrueNAS12 back to FreeNAS11 isn’t easy, as I’ve already upgraded my pool’s flags, so I would need to destroy my pool and restore all data from my (non-redundant) offline backup (which is something I’m hesitant to do).

      Something weird about the reboots
      ---------------------------------------------------------

      The reboots occur “in pairs”, with about 4m30s / 4m35s in between. I’ve discovered that I can “prevent” the 2nd reboot by manually rebooting before those 4m30s / 4m35s have passed.
      The 2nd reboot also occurs when the pool is still locked (I have encryption on my pool), so it happens while nothing, or hardly anything, is actually going on with the pool.
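
      To make the “pairs” pattern easier to check, here is a small sketch that prints the gap between consecutive boots. It assumes the boot times have already been collected as epoch seconds, one per line and oldest first; how they are collected is left open, and the name `boot_gaps` is hypothetical.

```shell
# Hypothetical helper: reads boot times as epoch seconds (one per line,
# oldest first) on stdin and prints the gap between consecutive boots.
boot_gaps() {
    awk '
        NR > 1 {
            gap = $1 - prev
            printf "%dm%02ds\n", int(gap / 60), gap % 60
        }
        { prev = $1 }
    '
}
```

      A paired reboot would then show up as a gap of roughly 4m30s to 4m35s between two otherwise widely spaced boots.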

      What I’ve tried so far (unsuccessful)
      ----------------------------------------------------------

      • Since 28 November I’ve run about 30-50 scrubs for my troubleshooting, which take about 12 hours per pass. For each change I try, I need to run a scrub at least twice, because the first run can still find CKSUM errors that were created before the change.
      • At first I wasn’t able to reproduce the reboots. In mid-December I discovered that setting sync=always without the Optane SLOG increases the chance of reboots. The reboots still occur very randomly, though (but always in pairs).
      • In Windows, I’ve run an extended self-test on all 8 HDDs simultaneously, while also running Prime95 blended at the same time. This should stress the HDDs, the PSU, the CPU and the RAM all at the same time. No reboots occurred and all HDDs were found to be 100% healthy.
      • Also the SMART values of all HDDs are 100% healthy.
      • I’ve confirmed that Memory ECC reporting works properly with the 64GB RAM (I had already confirmed this with my 32GB RAM before the issues started occurring but, just to make sure, I reconfirmed it with the 64GB RAM).
      • I underclocked the RAM (ran it below spec). This didn’t help.
      • I’ve tried triggering the issue on my Intel Optane by creating a single disk pool on the Optane, constantly running scrub / writing data to it, for a whole day. I couldn’t reproduce the issue on the Intel Optane. This shifted my suspicion to the HBA, as the issue only seemed to occur when using a pool on the HBA.
      • I’ve completely removed the Intel Optane from my server. This also did not solve the issue. (This is when I discovered the impact of sync=always on the likelihood of reboots.)
      • I’ve forced “Power Supply Idle Control” to “Typical Current Idle” in the BIOS and confirmed that CPU C-States were already disabled. This did not help either.
      • I’ve upgraded my BIOS and the IPMI and upgraded from TrueNAS12 to TrueNAS12U1. This did not help either.
      • I’ve re-inserted the HBA and re-attached the SFF-8087 cable, both on the HDD and HBA side. Didn’t help.
      • I’ve (temporarily) added a screamingly loud Delta datacenter 120mm fan. My wife almost kicked me out of the house because of the noise, but, as it again didn’t help, I’m quite sure it is not temperature related.
      • I’ve (temporarily) replaced the SFF-8087 cable with an old one. Didn’t help.
      • I’ve (temporarily) replaced the PSU with an old 850W Seagate PSU. Didn’t help.
      • I’ve bought a new HBA (LSI 9207-8i instead of Dell H310), installed an Intel CPU cooler on it to make sure it is properly cooled, and tried again. Didn’t help.
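
      The “scrub at least twice per change” routine from the list above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `scrub_and_wait` and `verify_change` are hypothetical, “tank” is a placeholder pool name, and resetting the error counters with `zpool clear` between the two passes is one possible way to keep old and new errors apart, not necessarily what was done here.

```shell
# Hypothetical helper: start a scrub on pool $1 and wait until it finishes,
# polling every $2 seconds (default 60), then print the final status.
scrub_and_wait() {
    pool=$1
    interval=${2:-60}
    zpool scrub "$pool" || return 1
    while zpool status "$pool" | grep -q "scrub in progress"; do
        sleep "$interval"
    done
    zpool status -v "$pool"
}

# Hypothetical helper: two scrub passes after a hardware change. The first
# pass may still report damage written before the change; the error
# counters are reset so that the second pass only counts new errors.
verify_change() {
    pool=$1
    interval=${2:-60}
    scrub_and_wait "$pool" "$interval" || return 1
    zpool clear "$pool"
    scrub_and_wait "$pool" "$interval"
}

# Usage ("tank" is a placeholder pool name):
#   zfs set sync=always tank   # the setting that made the reboots more likely
#   verify_change tank
```

      With 12 hours per pass, one call to `verify_change` corresponds to about a day of testing per change.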

      What I’ve tried so far (slightly successful)
      ------------------------------------------------------------------

      • I’ve moved the HBA from PCI-e slot 6 to PCI-e slot 4. Both of these slots come directly from the CPU (not from the chipset), so it shouldn’t really matter, but it did make a difference… There were noticeably fewer CKSUM errors during the many scrub runs I’ve tried. But some CKSUM errors still occurred (about 1 per run, and once even 0). So I’m getting closer (finally :) )
      • I’ve replaced the Ryzen 3600 in my server with my desktop’s Ryzen 3900XT and ran scrub 3x without errors with the HBA in slot 6 and 2x without errors with the HBA in slot 4. So it seems like my issue is related to my CPU! I also discovered (I think) a tiny bit of thermal paste covering about 2-3 pins on the side of my Ryzen 3600 CPU. So with a soft brush and lots of patience / IPA I cleaned it off.
        Finally, I re-inserted my cleaned Ryzen 3600 and ran scrub some more. The first run completed without CKSUM errors, but on the 2nd run I again got 1 error. So although it certainly seems better than before, my problem still isn’t solved :(

      Conclusions so far and questions
      ------------------------------------------------------

      • It seems like the issue is related to PCI-e / the CPU
      • It could be related to the thermal paste covering the pins, but then it is still very strange that:
        • the problem only started occurring after almost a year
        • cleaning it didn’t help to solve it completely
      • When googling for PCI-e errors, I found 2 “remarkable” things in other reports related to PCIe errors:
        • PCI-e errors should be detected
        • PCI-e errors can even be corrected sometimes
      • That the problem didn’t occur with my Intel Optane, but only with the HBA, could be related to the Optane being in PCI-e slot 4, while the HBA was in PCI-e slot 6
      • It still blows my mind that the reboots occur in pairs. This smells like a “software issue”, but everything else clearly points at a “hardware issue”. This is the main reason I am creating a bug report for this. The fact that NOTHING is in the logs regarding these reboots seems like a serious bug to me. With those "reboots in pairs", it can't be a hardware-only issue.
      • Also I wonder why I'm not seeing PCIe errors. I've just confirmed that AER is enabled in my BIOS and in Linux I can clearly see that it is also enabled by the OS:

      -bash-5.1# dmesg |grep -i aer
      [ 0.869870] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
      [ 1.215562] pcieport 0000:00:01.1: AER: enabled with IRQ 26
      [ 1.215726] pcieport 0000:00:01.3: AER: enabled with IRQ 27
      [ 1.215876] pcieport 0000:00:03.1: AER: enabled with IRQ 28
      [ 1.216020] pcieport 0000:00:03.2: AER: enabled with IRQ 29
      [ 1.216215] pcieport 0000:00:07.1: AER: enabled with IRQ 31
      [ 1.216346] pcieport 0000:00:08.1: AER: enabled with IRQ 32
      [ 1.216489] pcieport 0000:00:08.2: AER: enabled with IRQ 33
      [ 1.216630] pcieport 0000:00:08.3: AER: enabled with IRQ 34

      On TrueNAS, I see no such thing in dmesg. Does TrueNAS support AER?

                People

                Assignee:
                releng Triage Team
                Reporter:
                Mastakilla Mastakilla