  1. TrueNAS
  2. NAS-108627

Data corruption after TrueNAS upgrade


    Details

    • Type: Bug
    • Status: Done
    • Priority: Blocker
    • Resolution: Complete
    • Affects Version/s: 12.0-U1, 12.0-RELEASE
    • Fix Version/s: 12.0-U1.1
    • Component/s: ZFS
    • Labels:
      None

      Description

      Copy and pasting from TrueNAS forums: https://www.truenas.com/community/threads/freenas-now-truenas-is-no-longer-stable.89445

      Subject: FreeNAS (now TrueNAS) is no longer stable

      Hello, I'm writing this with deep concerns about TrueNAS/FreeNAS and a move that seems somewhat irresponsible regarding quality and testing.

       

      I have three virtualization pools that have relied on FreeNAS for years. One has been running since 2013, another since 2014, and the newest since 2016.

       

      On the 2014 pool, we updated from FreeNAS 11.3-U5 to TrueNAS 12.0-RELEASE three weeks ago, on November 20th. Suddenly we started to discover extreme VM corruption within the XFS filesystem. Everything was getting corrupted, including the filesystem superblocks, to the point that xfs_repair could not recover the filesystems.

       

      We blamed everyone: the hypervisor (in this case oVirt 4.4); the fabric and network, since this pool uses a Cisco Catalyst 2960X as its core switch, which is not ideal; and the XFS filesystem, due to issues in writeback mode. We didn't even consider blaming TrueNAS. I even opened a discussion on the hypervisor mailing list, but nothing conclusive was found: https://lists.ovirt.org/archives/list/users@ovirt.org/message/2DVB4ULURXWJ5VGHX64FDUZW27F7DY3J

       

      So for the next few days we mainly blamed the network, since some packets were being dropped on the switch. We concluded that the load on the environment had, for whatever reason, increased and that the drops could be causing the issue. Someone on the mailing list recommended falling back to NFSv3 instead of NFSv4 for VM storage, because of weird things happening under load. We tried it; the situation improved, but the issue was still happening.

       

      This Monday we had maintenance scheduled on the 2013 and 2016 pools, so it was upgrade time. We upgraded both pools to TrueNAS 12.0-RELEASE; 12.0-U1 wasn't available yet.

       

      Everything went fine... but this Thursday the mail server on the 2013 pool went down after its iSCSI disk disconnected due to I/O errors. We looked into what had happened, and the result was that the VM was completely trashed: corruption in the filesystem, the operating system, the service, and the databases that held the mailboxes. Other VMs, such as a web server, are completely trashed too. So it's a disaster scenario.

       

      Regarding the 2016 pool, I've already detected in-place XFS corruption in one VM. As a safety measure, everything was shut down.

       

      So what happened?

       

      All three pools have different equipment and software; the only common denominator is the storage system, which had ranged from FreeNAS 11.1 to 11.3. The hypervisors are mixed: oVirt 4.3, oVirt 4.4 and XenServer 7.2. Two of them use iSCSI as the storage backend and one uses NFSv3. The hardware is completely different as well. So, as you can see, TrueNAS is the only piece that's the same.

       

      For now, I've upgraded everything from 12.0-RELEASE to 12.0-U1, in the hope that this will fix these issues.

       

      I don't have any artifacts with which to blame FreeNAS/TrueNAS; the only thing I have is my word about what happened on those pools. I never had any issues with FreeNAS/TrueNAS in almost 8 years of running it, but this move to 12.0 seems to have been rushed by iXsystems. No logs are generated within TrueNAS: no errors, no health issues on the zpools, nothing. This leads me to believe that the software is in a silently unstable state.

       

      I don't have any options right now: I can't downgrade to 11.3-U5/6/7/etc., since the zpools were upgraded on all three systems. But there's one thing that damaged my trust in iX releasing properly stable versions of TrueNAS.

       

      After the upgrade I noticed that 12.0-RELEASE was built on RC (Release Candidate) code:

       

      {CODE}

      Last login: Tue Dec  8 17:24:17 2020
      FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS

          TrueNAS (c) 2009-2020, iXsystems, Inc.
          All rights reserved.
          TrueNAS code is released under the modified BSD license with some
          files copyrighted by (c) iXsystems, Inc.

          For more information, documentation, help or support, go here:
          http://truenas.com

      FreeBSD freenas.win.versatushpc.com.br 12.2-RC3 FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS  amd64

      {CODE}
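The check made by eye on the banner above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_release` is hypothetical, and the regular expression assumes release strings of the form `<major>.<minor>-<BRANCH><n>` with an optional `-p<patch>` suffix, which matches the two strings quoted in this report but is not taken from any official specification.

```python
import re

# Assumed shape of a FreeBSD release string, e.g. "12.2-RC3" or
# "12.2-RELEASE-p2". This pattern is an illustration, not an official grammar.
RELEASE_RE = re.compile(r"^(\d+\.\d+)-([A-Z]+)(\d*)(?:-p(\d+))?$")


def classify_release(release):
    """Split a release string and flag whether it is a final RELEASE build."""
    match = RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    version, branch, _number, patch = match.groups()
    return {
        "version": version,
        "branch": branch,
        "is_final": branch == "RELEASE",
        "patch_level": int(patch) if patch else 0,
    }


if __name__ == "__main__":
    # The two strings reported in this ticket.
    for release in ("12.2-RC3", "12.2-RELEASE-p2"):
        print(release, classify_release(release))
```

On a live system the input would be the release field printed by `uname`, as in the banner quoted above.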

       

      OpenZFS 2.0 hadn't even been released yet, which led to confusion. When 12.0-RELEASE was announced, I understood that OpenZFS 2.0 had been released along with it, but this seems not to be the case, since the announcement was made two days ago, on December 10th: https://www.ixsystems.com/blog/openzfs-2-on-truenas

       

      What are we running on 12.0-RELEASE?

      {CODE}

      root@freenas:~ # pkg info | grep -i zfs
      beadm-1.4                      Solaris-like utility to manage Boot Environments on ZFS
      iohyve-0.7.9                   bhyve manager utilizing ZFS and other FreeBSD tools
      openzfs-2020100200             OpenZFS userland for FreeBSD
      openzfs-kmod-2020100200        OpenZFS kernel module for FreeBSD
      py38-libzfs-1.0.202008212020   Python libzfs bindings
      py38-zettarepl-0.1_24          Cross-platform ZFS replication solution

      {CODE}

       

      An OpenZFS snapshot from October 2nd. This is not STABLE at all...
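For anyone who wants to read the snapshot date off the package list without squinting, here is a minimal Python sketch. It assumes the `openzfs` and `openzfs-kmod` version strings are a `YYYYMMDD` date followed by a two-digit counter; that is how they are read in this report, not something confirmed by the package documentation. The function name `snapshot_dates` and the embedded sample are illustrative only.

```python
import re
from datetime import date

# Sample lines copied from the `pkg info | grep -i zfs` output in this report.
PKG_INFO_OUTPUT = """\
beadm-1.4                      Solaris-like utility to manage Boot Environments on ZFS
openzfs-2020100200             OpenZFS userland for FreeBSD
openzfs-kmod-2020100200        OpenZFS kernel module for FreeBSD
py38-zettarepl-0.1_24          Cross-platform ZFS replication solution
"""

# Assumption: the version is YYYYMMDD followed by a two-digit counter.
SNAPSHOT_RE = re.compile(r"^(openzfs(?:-kmod)?)-(\d{4})(\d{2})(\d{2})(\d{2})\s")


def snapshot_dates(pkg_info_text):
    """Map each OpenZFS package name to the date encoded in its version."""
    dates = {}
    for line in pkg_info_text.splitlines():
        match = SNAPSHOT_RE.match(line)
        if match:
            name, year, month, day, _counter = match.groups()
            dates[name] = date(int(year), int(month), int(day))
    return dates


if __name__ == "__main__":
    for name, snapshot in snapshot_dates(PKG_INFO_OUTPUT).items():
        print(f"{name}: snapshot dated {snapshot.isoformat()}")
```

Packages whose versions do not follow the assumed date pattern, such as `beadm` or `py38-zettarepl`, are simply skipped.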

       

      In 12.0-U1 we got the properly released OpenZFS version and a non-RC FreeBSD 12 system, as we would expect from a RELEASE build.

      {CODE}

      FreeBSD freenas.win.versatushpc.com.br 12.2-RELEASE-p2 FreeBSD 12.2-RELEASE-p2 663e6b09467(HEAD) TRUENAS  amd64

      root@freenas:~ # pkg info | grep -i zfs
      beadm-1.4                      Solaris-like utility to manage Boot Environments on ZFS
      iohyve-0.7.9                   bhyve manager utilizing ZFS and other FreeBSD tools
      openzfs-2020120100             OpenZFS userland for FreeBSD
      openzfs-kmod-2020120100        OpenZFS kernel module for FreeBSD
      py38-libzfs-1.0.202011201432   Python libzfs bindings
      py38-zettarepl-0.1_27          Cross-platform ZFS replication solution

      {CODE}

       

      Yeah, so... given the evidence, I cannot conclude anything other than this: 12.0-RELEASE is not STABLE. It should not have been marketed as STABLE in the first place. Even after upgrading, 12.0-U1 is still marked as a development branch that should not be used in production: https://jira.ixsystems.com/browse/NAS-108580. Yes, it may seem to be a cosmetic bug, but for paying customers TrueNAS 12 isn't available yet. So this whole TrueNAS Core thing leads to extreme confusion. There are clearly two separate branches, the open source release and the one that iX ships, which is fine, but this should be explained better.

       

      For now I don't even know whether 12.0-U1 will solve the reported issues, or whether 12.0-U1 will be considered stable. Because it's not.

       

      Regarding the original issue, I'm fairly confident that the problems were a consequence of running 12.0-RELEASE. People can blame me for "upgrading too early" or say "you should have paid for support since your environment is critical", or offer other nonsense like "you probably don't know how to build proper ZFS systems". But the reality is that none of these applies to the situation.

       

      I know that iX is not responsible for this; this is FOSS software, delivered "as is". This is just a warning to keep running FreeNAS 11.3-U5/6/7/etc. until things get really stable on the 12.0 branch. Keep an eye on the paying customers and watch when they receive the update; I've read somewhere that this release will be on December 22nd. We hope it will be stable, so people can have a proper Christmas and a good new year.

       

      Thanks for listening.

      PS: If there's any artifact that I can generate to help investigate further, I'm totally willing to do it, but I don't know what I could provide. All three pools have now been upgraded to 12.0-U1.

       

       

        Attachments

        1. bootProcess20210112.mov
          48.77 MB
          Vinícius Ferrão
        2. crashAfterKo1.png
          528 kB
          Vinícius Ferrão
        3. crashAfterKo2.png
          545 kB
          Vinícius Ferrão
        4. crashTrueNAS-2016system.png
          534 kB
          Vinícius Ferrão
        5. crashTrueNAS-2016system1.png
          433 kB
          Vinícius Ferrão
        6. crashTrueNAS-2016system2.png
          386 kB
          Vinícius Ferrão
        7. crashTrueNAS-2016system3.png
          550 kB
          Vinícius Ferrão
        8. crashTrueNAS-2016system4.png
          441 kB
          Vinícius Ferrão
        9. debug-2013system-20201212174318.tgz
          999 kB
          Vinícius Ferrão
        10. debug-2013system-20201213105843.tgz
          1010 kB
          Vinícius Ferrão
        11. debug-2014system-20201212174335.tgz
          2.44 MB
          Vinícius Ferrão
        12. debug-cwpstorage-20210118184657.tgz
          1.03 MB
          Peter Nunn
        13. debug-storage2013system-20201226025114.tgz
          1.29 MB
          Vinícius Ferrão
        14. debug-storage2013system-20210105121758.tgz
          1.50 MB
          Vinícius Ferrão
        15. debug-storage2013system-after-ryan-ko-20210115005211.tgz
          1.24 MB
          Vinícius Ferrão
        16. image-2020-12-28-15-19-54-502.png
          389 kB
          Vinícius Ferrão
        17. image-2020-12-28-15-25-05-806.png
          207 kB
          Vinícius Ferrão
        18. image-2021-01-13-12-19-12-282.png
          599 kB
          Eagleman
        19. image-2021-01-13-16-59-00-452.png
          269 kB
          Eagleman
        20. memtest86-2013system.png
          401 kB
          Vinícius Ferrão
        21. nootGoodSupported.png
          128 kB
          Vinícius Ferrão
        22. openzfs.ko
          4.93 MB
          Ryan Moeller
        23. openzfs-debug.ko
          6.21 MB
          Ryan Moeller
        24. unknownFS.jpg
          453 kB
          Vinícius Ferrão
        25. xfsCorruption.png
          220 kB
          Vinícius Ferrão
        26. xfsSuperblockGone.jpg
          109 kB
          Vinícius Ferrão
        27. xfsSuperblockGone2.jpeg
          126 kB
          Vinícius Ferrão
        28. xfsSuperblockGone3.jpg
          240 kB
          Vinícius Ferrão


                  People

                  Assignee:
                  mav Alexander Motin
                  Reporter:
                  viniciusferrao Vinícius Ferrão
                  Votes:
                  5
                  Watchers:
                  24
