Data over iSCSI Corrupted with TrueNAS 12.0-U1.1

Description

Upgrading to 12.0-U1.1 caused every dataset presented over iSCSI to show as corrupt in ESXi. Rolling back to 11.3-U3.1 allowed me to access and recover VMDKs for VMs that I did not power on or heavily use, as that corruption was recoverable by the built-in self-recovery tools.

Original Text from Reddit
So I rolled back and most of my VMs recovered their journal; all VMs were fine, and I had been doing maintenance on them before I updated to U1.1.

The extent exists on a dedicated pool with a zvol associated with the dataset. The pool is made of two 2 TB drives in a ZFS mirror.

I have been running FreeNAS-11.3-U3.1 since 5/28/2020; yesterday I upgraded to 12.0-U1.1 and the update ran smoothly. I had been performing VM maintenance on VMware ESXi 6.5 Update 2, with most VMs running on this iSCSI target. I shut down all VMs on the iSCSI target and ran the TrueNAS update. Upon reboot I started booting the first VM, and I was greeted with this

All of my Windows VMs showed this

These as well

I have 48 GB of ECC RAM, an E5-1620 v3 @ 3.50 GHz, 8 6 TB WD Red Pros, and the 2 3 TB WD Reds that the iSCSI pool/extent exists on.

I had done multiple reboots and updates on many of the VMs on the iSCSI target before the upgrade, and the ones with the most unrecoverable disk errors (according to their OS) are the ones that spent the most time running over iSCSI on TrueNAS 12. Many of the VMs I shared screenshots of failing to boot showed failures after the rollback to 11.3 as well, and required disk checks (SFC, chkdsk, fsck, etc.) to get going again. Unfortunately, not all data was recoverable: the Rancher cluster I tried to boot had one node fried to an unrecoverable state, and the other two lost their etcd. So I am assuming the damaged data is correlated in some way with whatever was being read or written at the time.

I copied one VM to a local disk in ESXi, and that VMDK was entirely mangled and unrecoverable. Its partition table was intact (NTFS), but every single file on it was marked as corrupt or nonexistent, so you could browse the folders but not access the data. A check disk determined that every file was corrupt.

All of this (the catastrophic failure) is new to me, as I have done this sort of upgrade more times than I can count and have never seen anything like it. The non-VM data in the other pool appeared to be fine, but I am now skeptical that it wasn't damaged in some way too.

I was at least able to preserve one of my AD nodes, so I can regain high availability from it. That is my one silver lining: I don't have to start everything over from scratch, just most things. The unrecoverable Rancher VMs are Kubernetes clusters, and that is where 99.9% of my home infrastructure was running. Granted, for persistence they stored data on NFS on the other pool, which seems to be OK, so I shouldn't have lost that data, but there is 9 TB of personal data there, so again I am skeptical that I didn't lose anything.

In the end, I put all of my VMs, even the "HA" services, on the same piece of hardware, so my losses are on me. An excellent run of well over a decade of moving data from FreeNAS host to FreeNAS host made me complacent, I guess.

TL;DR

Everything was great before the update; I did maintenance on other parts of my infrastructure before the upgrade, and post-upgrade that one change started killing VMDKs fast. Rolling back to 11.3.x stopped the massive corruption. Very confused, very sad, and I now have a large rebuild in front of me.

I really feel bad for the enterprises I know of that are using this to store data, though.

Also, at no point did I ever run out of space on the pool, and most of the VMs were thick-provisioned. I figured that may be relevant, as running out of space can cause some interesting side effects. The zvol is set to a 2 TB size on 3 TB drives, and I avoid thin provisioning there to ensure I never encounter those fun side effects.

The zvol was using the default sync setting, since I was asked about it in the Reddit thread I originally posted in.
https://www.reddit.com/r/freenas/comments/kydtq9/new_120_u1_hotfix_update/gjjttgy?utm_source=share&utm_medium=web2x&context=3

Problem/Justification

None

Impact

None


Activity


Largo February 11, 2021 at 4:34 AM

The fix appears to be working.

The same VMs that failed on 12.0-U1.1 are now functional on 12.0-U2. Thanks again, guys.

Largo February 11, 2021 at 3:29 AM

I just saw this update was released. I am going to test it on my system tonight and report back whether I have a very sad night. Thanks for the support on some ancient hardware that you could easily have said just not to use anymore.

Alexander Motin January 31, 2021 at 6:02 PM

I got a couple of reviews very quickly, so the patch is committed to FreeBSD head and merged into the upcoming TrueNAS 12.0-U2.

Largo January 30, 2021 at 8:20 AM

Great news. When you have a release with the patch in it, I will test it and report back my findings. I have had less time than I'd hoped to build a test environment after getting my core infrastructure back up and running.

Alexander Motin January 30, 2021 at 6:26 AM

I think I've got it. The cxgb(4) driver has a few ugly optimizations from 12-13 years ago, based on the assumption that external data pointed to by the mbuf(9) structure is physically contiguous. But an iSCSI target optimization in TrueNAS 12.0 broke that assumption, in order to avoid allocating and handling tons of small 4 KB mbufs per I/O. Other network drivers, which use the standard bus_dmamap_load_mbuf_sg(9) function to convert an mbuf(9) into a chain of physical addresses, handle this just fine, while cxgb(4), with its custom approach, appears to transfer random bits of memory when virtually contiguous CTL I/O buffers turn out not to be physically contiguous due to memory fragmentation. I've made a small patch removing this ugliness, and first tests look promising, but I'll take another look and ask for review on a fresh head.

Complete

Details

Impact

High


Created January 17, 2021 at 7:36 PM
Updated July 1, 2022 at 5:14 PM
Resolved January 31, 2021 at 6:02 PM