Updating encryption properties on a dataset makes all children unreadable + kernel panic
Description
Activity
Alexander Motin January 11, 2022 at 9:04 PM
As I see it, the panic is directly related to the attempt to change the dataset encryption properties: spa_keystore_dsl_key_hold_dd() returned an EACCES error for one of the nested datasets, which for some reason is considered normal in other code paths but is treated as impossible in this one, triggering the assertion. I've tried to exercise the encryption settings in different ways, but so far I have been unable to reproduce it. Maybe it is somehow related to the previous replication of this dataset tree. From your description it seems you are replicating an already encrypted pool with a number of datasets inheriting encryption settings. But you also said that you set encryption keys on the replication task, which I don't think you should do in that case. If you can reproduce this issue, could you describe a precise, minimal reproduction scenario?
Alexander Motin January 11, 2022 at 7:02 PM (edited)
In the debug I see 4 identical panics, so just for reference the panic message in text form:
panic: VERIFY3(0 == spa_keystore_dsl_key_hold_dd(dp->dp_spa, dd, FTAG, &dck)) failed (0 == 13)
cpuid = 3
time = 1637900643
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0101268590
vpanic() at vpanic+0x17b/frame 0xfffffe01012685e0
spl_panic() at spl_panic+0x3a/frame 0xfffffe0101268640
spa_keystore_change_key_sync_impl() at spa_keystore_change_key_sync_impl+0x3de/frame 0xfffffe01012686b0
spa_keystore_change_key_sync_impl() at spa_keystore_change_key_sync_impl+0x139/frame 0xfffffe0101268720
spa_keystore_change_key_sync() at spa_keystore_change_key_sync+0x2ce/frame 0xfffffe01012687f0
dsl_sync_task_sync() at dsl_sync_task_sync+0xb4/frame 0xfffffe0101268820
dsl_pool_sync() at dsl_pool_sync+0x44b/frame 0xfffffe01012688a0
spa_sync() at spa_sync+0xa50/frame 0xfffffe0101268ae0
txg_sync_thread() at txg_sync_thread+0x413/frame 0xfffffe0101268bb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0101268bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0101268bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
The consistency of the panics, and the crash in a specific area of ZFS, suggest this is likely a ZFS encryption bug, possibly related to your attempt to change the keys, rather than a hardware failure.
Dxun November 28, 2021 at 3:30 PM
Added additional hints about the problem (and how I was able to work around it) as a comment here.
Summary
This is a spin-off from https://ixsystems.atlassian.net/browse/NAS-113477#icft=NAS-113477 - that ticket describes problems on a completely separate machine (let's call it S, for source) that is replicating its encrypted pool and all its datasets to the machine (let's call it T, for target) that I am reporting this issue on.
On T I encountered this kernel panic and a reboot, and I seem to have effectively lost the entire pool (the data appears to still be there but cannot be accessed).
Details
I was trying to understand some nuances of replicating an encrypted pool - in particular, I wanted to understand how I could apply deltas from a pool on S to a pool on T.
Here is what I wanted to test:
1) replicate S -> T (full replication - done)
2) make some changes on S (done)
3) manually create snapshots on S (recursively from the top dataset - done)
4) replicate S -> T again (deltas only - failed, see below)
5) observe T mirroring S (this is what I expected)
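For reference, the workflow above can be sketched with plain `zfs` commands. This is a hypothetical illustration, not the actual zettarepl invocations; the source pool name `tank`, the snapshot names, and the target path `storage/zroot` are assumptions for the sketch:

```shell
# Step 1: initial full replication; -R sends the whole dataset tree,
# -w sends raw (encrypted) streams so data stays encrypted in transit
zfs snapshot -r tank@base                               # on S
zfs send -Rw tank@base | ssh T zfs recv -u storage/zroot

# Steps 2-3: after making changes, take a new recursive snapshot on S
zfs snapshot -r tank@delta1

# Step 4: send only the increments between the two snapshots
zfs send -Rw -I tank@base tank@delta1 | ssh T zfs recv -u storage/zroot
```

With raw (`-w`) streams the target datasets keep the source's keys, which is relevant to the key confusion described below.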
I ran into a difficulty at step 4, because I had changed T (the target) so that all of its datasets inherit encryption from the "root dataset" (the dataset from which all my settings are inherited and which serves as the target of the replication task on S).
To my understanding, this effectively meant that I had to update the S -> T replication task with the new HEX encryption key - the key that encrypts the system dataset on T.
I did that, but ran into "permission denied" errors when executing the S -> T replication task. After many failed attempts, I decided to revert the encryption key on T to the encryption key of the system dataset on S (in the hope that I could then proceed with the S -> T replication without needing to configure extra encryption).
Here is how I tried to do it: on T, I exported the encryption key of the system dataset (the `storage` dataset below) and then tried to set that key as the encryption key on a child of the system dataset (the `zroot` dataset below). Here is what that looked like:
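The original screenshot of the commands did not survive; the following is a hypothetical reconstruction of the attempt, assuming the dataset names `storage` and `storage/zroot` mentioned above and assuming the key was applied via `zfs change-key`:

```shell
# On T: inspect the current encryption layout before changing anything
zfs get -r encryption,encryptionroot,keyformat,keylocation storage

# Attempt to re-key the child with the hex key exported from S.
# change-key prompts for the new key material; this kind of operation
# is what preceded the kernel panic roughly 15 seconds later.
zfs change-key -o keyformat=hex -o keylocation=prompt storage/zroot
```

Note that `zfs change-key` on a dataset that inherits its key makes that dataset a new encryption root, detaching its children's key inheritance from `storage` - which matches the symptom of the children becoming unreadable.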
To be clear, this is me trying to update the T (target) dataset with the encryption key from S (source). After about 15 seconds, there would be a crash and kernel panic (captured from the IPMI output - see below).
What also might be interesting is that the system alerts report no issues - not even an unscheduled reboot. This is a (surprising) departure from what I observed on NAS-113477.
Consequently, I am not sure how to attach any core dump. I am attaching the debug info, however.
I also haven't run memtest86 in a long time, but I have had this machine running for months, and it has performed at least 10-15 scrubs without errors.
In fact, the only strange behaviour (and random failures) I have experienced in the last several years is entirely tied to using zettarepl and encrypted pools.