Interrupted full replications are silently failing to resume

Description

When a replication task is configured for "Full filesystem replication" and a run is interrupted for any reason, causing a resume token to be saved on the remote machine, TrueNAS becomes unable to fully replicate the dataset.

Instead, every time the task runs, it replicates the child datasets only up to the one holding the resume token, then stops. It emits the error message below in the log, and the task is reported as successful without further action.

[2021/03/08 17:00:01] INFO [Thread-75] [zettarepl.paramiko.replication_task__task_3] Connected (version 2.0, client OpenSSH_7.9)
[2021/03/08 17:00:02] INFO [Thread-75] [zettarepl.paramiko.replication_task__task_3] Authentication (publickey) successful!
[2021/03/08 17:00:16] INFO [replication_task__task_3] [zettarepl.replication.run] For replication task 'task_3': doing push from '<src>' to '<dst>' of snapshot='auto-hourly-20210308.1700' incremental_base='auto-hourly-20210308.1600' receive_resume_token=None encryption=False
[2021/03/08 17:02:30] WARNING [replication_task__task_3] [zettarepl.replication.run] For task 'task_3' at attempt 1 recoverable replication error RecoverableReplicationError('cannot receive incremental stream: destination <dst>/child contains partially-complete state from "zfs receive -s".\nwarning: cannot send \'<src>/other-child@auto-hourly-20210308.1700\': signal received')
[2021/03/08 17:02:30] INFO [replication_task__task_3] [zettarepl.replication.run] After recoverable error sleeping for 1 seconds
[2021/03/08 17:02:35] INFO [replication_task__task_3] [zettarepl.replication.run] No snapshots to send for replication task 'task_3' on dataset '<src>'

From that point onwards, the previously interrupted child dataset and all subsequent children in replication order are no longer replicated.

The only indication in the UI that something is wrong is that the Last Snapshot field of the replication task remains stuck on the last snapshot taken before the interruption.
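
To check which destination datasets are affected, a minimal Python sketch follows. It assumes the output comes from running `zfs get -H -r -o name,value receive_resume_token <dst>` on the remote machine (tab-separated, with `-` meaning no token). The dataset names and the token string in the sample are placeholders, and the helper name is hypothetical, not part of zettarepl.

```python
from typing import Dict


def datasets_with_resume_token(zfs_get_output: str) -> Dict[str, str]:
    """Return {dataset: token} for every dataset that holds a resume token.

    Expects tab-separated "name<TAB>value" lines, as printed by
    `zfs get -H -o name,value receive_resume_token` (assumed input).
    """
    tokens = {}
    for line in zfs_get_output.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("\t")
        value = value.strip()
        # "-" is how zfs prints a property that has no value.
        if value and value != "-":
            tokens[name] = value
    return tokens


# Placeholder sample: only the interrupted child carries a token.
sample = (
    "backup/data\t-\n"
    "backup/data/child\t1-abc-def\n"
    "backup/data/other-child\t-\n"
)
stuck = datasets_with_resume_token(sample)
print(stuck)
```

Any dataset reported here is in the partially-complete "zfs receive -s" state that the error message in the log refers to.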

Problem/Justification

None

Impact

None

Activity

David Durrleman 
November 14, 2021 at 10:28 PM

After installing 12.0-U6, I confirm that I still observe the issue mentioned in my comment above: 'cannot receive incremental stream: most recent snapshot of <dst>/.system/syslog-<hash> does not\nmatch incremental source\n'

I suspect this still happens, despite the fixes already implemented, when the replication fails midway and the second "recovery" attempt also fails (for example, because the network connection is temporarily down).

Next time the replication runs, a new snapshot has been taken, and the fix code no longer works (because it only looks at the most recent snapshot, which is now the new one).
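
To illustrate the idea, here is a rough Python sketch of choosing an incremental base by looking for the most recent snapshot that both sides have, rather than only at the newest one. This is an assumption about one possible approach, not the actual zettarepl code and not necessarily what the PR does. The function name is hypothetical, and the snapshot names follow the pattern from the log in the description.

```python
from typing import List, Optional


def most_recent_common_snapshot(src: List[str], dst: List[str]) -> Optional[str]:
    """Return the newest source snapshot that also exists on the destination.

    `src` must be ordered oldest to newest. Returns None when the two
    sides share no snapshot at all.
    """
    dst_set = set(dst)
    for name in reversed(src):
        if name in dst_set:
            return name
    return None


src = [
    "auto-hourly-20210308.1500",
    "auto-hourly-20210308.1600",
    "auto-hourly-20210308.1700",
    "auto-hourly-20210308.1800",  # taken after the failed run
]
# The destination child missed 1700 because the replication was interrupted.
dst = ["auto-hourly-20210308.1500", "auto-hourly-20210308.1600"]

base = most_recent_common_snapshot(src, dst)
print(base)
```

In this example the usable base is still the 1600 snapshot, even though newer snapshots now exist on the source.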

I have opened a PR. If it looks good, I would appreciate it if you could include it in the next version of TrueNAS Core.
 

David Durrleman 
September 15, 2021 at 5:42 PM

Indeed, I had only backported a single PR, so my bad.

I will calmly wait for 12.0-U6.

Vladimir Vinogradenko 
September 15, 2021 at 11:16 AM

Did you backport a PR, or did you use the latest file from the truenas/12.0-stable branch? The issue you are talking about was fixed by https://jira.ixsystems.com/browse/NAS-111782 (which will also be present in 12.0-U6).

David Durrleman 
September 15, 2021 at 10:55 AM

Yes, I know that. This is why I backported the fix (I manually applied the diff to /usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py).

What I'm explaining here is that even when the fix becomes available, i.e. when 12.0-U6 is released, things will still be broken.

In simple terms, here is why:

  • When a "full" replication fails, it can leave two things on the remote data set

    • Resume tokens

    • Some child snapshots that were correctly replicated (say you have 3 children and the replication fails in the middle of replicating child2: the latest snapshot of child1 will be present on the remote dataset, whereas those of child2 and child3 won't)

  • The fix as of today only removes the resume token before trying to replicate

  • zettarepl only supports replicating a full dataset where all the children share the same latest snapshot, hence it still fails

  • The failures pasted above correctly show the new error message, which is different from the one I had before: "cannot receive incremental stream: most recent snapshot of <dst>/.system/syslog-<hash> does not\nmatch incremental source\n"
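
The inconsistent state described in the list above can be sketched in Python as follows. This is only an illustration under assumed data: the child names match the child1/child2/child3 example, the snapshot names follow the pattern from the log, and the helper is hypothetical rather than anything zettarepl provides.

```python
from typing import Dict, List, Optional


def group_children_by_latest(
    dst_snapshots: Dict[str, List[str]],
) -> Dict[Optional[str], List[str]]:
    """Group destination children by their latest snapshot.

    Each snapshot list must be ordered oldest to newest. A child with
    no snapshots is grouped under None.
    """
    groups: Dict[Optional[str], List[str]] = {}
    for child, snaps in dst_snapshots.items():
        latest = snaps[-1] if snaps else None
        groups.setdefault(latest, []).append(child)
    return groups


# Replication failed while sending child2: only child1 got the 1700 snapshot.
state = {
    "child1": ["auto-hourly-20210308.1600", "auto-hourly-20210308.1700"],
    "child2": ["auto-hourly-20210308.1600"],
    "child3": ["auto-hourly-20210308.1600"],
}

groups = group_children_by_latest(state)
consistent = len(groups) == 1  # True only if every child shares one latest snapshot
print(groups, consistent)
```

More than one group means the children have diverged, which is the situation where removing the resume token alone is not enough.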

Vladimir Vinogradenko 
September 15, 2021 at 9:52 AM

Sorry, it will only be available in U6, not in 5.1.

Complete

Created March 8, 2021 at 4:08 PM
Updated July 1, 2022 at 5:13 PM
Resolved September 10, 2021 at 6:25 AM