Interrupted full replications are silently failing to resume

Description

When a replication task is configured for "Full filesystem replication" and a run is interrupted for any reason, causing a resume token to be saved on the remote machine, TrueNAS becomes unable to fully replicate the dataset.

Instead, every time the task runs, it replicates the child datasets only until it reaches the one that holds the resume token, and stops there. It then logs the error shown below, and the task completes as successful without further action.

[2021/03/08 17:00:01] INFO [Thread-75] [zettarepl.paramiko.replication_task__task_3] Connected (version 2.0, client OpenSSH_7.9)
[2021/03/08 17:00:02] INFO [Thread-75] [zettarepl.paramiko.replication_task__task_3] Authentication (publickey) successful!
[2021/03/08 17:00:16] INFO [replication_task__task_3] [zettarepl.replication.run] For replication task 'task_3': doing push from '<src>' to '<dst>' of snapshot='auto-hourly-20210308.1700' incremental_base='auto-hourly-20210308.1600' receive_resume_token=None encryption=False
[2021/03/08 17:02:30] WARNING [replication_task__task_3] [zettarepl.replication.run] For task 'task_3' at attempt 1 recoverable replication error RecoverableReplicationError('cannot receive incremental stream: destination <dst>/child contains partially-complete state from "zfs receive -s".\nwarning: cannot send \'<src>/other-child@auto-hourly-20210308.1700\': signal received')
[2021/03/08 17:02:30] INFO [replication_task__task_3] [zettarepl.replication.run] After recoverable error sleeping for 1 seconds
[2021/03/08 17:02:35] INFO [replication_task__task_3] [zettarepl.replication.run] No snapshots to send for replication task 'task_3' on dataset '<src>'

The previously interrupted child dataset, as well as any subsequent children in replication order, are no longer replicated from that point onwards.

The only indication in the UI that something is wrong is that the Last Snapshot field of the replication task remains stuck on the last snapshot sent before the interruption happened.
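
Since the UI gives so little feedback, one way to confirm the condition described above is to look for leftover resume tokens on the destination. The following is a minimal sketch, not part of TrueNAS or zettarepl: the helper name and the sample dataset names are hypothetical, and it assumes you feed it the tab-separated output of `zfs get -H -r -o name,value receive_resume_token <dst>`, where ZFS prints `-` for datasets that have no token.

```python
from typing import List


def datasets_with_resume_token(zfs_get_output: str) -> List[str]:
    """Return the datasets whose receive_resume_token property is set.

    Input is the raw output of:
        zfs get -H -r -o name,value receive_resume_token <dst>
    Each line is "<dataset>\t<value>"; a value of "-" means no token.
    """
    stuck = []
    for line in zfs_get_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, value = line.partition("\t")
        if value.strip() not in ("", "-"):
            stuck.append(name)
    return stuck


# Hypothetical sample output; dataset names and token are made up.
sample = (
    "tank/backup\t-\n"
    "tank/backup/child\t1-abc123-d0-789c\n"
    "tank/backup/other-child\t-\n"
)
print(datasets_with_resume_token(sample))
```

Any dataset this returns still holds partially received state; `zfs receive -A <dataset>` on the destination discards that state.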

Problem/Justification

None

Impact

None

Activity

David Durrleman
November 14, 2021 at 10:28 PM

After installing 12.0-U6, I confirm that I still observe the issue mentioned in my comment above: 'cannot receive incremental stream: most recent snapshot of <dst>/.system/syslog-<hash> does not\nmatch incremental source\n'

I suspect this happens, despite the fixes already implemented, when the replication fails midway and the second "recovery" attempt also fails (for example, because the network connection is temporarily down).

Next time the replication runs, a new snapshot has been taken, and the fix code no longer works (because it only looks at the most recent snapshot, which is now the new one).
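
To illustrate the point about only looking at the most recent snapshot, here is a minimal sketch of the alternative: walking back through the source snapshots until one is found that also exists on the destination. This is not the zettarepl code or the PR itself. The function name is hypothetical, and the `auto-hourly-20210308.1500` snapshot is an invented example next to the two names taken from the log.

```python
from typing import List, Optional


def latest_common_snapshot(
    src_snapshots: List[str], dst_snapshots: List[str]
) -> Optional[str]:
    """Return the newest source snapshot that also exists on the destination.

    src_snapshots must be ordered oldest to newest. Returns None when the
    two sides share no snapshot at all.
    """
    present = set(dst_snapshots)
    for name in reversed(src_snapshots):
        if name in present:
            return name
    return None


src = [
    "auto-hourly-20210308.1500",  # hypothetical older snapshot
    "auto-hourly-20210308.1600",
    "auto-hourly-20210308.1700",
]
# The destination child missed the 16:00 and 17:00 runs.
dst = ["auto-hourly-20210308.1500"]
print(latest_common_snapshot(src, dst))
```

A check that only compares against the newest source snapshot would miss this child once a newer snapshot exists, whereas the walk-back still finds a usable incremental base.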

I have opened a PR. If it looks good, I would appreciate it if you could include it in the next version of TrueNAS Core.

David Durrleman
September 15, 2021 at 5:42 PM

Indeed, I had only backported a single PR, so my bad.

I will calmly wait for 12.0-U6.

Vladimir Vinogradenko
September 15, 2021 at 11:16 AM

Did you backport a PR, or did you use the latest file from the truenas/12.0-stable branch? The issue you are talking about was fixed by https://jira.ixsystems.com/browse/NAS-111782 (which will also be present in 12.0-U6).

David Durrleman
September 15, 2021 at 10:55 AM

Yes, I know that. This is why I backported the fix (I manually applied the diff to /usr/local/lib/python3.9/site-packages/zettarepl/replication/run.py).

What I'm explaining here is that even when the fix becomes available, i.e. when 12.0-U6 is released, things will still be broken.

In simple terms, here is why:

When a "full" replication fails, it can leave two things on the remote dataset:
- Resume tokens
- Some child snapshots that were correctly replicated (let's say you have 3 children and the replication fails in the middle of replicating child2: the latest snapshot of child1 will be present on the remote dataset, whereas those of child2 and child3 won't)

The fix as of today only removes the resume token before trying to replicate.

zettarepl only supports replicating a full dataset where all the children share the same latest snapshot, hence it still fails.

The failures pasted above correctly show the new error message, which is different from the one I had before: "cannot receive incremental stream: most recent snapshot of <dst>/.system/syslog-<hash> does not\nmatch incremental source\n"
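
The child1/child2/child3 scenario can be sketched as a small check that reports which children on the destination are not at the snapshot the others have reached. This is only an illustration under assumptions: the function and the snapshot values are hypothetical, and it says nothing about how zettarepl tracks this internally.

```python
from typing import Dict, List, Optional


def lagging_children(
    dst_latest: Dict[str, Optional[str]], expected: str
) -> List[str]:
    """Return children whose latest destination snapshot is not `expected`.

    dst_latest maps each child dataset to its newest snapshot on the
    destination, or None if it has no snapshot there.
    """
    return sorted(
        child for child, snap in dst_latest.items() if snap != expected
    )


# Replication was interrupted while sending child2, so only child1 received
# the 17:00 snapshot.
state = {
    "child1": "auto-hourly-20210308.1700",
    "child2": "auto-hourly-20210308.1600",
    "child3": "auto-hourly-20210308.1600",
}
print(lagging_children(state, "auto-hourly-20210308.1700"))
```

Removing the resume token alone leaves `child2` and `child3` in this lagging state, which is the mismatch behind the "does not match incremental source" error.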

Vladimir Vinogradenko
September 15, 2021 at 9:52 AM

sorry, it will only be available in U6, not in 5.1

Complete

Details

Assignee: Vladimir Vinogradenko

Reporter: David Durrleman

Labels: None

Impact: High

Components: None

Fix versions: 12.0-U6, SCALE-21.08-BETA.1, SCALE-21.04-ALPHA.1

Affects versions: 12.0-U2.1

Priority: Low

Created March 8, 2021 at 4:08 PM

Updated July 1, 2022 at 5:13 PM

Resolved September 10, 2021 at 6:25 AM
