TrueNAS / NAS-109720

Interrupted full replications are silently failing to resume

    Details

    • Impact:
      High

      Description

      When a replication task is configured for "Full filesystem replication" and a run is interrupted for any reason, leaving a resume token saved on the remote machine, TrueNAS becomes unable to fully replicate the dataset.

      Instead, every time the task runs, it replicates the child datasets only up to the one holding the resume token, then stops. It logs the warning shown below, and the task is reported as successful without further action.

      [2021/03/08 17:00:01] INFO     [Thread-75] [zettarepl.paramiko.replication_task__task_3] Connected (version 2.0, client OpenSSH_7.9)
      [2021/03/08 17:00:02] INFO     [Thread-75] [zettarepl.paramiko.replication_task__task_3] Authentication (publickey) successful!
      [2021/03/08 17:00:16] INFO     [replication_task__task_3] [zettarepl.replication.run] For replication task 'task_3': doing push from '<src>' to '<dst>' of snapshot='auto-hourly-20210308.1700' incremental_base='auto-hourly-20210308.1600' receive_resume_token=None encryption=False
      [2021/03/08 17:02:30] WARNING  [replication_task__task_3] [zettarepl.replication.run] For task 'task_3' at attempt 1 recoverable replication error RecoverableReplicationError('cannot receive incremental stream: destination <dst>/child contains partially-complete state from "zfs receive -s".\nwarning: cannot send \'<src>/other-child@auto-hourly-20210308.1700\': signal received')
      [2021/03/08 17:02:30] INFO     [replication_task__task_3] [zettarepl.replication.run] After recoverable error sleeping for 1 seconds
      [2021/03/08 17:02:35] INFO     [replication_task__task_3] [zettarepl.replication.run] No snapshots to send for replication task 'task_3' on dataset '<src>'
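
Because the task still ends as successful, the warning line is the only place where the stuck dataset is named. The following is a minimal sketch, not part of zettarepl, that scans log lines in the format shown above and collects the task and destination dataset from each "partially-complete state" warning. The function name and the regular expression are assumptions based only on this log excerpt.

```python
import re

# Pattern derived from the WARNING line in this report; other zettarepl
# versions may word the message differently (assumption).
PARTIAL_STATE = re.compile(
    r"For task '(?P<task>[^']+)' at attempt \d+ recoverable replication error .*?"
    r"destination (?P<dataset>\S+) contains partially-complete state"
)


def find_stuck_datasets(log_lines):
    """Return sorted, de-duplicated (task, destination dataset) pairs that
    were reported as holding partially received state."""
    found = set()
    for line in log_lines:
        match = PARTIAL_STATE.search(line)
        if match:
            found.add((match.group("task"), match.group("dataset")))
    return sorted(found)
```

A non-empty result for a task whose last run was reported as successful suggests the situation described in this issue.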

      From that point onwards, the previously interrupted child dataset and all subsequent children in replication order are no longer replicated.

      The only indication in the UI that something is wrong is that the Last Snapshot field of the replication task remains stuck on the last snapshot replicated before the interruption.
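
Since the UI gives no direct signal, one way to confirm the condition is to look for saved resume tokens on the destination. The sketch below is a diagnostic aid written for this report, not something TrueNAS ships. It assumes it runs on the remote machine with permission to call `zfs`, and it relies on `zfs get -H` printing tab-separated fields with `-` for an unset `receive_resume_token`. The function names are hypothetical.

```python
import subprocess


def parse_resume_tokens(zfs_get_output):
    """Map dataset name -> resume token from `zfs get -H -o name,value` output.

    Datasets whose value is "-" (no saved token) are skipped.
    """
    tokens = {}
    for line in zfs_get_output.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("\t")
        value = value.strip()
        if value and value != "-":
            tokens[name] = value
    return tokens


def list_resume_tokens(dataset):
    """Query a destination dataset and its children for saved resume tokens."""
    result = subprocess.run(
        ["zfs", "get", "-H", "-r", "-t", "filesystem,volume",
         "-o", "name,value", "receive_resume_token", dataset],
        check=True, capture_output=True, text=True,
    )
    return parse_resume_tokens(result.stdout)
```

Any dataset returned by `list_resume_tokens("<dst>")` still holds partially received state. `zfs receive -A <dataset>` discards that state; whether to discard it or to resume the transfer is left to the administrator.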

       

                People

                Assignee:
                vladimirv Vladimir Vinogradenko
                Reporter:
                dualmoo David Durrleman