TrueNAS / NAS-107551

Replication Task retry counter has no effect

    Details

    • Impact: Medium

      Description

      A Replication Task has a field called "Number of retries for failed replications" with a default value of 5. This value appears to have no effect.

       

      I recently experienced many interruptions in a long-running replication task. Frustrated that the job gave up so quickly, I increased this value from 5 first to 500 and then to 50000. Scanning `/var/log/zettarepl.log` shows that a single error is enough to kill the job. Even the default value of 5 is ignored, never mind the larger ones.
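
      To check the log for this pattern, a small script along the following lines can count ERROR lines per replication thread. It is only a sketch: the regular expression is inferred from the two log lines quoted in this report, and the names `LOG_LINE` and `count_errors_per_thread` are made up for illustration and are not part of zettarepl.

```python
import re
from collections import Counter

# Pattern inferred from lines such as:
# [2020/09/13 21:00:30] ERROR    [replication_task__task_6] [zettarepl.replication.run] ...
# This is an assumption about the log layout, not a documented format.
LOG_LINE = re.compile(
    r"^\s*\[(?P<ts>[\d/]+ [\d:]+)\]\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s+\[(?P<logger>[^\]]+)\]\s+(?P<msg>.*)$"
)


def count_errors_per_thread(lines):
    """Return a Counter mapping thread name to the number of ERROR lines.

    Lines that do not match the pattern (for example traceback lines)
    are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("thread")] += 1
    return counts
```

      Passing an open file handle for `/var/log/zettarepl.log` as `lines` gives the per-task totals. If the retry setting were honored, a task that gave up would be expected to show more than one error per run.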

       

      One example of an error that caused an immediate disconnection is described in https://jira.ixsystems.com/browse/NAS-107550. Another is this SSH error:

      [2020/09/13 21:00:30] INFO     [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing pull from 'tank/Pictures' to 'tank/blahblahblah/Pictures' of snapshot='auto-2020-09-12_00-00' incremental_base='auto-2020-09-11_00-00' receive_resume_token=None
      [2020/09/13 21:00:30] ERROR    [replication_task__task_6] [zettarepl.replication.run] For task 'task_6' unhandled replication error SSHException('SSH session not active')
      Traceback (most recent call last):
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 143, in run_replication_tasks
          run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 204, in run_replication_task_part
          run_replication_steps(step_templates, observer)
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 389, in run_replication_steps
          replicate_snapshots(step_template, incremental_base, snapshots, observer)
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 444, in replicate_snapshots
          run_replication_step(step_template.instantiate(incremental_base=incremental_base, snapshot=snapshot), observer)
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 500, in run_replication_step
          ReplicationProcessRunner(process, monitor).run()
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 22, in run
          self.replication_process.run()
        File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh.py", line 64, in run
          self.report_progress = self._zfs_send_can_report_progress()
        File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh.py", line 160, in _zfs_send_can_report_progress
          send_shell.exec(["zfs", "send", "-V"])
        File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/interface.py", line 83, in exec
          return self.exec_async(args, encoding, stdout).wait()
        File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/interface.py", line 87, in exec_async
          async_exec.run()
        File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/base_ssh.py", line 28, in run
          "sh -c " + shlex.quote(" ".join([shlex.quote(arg) for arg in self.args]) + " 2>&1"), timeout=10)
        File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 508, in exec_command
          chan = self._transport.open_session(timeout=timeout)
        File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 879, in open_session
          timeout=timeout,
        File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 969, in open_channel
          raise SSHException("SSH session not active")
      paramiko.ssh_exception.SSHException: SSH session not active

      This error appeared when a snapshot completed. The replication task then stopped immediately instead of replicating the next snapshot. Manually starting the job in the UI let it run to completion.

      If it matters, this particular task uses the PULL direction.
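
      For illustration, the behavior expected from the retry setting is roughly the following. This is a minimal sketch and not zettarepl's actual code: `RecoverableError`, `run_with_retries`, and the `delay` parameter are hypothetical names, and it assumes that "number of retries" means additional attempts after the first failure.

```python
import time


class RecoverableError(Exception):
    """Hypothetical stand-in for a transient failure, e.g. a dropped SSH session."""


def run_with_retries(step, retries=5, delay=1.0, sleep=time.sleep):
    """Call step() until it succeeds or the retry budget is used up.

    `retries` is the number of additional attempts allowed after the
    first failure (an assumption about what the UI field means).
    """
    attempt = 0
    while True:
        try:
            return step()
        except RecoverableError:
            if attempt >= retries:
                # Budget exhausted: surface the last error to the caller.
                raise
            attempt += 1
            sleep(delay)
```

      With a budget of 5, a single "SSH session not active" error would consume one retry and the task would move on to the next snapshot instead of stopping.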

      I marked the impact as Medium because this makes replications unreliable. A daily replication that needs a significant portion of the day to complete can build up a significant backlog if, after a failure, it sits idle for the rest of the day waiting for its next scheduled run. Today, avoiding this requires manual babysitting.

                People

                Assignee: releng Triage Team
                Reporter: Adam Lugowski (alugowski)
                Watchers: Adam Lugowski, Bonnie Follweiler, Joe Maloney
