Details
- Type: Bug
- Status: Engineering Closed
- Priority: Low
- Resolution: Cannot Reproduce
- Affects Version/s: 11.3-U3.2
- Fix Version/s: N/A
- Component/s: Tasks
- Labels:
- Impact: Medium
Description
A Replication Task has a field called "Number of retries for failed replications" with a default value of 5. This value appears to have no effect.
I recently experienced many interruptions in a long-running replication task. Frustrated that the job gave up so quickly, I raised the value from 5 to 500 and then to 50000. Scanning `/var/log/zettarepl.log` shows that a single error is enough to kill the job: even the default of 5 retries is ignored, never mind the larger values.
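A minimal sketch of how such a scan could be automated, assuming every log record starts with the `[timestamp] LEVEL [thread] [logger] message` prefix seen in the excerpt in this report. The function names are illustrative and not part of zettarepl.

```python
import re
from collections import Counter

# Assumed record prefix, modelled on the excerpt in this report:
# [YYYY/MM/DD HH:MM:SS] LEVEL [thread] [logger] message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<thread>[^\]]+)\] "
    r"\[(?P<logger>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def count_errors_by_thread(lines):
    """Count ERROR records per thread name (e.g. 'replication_task__task_6').

    Lines that do not match the prefix, such as traceback lines, are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("thread")] += 1
    return counts


def count_errors_in_file(path):
    """Convenience wrapper (hypothetical helper) that reads a log file from disk."""
    with open(path, errors="replace") as handle:
        return count_errors_by_thread(handle)
```

Calling `count_errors_in_file("/var/log/zettarepl.log")` returns a per-task error count. A task that stops after logging a single ERROR record, with no further attempts logged, is consistent with the retry setting having no effect.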
One example error that caused an immediate disconnection is described in https://jira.ixsystems.com/browse/NAS-107550. Another is this SSH error:
[2020/09/13 21:00:30] INFO [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing pull from 'tank/Pictures' to 'tank/blahblahblah/Pictures' of snapshot='auto-2020-09-12_00-00' incremental_base='auto-2020-09-11_00-00' receive_resume_token=None
[2020/09/13 21:00:30] ERROR [replication_task__task_6] [zettarepl.replication.run] For task 'task_6' unhandled replication error SSHException('SSH session not active')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 143, in run_replication_tasks
    run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 204, in run_replication_task_part
    run_replication_steps(step_templates, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 389, in run_replication_steps
    replicate_snapshots(step_template, incremental_base, snapshots, observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 444, in replicate_snapshots
    run_replication_step(step_template.instantiate(incremental_base=incremental_base, snapshot=snapshot), observer)
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 500, in run_replication_step
    ReplicationProcessRunner(process, monitor).run()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 22, in run
    self.replication_process.run()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh.py", line 64, in run
    self.report_progress = self._zfs_send_can_report_progress()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh.py", line 160, in _zfs_send_can_report_progress
    send_shell.exec(["zfs", "send", "-V"])
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/interface.py", line 83, in exec
    return self.exec_async(args, encoding, stdout).wait()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/interface.py", line 87, in exec_async
    async_exec.run()
  File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/base_ssh.py", line 28, in run
    "sh -c " + shlex.quote(" ".join([shlex.quote(arg) for arg in self.args]) + " 2>&1"), timeout=10)
  File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 508, in exec_command
    chan = self._transport.open_session(timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 879, in open_session
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 969, in open_channel
    raise SSHException("SSH session not active")
paramiko.ssh_exception.SSHException: SSH session not active
This error appeared when a snapshot completed. The replication task then stopped immediately instead of replicating the next snapshot. Manually starting the job in the UI let it run to completion.
If it matters, this particular task uses the PULL direction.
I marked the impact as Medium because this makes replications unreliable. A daily replication that takes a significant portion of the day to complete can build up a significant backlog if it sits idle for the rest of the day waiting for its next scheduled run. Today, avoiding this requires manual babysitting.
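For reference, the behavior expected from "Number of retries for failed replications" can be sketched as a bounded retry loop. This is an illustration of the expected semantics only, not zettarepl's actual implementation. `TransientError` and `run_with_retries` are hypothetical names, and the assumption that a value of N allows N further attempts after the first failure is a guess at the field's intent.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a recoverable failure, e.g. a dropped SSH session."""


def run_with_retries(step, retries, delay=0.0, sleep=time.sleep):
    """Call step() until it succeeds, retrying at most `retries` times.

    Assumed semantics: retries=5 permits up to 6 attempts in total.
    The last TransientError is re-raised once the budget is exhausted.
    """
    attempt = 0
    while True:
        try:
            return step()
        except TransientError:
            if attempt >= retries:
                raise  # budget exhausted: give up and surface the error
            attempt += 1
            sleep(delay)  # optional pause before the next attempt
```

Under these semantics, a single "SSH session not active" error would consume one retry and the task would continue with the next attempt instead of stopping until the next scheduled run.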