Uploaded image for project: 'TrueNAS'
  1. TrueNAS
  2. NAS-107550

ambiguous replication error requires clarification in UI

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Engineering Closed (View Workflow)
    • Priority: Low
    • Resolution: Cannot Reproduce
    • Affects Version/s: 11.3-U3.2
    • Fix Version/s: N/A
    • Component/s: None
    • Impact:
      Low

      Description

      Background:

      I have a PULL replication task between two home FreeNAS 11.3 systems. Recently a change required 22 hours of transfer time. There happened to be many connection interruptions on the network that caused the job to fail several times.


      Observed Error:

      The Replication Task is marked as FAILED and the alert system reports:

          CRITICAL
          Replication "replication pull" failed: client_loop: send disconnect: Broken pipe cannot receive resume stream: checksum mismatch or incomplete stream. Partially received snapshot is saved. A resuming stream can be generated on the sending system by running: zfs send -t 1-1257942c96-138-789c636064000310a501c49c50360710a715e5e7a69766a63040c1e3c7ab53396c84f415806c762475f94959a9c9250c0c9b573183d561c8a7a515a79680641ce0f26c48f2499525a9c520f1687f63162cfa4bf221ae68d328be6c397b6bac03923c27583e2f3137154827e665eb076426979416a5163b249696e4eb1a191819e81a58ea1a1ac41b00190670fb791910fece492c4a4f4dcac94fcececf86c97323c927e7e716004d2c86c8420000f0692ecb..
          Sat, 12 Sep 2020 12:53:32 PM (America/Chicago)
          Dismiss

      This error message is problematic:

          The message implies that the user should run the `zfs send -t` command on the source system to resume the transfer. This instruction is both unclear ("then what?") and unnecessary. Simply restarting the replication task will resume the transfer in that way.
          The message is incomplete. The long string passed to `zfs send -t` is truncated and hence useless. The full string can only be found in `/var/log/zettarepl.log`, and the UI doesn't guide the user there at all. That message is below.

      Solution:

          The error message should be changed to help the user understand what to do. It should just say to restart the job.
          Even better, the job should retry on its own. It is frustrating to notice that the network has been idle for hours when it could have used that time to catch up. It means that the system has to be babysat. Increasing the retry count appears to have no effect. I tried increasing the default value of 5 to 500, 5000, and 50000 but the job would still error out easily. `grep "Partially received snapshot is saved." /var/log/zettarepl.log | wc -l` returns 16. That is roughly the number of times I had to manually run the job. The retry counter clearly has no effect.

      The full log message in `/var/log/zettarepl.log` is:

      [2020/09/11 00:58:32] ERROR    [replication_task__task_6] [zettarepl.replication.run] For task 'task_6' unhandled replication error ExecException(255, 'send from @auto-2020-09-09_00-00 to tank/Pictures@auto-2020-09-10_00-00 estimated size is 20.6G\nsend from @auto-2020-09-09_00-00 to tank/Pictures@auto-2020-09-10_00-00 estimated size is 20.6G\nclient_loop: send disconnect: Broken pipe\ncannot receive incremental stream: checksum mismatch or incomplete stream.\nPartially received snapshot is saved.\nA resuming stream can be generated on the sending system by running:\n    zfs send -t 1-15e466db44-138-789c636064000310a501c49c50360710a715e5e7a69766a63040c1e3c7ab53396c84f415806c762475f94959a9c9250c0c3db399c1ea30e4d3d28a534b603210920d493ea9b224b518482788576962d35f920f71459b46f165cbd95b631d90e439c1f27989b9a9403a312f5b3f2033b9a4b428b5d821b1b4245fd7c8c0c840d7c052d7d020de00c83080dbcfcb80f0774e62517a6a524e7e72767e364c9e1b493e393fb70068623144160200b6582e8e\n')
      Traceback (most recent call last):
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 143, in run_replication_tasks
          run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 204, in run_replication_task_part
          run_replication_steps(step_templates, observer)
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 389, in run_replication_steps
          replicate_snapshots(step_template, incremental_base, snapshots, observer)
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 444, in replicate_snapshots
          run_replication_step(step_template.instantiate(incremental_base=incremental_base, snapshot=snapshot), observer)
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/run.py", line 500, in run_replication_step
          ReplicationProcessRunner(process, monitor).run()
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 33, in run
          raise self.process_exception
        File "/usr/local/lib/python3.7/site-packages/zettarepl/replication/process_runner.py", line 37, in _wait_process
          self.replication_process.wait()
        File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/ssh.py", line 146, in wait
          self.async_exec.wait()
        File "/usr/local/lib/python3.7/site-packages/zettarepl/transport/async_exec_tee.py", line 100, in wait
          raise ExecException(exit_event.returncode, self.output)
      zettarepl.transport.interface.ExecException: send from @auto-2020-09-09_00-00 to tank/Pictures@auto-2020-09-10_00-00 estimated size is 20.6G
      send from @auto-2020-09-09_00-00 to tank/Pictures@auto-2020-09-10_00-00 estimated size is 20.6G
      client_loop: send disconnect: Broken pipe
      cannot receive incremental stream: checksum mismatch or incomplete stream.
      Partially received snapshot is saved.
      A resuming stream can be generated on the sending system by running:
          zfs send -t 1-15e466db44-138-789c636064000310a501c49c50360710a715e5e7a69766a63040c1e3c7ab53396c84f415806c762475f94959a9c9250c0c3db399c1ea30e4d3d28a534b603210920d493ea9b224b518482788576962d35f920f71459b46f165cbd95b631d90e439c1f27989b9a9403a312f5b3f2033b9a4b428b5d821b1b4245fd7c8c0c840d7c052d7d020de00c83080dbcfcb80f0774e62517a6a524e7e72767e364c9e1b493e393fb70068623144160200b6582e8e

      Message with automatic resume, note the `receive_resume_token`:

      [2020/09/13 15:55:52] INFO     [replication_task__task_6] [zettarepl.replication.run] For replication task 'task_6': doing pull from 'tank/Pictures' to 'tank/blahblahblah/Pict
      ures' of snapshot=None incremental_base=None receive_resume_token='1-123d227ae8-138-789c636064000....cecf2d009a580c91850000fb05303b'

        Attachments

          Attachments

            JEditor

              Activity

                People

                Assignee:
                releng Triage Team
                Reporter:
                alugowski Adam Lugowski
                Watchers:
                Adam Lugowski, Bonnie Follweiler, Joe Maloney
                Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                  Dates

                  Created:
                  Updated:
                  Resolved: