TrueNAS / NAS-109705

All replication hangs until system is rebooted after getting SSHException

    Details

      Description

      There are 6 replication tasks that run every hour, each offset by 10 minutes from the next. The offset is a workaround for a separate issue: running more than 2 replication jobs at once causes the system to panic.

      At some point a job will encounter the following exception:
      [2021/03/05 21:20:16] WARNING [retention] [zettarepl.zettarepl] Remote retention failed on <SSH Transport(rep-mad-file-01@192.168.16.50)>: error listing snapshots: SSHException('Timeout opening channel.')
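
To pin down when the hang starts, the attached zettarepl.log can be scanned for the first occurrence of this exception. The following is a minimal sketch, not part of zettarepl: the function name `find_channel_timeouts` and the regular expression are assumptions based only on the layout of the single log line quoted above, so other log lines may need a looser pattern.

```python
import re
from datetime import datetime

# Assumed line layout, inferred from the one log line in this report:
# [YYYY/MM/DD HH:MM:SS] LEVEL [thread] [logger] message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>\w+) "
    r"\[(?P<thread>[^\]]+)\] "
    r"\[(?P<logger>[^\]]+)\] "
    r"(?P<message>.*)$"
)

# Exact text of the exception as it appears in the report.
NEEDLE = "SSHException('Timeout opening channel.')"


def find_channel_timeouts(lines):
    """Return (timestamp, thread, message) for each line reporting the timeout."""
    hits = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # continuation lines and tracebacks do not match the layout
        if NEEDLE in match.group("message"):
            ts = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S")
            hits.append((ts, match.group("thread"), match.group("message")))
    return hits
```

Passing an open file handle for zettarepl.log to `find_channel_timeouts` yields the timestamps to compare against the point where jobs begin to report WAITING.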

      From that point on, ALL replication jobs are stuck in WAITING status, each claiming that the job is already running.

      The only way to clear this state is to reboot the system.

      As before, I checked the box to attach a debug, but I suspect the bug with running the debug still exists, so it probably won't be attached automatically.

        Attachments

        1. zettarepl.log (175 kB)
        2. summary.log (11 kB)
        3. debug_20210305_1428 (8.51 MB)

                People

                Assignee: Vladimir Vinogradenko (vladimirv)
                Reporter: Josh Wisely (joshwisely)
