replication does not work after upgrade to 12.0-U2.1

Description

Upgraded from 11.2-U8 to 12.0-U2.1 (with a stopover at 11.3-U5). This was a manual upgrade, since there was no option to upgrade to 12.0 directly from 11.2, or from 11.3 for that matter.

Replication to a 12.0-U2.1 system worked fine before the upgrade. After the upgrade, and after fixing the SSH connections, replication still does not work.

After turning on DEBUG logging in the Replication Task, I got some weird output in debug.log:

[2021/03/02 19:55:05] INFO [replication_task__task_3] [zettarepl.replication.run] For replication task 'task_3': doing push from 'vol' to 'vol' of snapshot='auto-20210302.1800-2d' incremental_base='auto-20210302.1040-2d' receive_resume_token=None encryption=False

[2021/03/02 19:55:05] DEBUG [replication_task__task_3] [zettarepl.transport.base_ssh.root@vis-backup.shell.95.async_exec.5028] Running ['zfs', 'umount', 'vol']

[2021/03/02 19:55:05] DEBUG [replication_task__task_3] [zettarepl.paramiko.replication_task__task_3] [chan 60] Max packet in: 32768 bytes

[2021/03/02 19:55:05] DEBUG [Thread-194] [zettarepl.paramiko.replication_task__task_3] [chan 60] Max packet out: 32768 bytes

[2021/03/02 19:55:05] DEBUG [Thread-194] [zettarepl.paramiko.replication_task__task_3] Secsh channel 60 opened.

[2021/03/02 19:55:05] DEBUG [Thread-194] [zettarepl.paramiko.replication_task__task_3] [chan 60] Sesch channel 60 request ok

[2021/03/02 19:55:05] DEBUG [replication_task__task_3] [zettarepl.transport.base_ssh.root@vis-backup.shell.95.async_exec.5028] Reading stdout

[2021/03/02 19:55:05] DEBUG [Thread-194] [zettarepl.paramiko.replication_task__task_3] [chan 60] EOF received (60)

[2021/03/02 19:55:05] DEBUG [Thread-194] [zettarepl.paramiko.replication_task__task_3] [chan 60] EOF sent (60)

[2021/03/02 19:55:05] DEBUG [replication_task__task_3] [zettarepl.transport.base_ssh.root@vis-backup.shell.95.async_exec.5028] Waiting for exit status

[2021/03/02 19:55:05] DEBUG [replication_task__task_3] [zettarepl.transport.base_ssh.root@vis-backup.shell.95.async_exec.5028] Error 1: "cannot unmount 'vol': not currently mounted\n"

[2021/03/02 19:55:05] DEBUG [replication_task__task_3] [zettarepl.transport.local.shell.1.async_exec.5029] Running ['zfs', 'send', '-V']

[2021/03/02 19:55:05] DEBUG [replication_task__task_3] [zettarepl.transport.local.shell.1.async_exec.5029] Error 2: 'missing snapshot argument\nusag.... list, run: zfs allow|unallow\n'

[2021/03/02 19:55:05] DEBUG [replication_task__task_3] [zettarepl.transport.local.shell.1.async_exec.5031] Running ['sh', '-c', 'exec 3>&1; eval $(exec 4>&1 >&....] && exit $pipestatus1; exit 0']

[2021/03/02 19:55:06] DEBUG [replication_task__task_3.async_exec_tee.wait] [zettarepl.transport.local.shell.1.async_exec.5031] Error 141: None

[2021/03/02 19:55:06] DEBUG [replication_task__task_3.process] [zettarepl.transport.local.shell.1.async_exec.5030] Error 141: 'No ECDSA host key is known for....Host key verification failed.\n'

[2021/03/02 19:55:06] DEBUG [replication_task__task_3.monitor] [zettarepl.transport.local.shell.1.async_exec.5031] Stopping

[2021/03/02 19:55:06] WARNING [replication_task__task_3] [zettarepl.replication.run] For task 'task_3' at attempt 1 recoverable replication error RecoverableReplicationError('Broken pipe (No ECDSA host key is known for vis-backup.an.intel.com and you have requested strict checking.\nHost key verification failed.\n)')

[2021/03/02 19:55:06] ERROR [replication_task__task_3] [zettarepl.replication.run] Failed replication task 'task_3' after 1 retries

[2021/03/02 19:55:06] DEBUG [Thread-194] [zettarepl.paramiko.replication_task__task_3] EOF in transport thread
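Most of the log above is paramiko channel chatter. As a rough aid for reading it, here is a minimal Python sketch that keeps only WARNING/ERROR lines and the "Error N:" results of executed commands. The line layout is inferred from the excerpt above, and the names LOG_LINE, SAMPLE, and interesting_lines are made up for illustration; this is not part of zettarepl.

```python
import re

# Layout inferred from the excerpt: [timestamp] LEVEL [thread] [logger] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+) "
    r"\[(?P<thread>[^\]]+)\] \[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)

# Three lines copied from the log above, used as sample input.
SAMPLE = "\n".join([
    "[2021/03/02 19:55:05] DEBUG [replication_task__task_3] "
    "[zettarepl.transport.local.shell.1.async_exec.5029] Running ['zfs', 'send', '-V']",
    "[2021/03/02 19:55:06] DEBUG [replication_task__task_3.async_exec_tee.wait] "
    "[zettarepl.transport.local.shell.1.async_exec.5031] Error 141: None",
    "[2021/03/02 19:55:06] ERROR [replication_task__task_3] "
    "[zettarepl.replication.run] Failed replication task 'task_3' after 1 retries",
])


def interesting_lines(log_text):
    """Return (timestamp, level, message) for lines worth reading first."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if not match:
            continue  # blank lines or wrapped fragments
        level = match.group("level")
        msg = match.group("msg")
        # Keep real warnings/errors plus command results logged at DEBUG level.
        if level in ("WARNING", "ERROR") or msg.startswith("Error "):
            hits.append((match.group("ts"), level, msg))
    return hits


if __name__ == "__main__":
    for ts, level, msg in interesting_lines(SAMPLE):
        print(ts, level, msg)
```

Applied to the full log, this should surface the host key verification failure without the channel open/close noise around it.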

Problem/Justification

None

Impact

None


Activity


George Kyriazis 
July 10, 2021 at 2:31 AM

Still exists in 12.0-U4.

Any updates on when it will be fixed?

Thanks!

 

George Kyriazis 
March 9, 2021 at 2:38 PM

Changing to /usr/local/bin/ssh does not change the behavior. ssh still works with no complaints about the host key.

David Pesticcio 
March 9, 2021 at 1:56 PM
(edited)

After reading the original conversation of this ticket more closely, I'd like to clarify a few things. (The UI was so slow, you've partially implemented one of my suggestions in the meantime!)  

 

SSH from "Replication Task" will fail when:

  1. The "remote host key" stored in the "SSH Connections" entry for <hostname> is not the key the replication task requires

 

SSH from the command line works when:

  1. Putting the private key from "SSH Keypairs" into ~/.ssh/test_private.key

  2. ssh -i ~/.ssh/test_private.key <hostname>

  3. Accept the host key

  4. ~/.ssh/known_hosts file gets updated with the correct key

  5. The ssh connection works just fine

The problem appears to be:

 

"SSH Connections" does not "discover" the appropriate host key that the "Replication Task" requires. 

  1. Hitting discover hundreds of times does NOT yield the correct key in a dependable/reliable/useful way.

  2. Inconsistency with how the "SSH Connection" host keys are discovered, and how they are used throughout TrueNAS
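To illustrate the suspected mismatch, here is a minimal Python sketch that compares the key type named in the replication error with the key type of a stored host key. It assumes the stored "remote host key" is a line of the form "<algorithm> <base64>" (optionally prefixed by a hostname); that format, the function names, and the placeholder key are assumptions for illustration, not TrueNAS internals.

```python
import re

# Maps public-key algorithm names to the family names OpenSSH uses
# in messages such as "No ECDSA host key is known for ...".
KEY_FAMILY = {
    "ssh-rsa": "RSA",
    "ecdsa-sha2-nistp256": "ECDSA",
    "ecdsa-sha2-nistp384": "ECDSA",
    "ecdsa-sha2-nistp521": "ECDSA",
    "ssh-ed25519": "ED25519",
}

MISSING_KEY = re.compile(r"No (\S+) host key is known for (\S+) and")


def required_key(error_text):
    """Return (key_family, hostname) named in the ssh error, or None."""
    match = MISSING_KEY.search(error_text)
    return (match.group(1), match.group(2)) if match else None


def stored_family(host_key_line):
    """Return the key family of a stored host key line, or None if unknown."""
    for field in host_key_line.split():
        if field in KEY_FAMILY:
            return KEY_FAMILY[field]
    return None


def is_mismatch(error_text, host_key_line):
    """True when ssh asked for one key family and a different one is stored."""
    wanted = required_key(error_text)
    have = stored_family(host_key_line)
    return wanted is not None and have is not None and wanted[0] != have


if __name__ == "__main__":
    # Error text taken from the log in the description.
    error = ("Broken pipe (No ECDSA host key is known for "
             "vis-backup.an.intel.com and you have requested strict checking.")
    # Placeholder key material, not a real key.
    stored = "ssh-ed25519 AAAAPLACEHOLDER"
    print(required_key(error), stored_family(stored), is_mismatch(error, stored))
```

If the stored key turns out to be of a different family than the one ssh asks for, that would be consistent with discovery and usage disagreeing on which key type to use.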

 

Possible Solution for "SSH Connections":

  1. Discover and display all keys, and let the user choose from the list

  2. Have TrueNAS choose a key based upon a pre-defined ordered list

  3. Fix the TrueNAS usage inconsistencies
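Suggestions 1 and 2 could look roughly like the Python sketch below. It assumes the discovered keys arrive as ssh-keyscan style lines ("<host> <algorithm> <base64>"). The preference order, the function names, and the key blobs are hypothetical and only illustrate the idea; this is not how TrueNAS currently does discovery.

```python
# Hypothetical preference order; the right order is a design decision.
DEFAULT_PREFERENCE = ("ssh-ed25519", "ecdsa-sha2-nistp256", "ssh-rsa")


def parse_keyscan(output):
    """Map algorithm name -> '<algorithm> <base64>' for every discovered key."""
    keys = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and banner comments
        parts = line.split()
        if len(parts) < 3:
            continue  # not a host/algorithm/key line
        algorithm, blob = parts[1], parts[2]
        # Keep the first key seen for each algorithm.
        keys.setdefault(algorithm, f"{algorithm} {blob}")
    return keys


def choose_host_key(output, preference=DEFAULT_PREFERENCE):
    """Pick one key by a fixed, ordered preference; None if nothing matches."""
    keys = parse_keyscan(output)
    for algorithm in preference:
        if algorithm in keys:
            return keys[algorithm]
    return None


if __name__ == "__main__":
    # Placeholder blobs, not real keys.
    scan = "\n".join([
        "# backup-host:22 SSH-2.0-OpenSSH",
        "backup-host ssh-rsa AAAARSAPLACEHOLDER",
        "backup-host ecdsa-sha2-nistp256 AAAAECDSAPLACEHOLDER",
    ])
    print(sorted(parse_keyscan(scan)))  # suggestion 1: show every key type
    print(choose_host_key(scan))        # suggestion 2: deterministic pick
```

With a fixed order, repeated discovery would return the same key every time instead of depending on whichever key the server happens to offer first.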

 

BTW: An added bonus bug - even though you can add a comment to the "remote host key" text-box, and save it, you will not be able to view the logs, or edit the Source/Destination in the UI for the corresponding "Replication Task" that uses that SSH key. (I'll check if that's already been reported, and create a ticket if not. )

https://jira.ixsystems.com/browse/NAS-109730

 

The error:

Bug Clerk 
March 9, 2021 at 1:48 PM

Bug Clerk 
March 9, 2021 at 1:15 PM

Complete


Created March 3, 2021 at 3:29 AM
Updated July 1, 2022 at 5:13 PM
Resolved March 9, 2021 at 1:57 PM