Use a prefix to identify/cleanup jetlag managed nmcli connections #719
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has not yet been approved; it needs approval from an approver for each of the affected files. Approvers can indicate their approval by writing /approve in a comment. The full list of commands accepted by this bot can be found here.
akrzos left a comment:
Seems reasonable to me, I'll try and give this a run in my environment next time I rebuild.
```yaml
- name: Set jetlag connection prefix
  set_fact:
    jetlag_conn_prefix: "jetlag-"
```
Should we define this in a vars file?
Good idea, thanks! I added it to bastion-network/defaults/main/networks.yml, which seems to be a symlink to create-inventory/defaults/main/networks.yml. Please let me know if I should create a separate file.
Just add it to networks.yml under create-inventory. We tried to put the "base" var files (e.g. dns, networks, storage) under create-inventory defaults/main since that was the first role that should be run aside from validate-vars. Symlinking was done to avoid duplicating the files everywhere. The original jetlag was made such that you could comment out a role if your deployment failed on a specific step. Hope that helps explain how and where these vars files are symlinked.
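For reference, a minimal sketch of what the default could look like in create-inventory/defaults/main/networks.yml; only the jetlag_conn_prefix value comes from this thread, the rest of the file is omitted:

```yaml
# create-inventory/defaults/main/networks.yml
# (symlinked from bastion-network/defaults/main/networks.yml)

# Prefix used to mark NetworkManager connections created by jetlag so they
# can be identified and cleaned up on subsequent runs.
jetlag_conn_prefix: "jetlag-"
```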
Force-pushed 41f6060 to 33ad1c3 (compare)
Force-pushed ec0edc2 to 307e901 (compare)
Force-pushed 307e901 to 04d7da8 (compare)
Force-pushed 04d7da8 to e1c29c1 (compare)
```yaml
- name: Remove existing jetlag-managed NetworkManager connections
  nmcli:
    conn_name: "{{ item }}"
    state: absent
  loop: "{{ jetlag_connections }}"
  when: jetlag_connections | length > 0
```
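For context, the jetlag_connections list referenced above could be built from the prefix roughly as sketched below; this is an illustrative assumption, not necessarily how the PR implements it:

```yaml
# Sketch only: collect the names of NetworkManager connections that carry
# the jetlag prefix, so only jetlag-managed connections are removed.
- name: List NetworkManager connection names
  command: nmcli -g NAME connection show
  register: nm_conn_names
  changed_when: false

- name: Build the list of jetlag-managed connections
  set_fact:
    jetlag_connections: "{{ nm_conn_names.stdout_lines | select('match', '^' ~ jetlag_conn_prefix) | list }}"
```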
So the problem with this is it will remove the connection that the bastion is likely using for ssh (when you run on the bastion yourself). While that might not affect CI, where the playbook runs from elsewhere, it will affect most other setups.
I expect the bastion to use the bastion_lab_interface for SSH connections. This interface isn't removed in this case as we only mark the bastion_controlplane_interface and the bond connections with the prefix.
That just is not what occurs though: in dns we make the bastion resolve to the address assigned to bastion_controlplane_interface, so future ssh connections go to this address and a rerun will encounter this issue. I actually tested this already and confirmed it.
Gotcha, but in case of a rerun the bastion_controlplane_interface connection would be restored in the Setup bastion on control-plane network tasks, so I'd expect the future ssh connection to use that.
One thing that comes to mind: when running these tasks for the first time on an already configured system, we'll need to do a manual cleanup of the existing bastion_controlplane_interface connection, which doesn't use the prefix.
Yeah, we probably need to come up with a cleaner method for all of this network configuration, because for single-interface there is already a cleanup of prior configuration, but only on the selected interface. IIRC this was needed for initial allocations so that we could properly configure the interface to begin with. Where it became messy was when folks selected the wrong bastion controlplane interface and then changed it mid-allocation.
I tried to consolidate the cleanup tasks, so I included the connections for all non-lab interfaces in the cleanup at the beginning of the play. Please let me know what you think and whether it's too aggressive.
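For illustration, a rough sketch of what such a consolidated cleanup could look like; this is an assumption based on the discussion, not the PR's actual tasks, and only bastion_lab_interface is a known variable here:

```yaml
# Sketch only: remove connections bound to any interface other than the lab
# interface (and loopback) so the controlplane interfaces can be reconfigured
# from scratch. Connection names containing ':' would need unescaping.
- name: List NetworkManager connections with their devices
  command: nmcli -g NAME,DEVICE connection show
  register: nm_conns
  changed_when: false

- name: Remove NetworkManager connections for clean reconfiguration
  nmcli:
    conn_name: "{{ item.split(':')[0] }}"
    state: absent
  loop: "{{ nm_conns.stdout_lines }}"
  when:
    - item.split(':')[1] | length > 0
    - item.split(':')[1] not in [bastion_lab_interface, 'lo']
```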
This still gets stuck when I run it directly from a bastion machine:
```
TASK [bastion-network : Remove NetworkManager connections for clean reconfiguration] *********************************************************************************************************
Friday 24 October 2025 12:48:15 +0000 (0:00:00.040)       0:00:43.040 ********
(Stuck here)
```

nmcli before stuck:

```
# nmcli c
NAME                UUID                                  TYPE      DEVICE
eno12399np0         1c6ae39a-76b3-4abb-985e-bcadd512b69e  ethernet  eno12399np0
ens1f0              f0188f8b-0f5c-44a7-87f6-2ed141f6858b  ethernet  ens1f0
lo                  48696d42-e906-4b6d-a61c-7e6ede97bdd5  loopback  lo
eno12399np0         5da5d1e4-e608-86ee-04c4-28f6e7b3513c  ethernet  --
eno8303             a9b3121b-8fe7-4ede-98b7-ca092fd28e99  ethernet  --
eno8403             9fc739a5-2ecf-4fd3-9058-b587ca7648c6  ethernet  --
Wired connection 1  c6fd8a75-44e2-30b7-abbe-cdf6c74af19f  ethernet  --
Wired connection 3  0fa14b90-c275-30d0-bea0-5a1f4ab054bf  ethernet  --
Wired connection 4  94baa868-1c40-3857-a0f6-bf4ec421f2c0  ethernet  --
Wired connection 5  244337cd-a706-3305-b0c7-1f5bc66777f5  ethernet  --
```

nmcli during stuck:
```
# nmcli c
NAME                UUID                                  TYPE      DEVICE
eno12399np0         1c6ae39a-76b3-4abb-985e-bcadd512b69e  ethernet  eno12399np0
Wired connection 1  c6fd8a75-44e2-30b7-abbe-cdf6c74af19f  ethernet  eno12409np1
Wired connection 3  0fa14b90-c275-30d0-bea0-5a1f4ab054bf  ethernet  ens1f1
Wired connection 4  94baa868-1c40-3857-a0f6-bf4ec421f2c0  ethernet  ens2f0np0
Wired connection 5  244337cd-a706-3305-b0c7-1f5bc66777f5  ethernet  ens2f1np1
lo                  48696d42-e906-4b6d-a61c-7e6ede97bdd5  loopback  lo
eno12399np0         5da5d1e4-e608-86ee-04c4-28f6e7b3513c  ethernet  --
eno8303             a9b3121b-8fe7-4ede-98b7-ca092fd28e99  ethernet  --
eno8403             9fc739a5-2ecf-4fd3-9058-b587ca7648c6  ethernet  --
```

Interface ens1f0 is the bastion_controlplane_interface. So the issue remains that anyone attempting to rerun setup-bastion from the bastion will get stuck.
It gets a bit worse: if you Ctrl-C and attempt to rerun, you now get stuck at the beginning, during fact gathering.
```
PLAY [Setup bastion machine] *****************************************************************************************************************************************************************
TASK [Gathering Facts] ***********************************************************************************************************************************************************************
Friday 24 October 2025 12:57:47 +0000 (0:00:00.043)       0:00:00.043 ********
(Stuck)
```

In order to unjam it I had to remove the Ansible SSH control path directory (rm -rf /root/.ansible/cp/); afterwards I could finally rerun setup-bastion, and NetworkManager now shows the following connections:
```
# nmcli c
NAME                UUID                                  TYPE      DEVICE
eno12399np0         1c6ae39a-76b3-4abb-985e-bcadd512b69e  ethernet  eno12399np0
jetlag-ens1f0       2bfb6773-9e4f-482c-9ff2-db8c18a4687c  ethernet  ens1f0
podman1             5d7e1f8c-e636-4860-8ebc-6e1f74d61a8b  bridge    podman1
lo                  48696d42-e906-4b6d-a61c-7e6ede97bdd5  loopback  lo
eno12399np0         5da5d1e4-e608-86ee-04c4-28f6e7b3513c  ethernet  --
eno8303             a9b3121b-8fe7-4ede-98b7-ca092fd28e99  ethernet  --
eno8403             9fc739a5-2ecf-4fd3-9058-b587ca7648c6  ethernet  --
Wired connection 1  c6fd8a75-44e2-30b7-abbe-cdf6c74af19f  ethernet  --
Wired connection 2  9c3a3977-810e-32c7-98fe-b9dc4842e2f3  ethernet  --
Wired connection 4  94baa868-1c40-3857-a0f6-bf4ec421f2c0  ethernet  --
Wired connection 5  244337cd-a706-3305-b0c7-1f5bc66777f5  ethernet  --
```

Also any subsequent rerun of setup-bastion still gets stuck.
OK, now I understand, and I was able to reproduce the issue: when the ssh connection used for running ansible-playbook is established over the bastion_controlplane_ip, it gets stuck because the playbook removes the underlying connection. To deal with this scenario I updated the cleanup steps to:
- remove the local nameserver from resolv.conf so that subsequent calls resolve to the lab ip address
- close any ssh connections established over the bastion_controlplane_ip if they exist and point the user to re-run the playbook in this case
This is not ideal, but I think the alternative would be to use a single shell task to deal with both bastion_controlplane_interface cleanup and re-creation, which could become more complex to maintain.
Please let me know what you think.
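For illustration only, a rough sketch of what those two cleanup steps could look like; the resolv.conf regexp, the ss-based check, and the task wording are assumptions, not the PR's actual tasks:

```yaml
# Sketch: drop the local nameserver entry so the bastion hostname resolves
# via the lab DNS again. Assumes the local entry is the controlplane address;
# adjust if your setup uses 127.0.0.1 instead.
- name: Remove local nameserver from resolv.conf
  lineinfile:
    path: /etc/resolv.conf
    regexp: '^nameserver {{ bastion_controlplane_ip }}$'
    state: absent

# Sketch: detect ssh sessions established over the controlplane address and,
# if any exist, stop and ask the user to re-run over the lab address.
- name: Check for ssh sessions over bastion_controlplane_ip
  shell: ss -tn state established '( sport = :22 )' | grep -c "{{ bastion_controlplane_ip }}" || true
  register: cp_ssh_sessions
  changed_when: false

- name: Ask the user to re-run when their ssh session used the controlplane IP
  fail:
    msg: >-
      ssh connections over {{ bastion_controlplane_ip }} were found and will be
      closed by the cleanup; please re-run the playbook over the lab address.
  when: cp_ssh_sessions.stdout | int > 0
```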
Force-pushed 234beac to 767ff0f (compare)
Signed-off-by: Marius Cornea <[email protected]>
Force-pushed 767ff0f to 12b189a (compare)