Collaborator

@mcornea mcornea commented Oct 14, 2025

No description provided.

@openshift-ci openshift-ci bot requested review from akrzos and jtaleric October 14, 2025 07:42
openshift-ci bot commented Oct 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rsevilla87 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@akrzos akrzos left a comment

Seems reasonable to me. I'll try to give this a run in my environment the next time I rebuild.

Comment on lines 4 to 6
- name: Set jetlag connection prefix
  set_fact:
    jetlag_conn_prefix: "jetlag-"
Member

Should we define this in a vars file?

Collaborator Author

Good idea, thanks! I added it to bastion-network/defaults/main/networks.yml, which seems to be a symlink to create-inventory/defaults/main/networks.yml. Please let me know if I should create a separate file.

Member

Just add it to networks.yml under create-inventory. We tried to put the "base" vars files (e.g. dns, networks, storage) under create-inventory defaults/main, as it was the first role that should be run aside from validate-vars. Symlinking was done to avoid duplicating the files everywhere. The original jetlag was made such that you could comment out a role if your deployment failed on a specific step. Hope that helps explain how and where these vars files are symlinked.
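
As a minimal sketch of that suggestion, the prefix could be declared once in the shared defaults file instead of in a set_fact task. The file path and the variable name come from the discussion above; the comments are illustrative, and the rest of the file's contents are not shown here.

```yaml
# create-inventory/defaults/main/networks.yml
# (reported above to be reachable from bastion-network via a symlink)

# Prefix given to NetworkManager connections that jetlag creates, so that
# a later run can recognize and remove only those connections.
jetlag_conn_prefix: "jetlag-"
```

With the variable defined as a role default, the "Set jetlag connection prefix" task would no longer be needed, and the value could still be overridden from inventory or extra vars.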

Comment on lines 13 to 18
- name: Remove existing jetlag-managed NetworkManager connections
  nmcli:
    conn_name: "{{ item }}"
    state: absent
  loop: "{{ jetlag_connections }}"
  when: jetlag_connections | length > 0
Member

The problem with this is that it will remove the connection the bastion is likely using for ssh (when you run the playbook on the bastion yourself). While that might not affect CI if the playbook is running from elsewhere, it will affect most other setups.

Collaborator Author

I expect the bastion to use the bastion_lab_interface for SSH connections. This interface isn't removed in this case as we only mark the bastion_controlplane_interface and the bond connections with the prefix.
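
For illustration only, one way a prefix-based jetlag_connections list could be built is sketched below. This is an assumption about the approach, not the PR's actual tasks: the register name nm_connections is hypothetical, while jetlag_conn_prefix and jetlag_connections are the names used in the snippets above.

```yaml
# List the names of all NetworkManager connection profiles, one per line.
- name: List NetworkManager connection names
  command: nmcli -t -f NAME connection show
  register: nm_connections  # hypothetical register name
  changed_when: false

# Keep only the names that start with the jetlag prefix. The "match" test
# anchors at the start of the string, so no explicit ^ is needed.
- name: Select connections carrying the jetlag prefix
  set_fact:
    jetlag_connections: "{{ nm_connections.stdout_lines | select('match', jetlag_conn_prefix | regex_escape) | list }}"
```

Under this scheme a connection named after the lab interface would never be selected, because it does not carry the prefix.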

Member

That is just not what occurs, though. In dns we make the bastion resolve to the address assigned to bastion_controlplane_interface, so future ssh connections go to this address and a rerun will encounter this issue. I actually tested this already and confirmed it.

Collaborator Author

Gotcha, but in case of a rerun the bastion_controlplane_interface connection would be restored in the "Setup bastion on control-plane network" tasks, so I'd expect the future ssh connection to be using that.

One thing that comes to mind is that when running these tasks for the first time on an already configured system we'll need to do a manual cleanup of the existing bastion_controlplane_interface connection which doesn't use the prefix.

Member

Yeah, we probably need to come up with a cleaner method for all of this network configuration, because for single-interface there is already a cleanup of a prior configuration, but only on the selected interface. IIRC this was needed for initial allocations so that we could properly configure the interface to begin with. Where it became messy was when folks selected the wrong bastion controlplane interface and then changed it mid-allocation.

Collaborator Author

I tried to consolidate the cleanup tasks, so I included the connections for all non-lab interfaces in the cleanup at the beginning of the play. Please let me know what you think and whether it's too aggressive.

Member

This still gets stuck when I run it directly from a bastion machine:

TASK [bastion-network : Remove NetworkManager connections for clean reconfiguration] *********************************************************************************************************
Friday 24 October 2025  12:48:15 +0000 (0:00:00.040)       0:00:43.040 ******** 
(Stuck here)

nmcli before stuck:

# nmcli c
NAME                UUID                                  TYPE      DEVICE      
eno12399np0         1c6ae39a-76b3-4abb-985e-bcadd512b69e  ethernet  eno12399np0 
ens1f0              f0188f8b-0f5c-44a7-87f6-2ed141f6858b  ethernet  ens1f0      
lo                  48696d42-e906-4b6d-a61c-7e6ede97bdd5  loopback  lo          
eno12399np0         5da5d1e4-e608-86ee-04c4-28f6e7b3513c  ethernet  --          
eno8303             a9b3121b-8fe7-4ede-98b7-ca092fd28e99  ethernet  --          
eno8403             9fc739a5-2ecf-4fd3-9058-b587ca7648c6  ethernet  --          
Wired connection 1  c6fd8a75-44e2-30b7-abbe-cdf6c74af19f  ethernet  --          
Wired connection 3  0fa14b90-c275-30d0-bea0-5a1f4ab054bf  ethernet  --          
Wired connection 4  94baa868-1c40-3857-a0f6-bf4ec421f2c0  ethernet  --          
Wired connection 5  244337cd-a706-3305-b0c7-1f5bc66777f5  ethernet  --          

nmcli during stuck:

# nmcli c
NAME                UUID                                  TYPE      DEVICE      
eno12399np0         1c6ae39a-76b3-4abb-985e-bcadd512b69e  ethernet  eno12399np0 
Wired connection 1  c6fd8a75-44e2-30b7-abbe-cdf6c74af19f  ethernet  eno12409np1 
Wired connection 3  0fa14b90-c275-30d0-bea0-5a1f4ab054bf  ethernet  ens1f1      
Wired connection 4  94baa868-1c40-3857-a0f6-bf4ec421f2c0  ethernet  ens2f0np0   
Wired connection 5  244337cd-a706-3305-b0c7-1f5bc66777f5  ethernet  ens2f1np1   
lo                  48696d42-e906-4b6d-a61c-7e6ede97bdd5  loopback  lo          
eno12399np0         5da5d1e4-e608-86ee-04c4-28f6e7b3513c  ethernet  --          
eno8303             a9b3121b-8fe7-4ede-98b7-ca092fd28e99  ethernet  --          
eno8403             9fc739a5-2ecf-4fd3-9058-b587ca7648c6  ethernet  -- 

Interface ens1f0 is the bastion_controlplane_interface. So the issue remains that anyone attempting to rerun setup-bastion from the bastion will get stuck.

It gets a bit worse, because if you control-c and attempt to rerun, you now get stuck at the beginning, during fact gathering.

PLAY [Setup bastion machine] *****************************************************************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************************************************************
Friday 24 October 2025  12:57:47 +0000 (0:00:00.043)       0:00:00.043 ******** 
(Stuck)

To unjam it I had to remove this ansible directory: rm -rf /root/.ansible/cp/. Afterwards I could finally rerun setup-bastion, and NetworkManager now shows the following connections:

# nmcli c
NAME                UUID                                  TYPE      DEVICE      
eno12399np0         1c6ae39a-76b3-4abb-985e-bcadd512b69e  ethernet  eno12399np0 
jetlag-ens1f0       2bfb6773-9e4f-482c-9ff2-db8c18a4687c  ethernet  ens1f0      
podman1             5d7e1f8c-e636-4860-8ebc-6e1f74d61a8b  bridge    podman1     
lo                  48696d42-e906-4b6d-a61c-7e6ede97bdd5  loopback  lo          
eno12399np0         5da5d1e4-e608-86ee-04c4-28f6e7b3513c  ethernet  --          
eno8303             a9b3121b-8fe7-4ede-98b7-ca092fd28e99  ethernet  --          
eno8403             9fc739a5-2ecf-4fd3-9058-b587ca7648c6  ethernet  --          
Wired connection 1  c6fd8a75-44e2-30b7-abbe-cdf6c74af19f  ethernet  --          
Wired connection 2  9c3a3977-810e-32c7-98fe-b9dc4842e2f3  ethernet  --          
Wired connection 4  94baa868-1c40-3857-a0f6-bf4ec421f2c0  ethernet  --          
Wired connection 5  244337cd-a706-3305-b0c7-1f5bc66777f5  ethernet  -- 

Also, any subsequent rerun of setup-bastion still gets stuck.

Collaborator Author

OK, now I understand and I was able to reproduce the issue: when the ssh connection used for running ansible-playbook is established over the bastion_controlplane_ip, it gets stuck because the playbook removes the underlying connection. To deal with this scenario I updated the cleanup steps to:

  • remove the local nameserver from resolv.conf so that subsequent calls resolve to the lab ip address

  • close any ssh connections established over the bastion_controlplane_ip if they exist and point the user to re-run the playbook in this case

This is not ideal, but I think the alternative would be to use a single shell task to handle both the bastion_controlplane_interface cleanup and its re-creation, which could become more complex to maintain.

Please let me know what you think.
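
A rough sketch of the two cleanup steps listed above, written as Ansible tasks, might look like the following. It is not the PR's implementation. It assumes that the local nameserver entry in /etc/resolv.conf is the bastion_controlplane_ip address, and that facts were gathered in a way that exposes SSH_CONNECTION in ansible_env (privilege escalation may strip it). Note also that it stops the play with a message rather than closing the ssh connection as described above.

```yaml
# Assumption: the local nameserver line points at bastion_controlplane_ip.
- name: Remove the local nameserver so the bastion resolves to the lab address
  lineinfile:
    path: /etc/resolv.conf
    regexp: '^nameserver\s+{{ bastion_controlplane_ip | regex_escape }}\s*$'
    state: absent

# SSH_CONNECTION holds "client_ip client_port server_ip server_port",
# so index 2 is the local address that this session is connected to.
- name: Stop and ask for a rerun if this session uses the control-plane address
  fail:
    msg: >-
      This ssh session is connected over {{ bastion_controlplane_ip }}.
      Reconnect using the lab address and rerun the playbook.
  when:
    - ansible_env.SSH_CONNECTION is defined
    - ansible_env.SSH_CONNECTION.split()[2] == bastion_controlplane_ip
```

Failing early keeps the playbook from removing the connection underneath its own session, at the cost of requiring one manual rerun.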

@mcornea mcornea force-pushed the nmcli_cleanup branch 2 times, most recently from 234beac to 767ff0f October 17, 2025 08:57