
Conversation

@akrzos (Member) commented Aug 6, 2025

No description provided.

openshift-ci bot commented Aug 6, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Aug 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign josecastillolema for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@akrzos (Member, Author) commented Aug 6, 2025

/test ?

openshift-ci bot commented Aug 6, 2025

@akrzos: The following commands are available to trigger required jobs:

/test deploy-compact
/test deploy-mno
/test deploy-mno-hybrid
/test deploy-sno
/test deploy-sno-self-sched

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@akrzos (Member, Author) commented Aug 6, 2025

/test deploy-sno-self-sched

@akrzos (Member, Author) commented Aug 8, 2025

Seems it failed on a timeout:

Process did not finish before Xh0m0s timeout

Would be nice if it didn't obscure what the timeout is.

@akrzos (Member, Author) commented Aug 8, 2025

/test deploy-sno-self-sched

@akrzos (Member, Author) commented Aug 11, 2025

cc @josecastillolema What is the prow timeout for the self-schedule job?

@akrzos (Member, Author) commented Aug 11, 2025

/test deploy-sno

1 similar comment
@akrzos (Member, Author) commented Aug 11, 2025

/test deploy-sno

@akrzos (Member, Author) commented Aug 11, 2025

/test deploy-mno

@josecastillolema (Member) commented

I'm back! Let's see:

 Resetting IDRAC of server f19-h03-000-rXX0.rduX.XXXXXXXX.redhat.com ...
Traceback (most recent call last):
  File "/usr/local/bin/badfish", line 7, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line 30X1, in main
    _host, result = loop.run_until_complete(
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line XXX, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line XXX9, in execute_badfish
    badfish = await badfish_factory(
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line X1, in badfish_factory
    await badfish.init()
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line 7X, in init
    self.system_resource = await self.find_systems_resource()
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line XX1, in find_systems_resource
    data = json.loads(raw.strip())
  File "/usr/local/lib/python3.9/json/__init__.py", line 3XX, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.9/json/decoder.py", line 3XX, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) 

@akrzos I think the default timeout per step is 2 hours

@josecastillolema (Member) commented

/test deploy-sno

@josecastillolema (Member) commented

Again:

 ✓ IDRAC for server f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com is ready
IDRAC reset and readiness check completed
Clearing job queue ...
Clear job queue of server f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com ...
- ERROR    - Failed to communicate with mgmt-f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com 

Maybe we should flip the order: first clear the job queue, then reset the iDRAC? Or with the iDRAC reset maybe we don't need to clear the job queue at all? Wdyt @mcornea?

@mcornea (Collaborator) commented Aug 12, 2025

> Again:
>
>  ✓ IDRAC for server f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com is ready
> IDRAC reset and readiness check completed
> Clearing job queue ...
> Clear job queue of server f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com ...
> - ERROR    - Failed to communicate with mgmt-f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com
>
> Maybe we should flip the order? First clear job queue and then IDRAC reset? Or with the IDRAC reset maybe we don't need to clear the job at all? Wdyt @mcornea ?

The error in this specific case looks like a race condition: there are no retries waiting for the iDRAC to become ready, which means the reboot hadn't even started when the ready check ran. I sent openshift/release#68061 to try to address this issue.
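The retry fix described above could be sketched as an Ansible task along these lines; this is a minimal illustration, assuming the readiness check polls the Redfish manager endpoint with the `uri` module. The hostnames, credential variables, and retry counts are assumptions, not the values used by the actual openshift/release step.

```yaml
# Hypothetical readiness check with retries after an iDRAC reset.
# bmc_address, bmc_user, and bmc_password are illustrative variables.
- name: Wait for iDRAC to become ready after reset
  ansible.builtin.uri:
    url: "https://{{ bmc_address }}/redfish/v1/Managers/iDRAC.Embedded.1"
    user: "{{ bmc_user }}"
    password: "{{ bmc_password }}"
    method: GET
    force_basic_auth: true
    validate_certs: false
    status_code: 200
  register: idrac_ready
  until: idrac_ready.status == 200
  retries: 30          # keep polling instead of failing on the first check
  delay: 10
```

With a loop like this, the job queue clear would only run once the iDRAC has actually come back, avoiding the "Failed to communicate" race seen in the logs.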

@mcornea (Collaborator) commented Aug 13, 2025

/test deploy-sno

openshift-ci bot commented Aug 13, 2025

@akrzos: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/deploy-sno-self-sched | 3969d7d | link | true | /test deploy-sno-self-sched |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@josecastillolema (Member) commented

fatal: [f19-h01-000-rXX0.rduX.XXXXXXXX.redhat.com]: FAILED! => {"msg": "The conditional check 'cluster.json.hosts | length == inventory_nodes | length' failed. The error was: error while evaluating conditional (cluster.json.hosts | length == inventory_nodes | length): 'dict object' has no attribute 'json'. 'dict object' has no attribute 'json'"}

Any ideas?

@mcornea (Collaborator) commented Aug 13, 2025

> fatal: [f19-h01-000-rXX0.rduX.XXXXXXXX.redhat.com]: FAILED! => {"msg": "The conditional check 'cluster.json.hosts | length == inventory_nodes | length' failed. The error was: error while evaluating conditional (cluster.json.hosts | length == inventory_nodes | length): 'dict object' has no attribute 'json'. 'dict object' has no attribute 'json'"}
>
> Any ideas?

I'd say this means the URL didn't return JSON at the time the task was executed. Let's try to re-run to confirm it's a race, and then we can make the task more robust, e.g. by extending the `until` condition with `cluster.json is defined`.
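The hardened wait condition suggested above could look roughly like this; a sketch only, since the PR's actual task file is not shown here. The URL variable and retry counts are assumptions, while the host-count comparison mirrors the failing conditional from the error message.

```yaml
# Hypothetical version of the failing task: only compare host counts
# once the response has actually parsed as JSON.
- name: Wait for all inventory nodes to register with the cluster
  ansible.builtin.uri:
    url: "{{ cluster_api_url }}"   # illustrative variable name
    return_content: true
  register: cluster
  until: cluster.json is defined and (cluster.json.hosts | length == inventory_nodes | length)
  retries: 60
  delay: 10
```

Because `until` is evaluated on every attempt, a non-JSON response simply triggers another retry instead of raising the `'dict object' has no attribute 'json'` error.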

@mcornea (Collaborator) commented Aug 13, 2025

/test deploy-sno

@mcornea (Collaborator) commented Aug 13, 2025

> > fatal: [f19-h01-000-rXX0.rduX.XXXXXXXX.redhat.com]: FAILED! => {"msg": "The conditional check 'cluster.json.hosts | length == inventory_nodes | length' failed. The error was: error while evaluating conditional (cluster.json.hosts | length == inventory_nodes | length): 'dict object' has no attribute 'json'. 'dict object' has no attribute 'json'"}
> >
> > Any ideas?
>
> I'd say this means the URL didn't return a json at the time the task was executed. Let try to re-run to confirm it's a race and we can make the task more robust, e.g extending the until condition with cluster.json is defined

Looks like the issue didn't reproduce this time. I sent #681 to extend the wait condition.

@josecastillolema (Member) commented

/test deploy-mno

@josecastillolema (Member) commented Aug 13, 2025

I was thinking: in order to test both paths of this PR, once we get both the "normal" deploy-sno and deploy-mno jobs working (the happy path), we could just mount a virtual media image in the jetlag CI cluster. Wdyt @akrzos?

@akrzos akrzos marked this pull request as ready for review August 14, 2025 12:27
@openshift-ci openshift-ci bot requested a review from rsevilla87 August 14, 2025 12:27
@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from 3969d7d to 1a768ed Compare August 14, 2025 13:05
@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from 1a768ed to 838ac45 Compare August 15, 2025 12:41
@akrzos (Member, Author) commented Aug 18, 2025

@josecastillolema any further in testing and validating this?

@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from 838ac45 to 90c1c71 Compare August 18, 2025 13:25
@josecastillolema (Member) commented

Testing it from: openshift/release#67885

@akrzos akrzos force-pushed the fix_stuck_virtual_media branch 2 times, most recently from d7eb2f9 to f7eec43 Compare August 21, 2025 12:32
@akrzos (Member, Author) commented Sep 4, 2025

Hi @josecastillolema What is the status of your testing of this PR?

@josecastillolema (Member) commented

Yeah, I was waiting for you to come back. This was the latest run; from the logs, are you able to tell if the workaround implemented in this PR worked properly?

The deploy failed for an unrelated reason on a later task.

@akrzos (Member, Author) commented Sep 4, 2025

> Yeah was waiting for you to come back. This was the latest run, from the logs are you able to tell if the workaround implemented in this PR worked properly?
>
> The deploy failed because of an unrelated reason on a latter task.

In the future, feel free to not wait for me to determine if the patch ran or not.

I looked at the output and searched for the newly added tasks and found this:

TASK [boot-iso : Dell - Eject any CD Virtual Media] ****************************
Thursday X1 August X0X5  1X:13:XX +0000 (0:00:15.X81)       0:05:35.878 ******* 
fatal: [fXX-h11-000-rX30.rduX.XXXXXXXX.redhat.com]: FAILED! => {"accept_ranges": "bytes", "cache_control": "no-cache", "changed": false, "connection": "close", "content": "{\"error\":{\"@Message.ExtendedInfo\":[{\"Message\":\"No Virtual Media devices are currently connected.\",\"MessageArgs\":[],\"[email protected]\":0,\"MessageId\":\"IDRAC.1.X.VRM0009\",\"RelatedProperties\":[],\"[email protected]\":0,\"Resolution\":\"No response action is required.\",\"Severity\":\"Critical\"},{\"Message\":\"The request failed due to an internal service error.  The service is still operational.\",\"MessageArgs\":[],\"[email protected]\":0,\"MessageId\":\"Base.1.X.InternalError\",\"RelatedProperties\":[],\"[email protected]\":0,\"Resolution\":\"Resubmit the request.  If the problem persists, consider resetting the service.\",\"Severity\":\"Critical\"}],\"code\":\"Base.1.X.GeneralError\",\"message\":\"A general error has occurred. See ExtendedInfo for more information\"}}\n", "content_length": "77X", "content_type": "application/json;odata.metadata=minimal;charset=utf-8", "date": "Thu, X1 Aug X0X5 19:13:50 GMT", "elapsed": 8, "json": {"error": {"@Message.ExtendedInfo": [{"Message": "No Virtual Media devices are currently connected.", "MessageArgs": [], "[email protected]": 0, "MessageId": "IDRAC.1.X.VRM0009", "RelatedProperties": [], "[email protected]": 0, "Resolution": "No response action is required.", "Severity": "Critical"}, {"Message": "The request failed due to an internal service error.  The service is still operational.", "MessageArgs": [], "[email protected]": 0, "MessageId": "Base.1.X.InternalError", "RelatedProperties": [], "[email protected]": 0, "Resolution": "Resubmit the request.  If the problem persists, consider resetting the service.", "Severity": "Critical"}], "code": "Base.1.X.GeneralError", "message": "A general error has occurred. 
See ExtendedInfo for more information"}}, "msg": "Status code was 500 and not [X0X]: HTTP Error 500: Internal Server Error", "odata_version": "X.0", "redirected": false, "server": "iDRAC/8", "status": 500, "strict_transport_security": "max-age=X307X000", "url": "https://mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia", "vary": "Accept-Encoding", "x_frame_options": "SAMEORIGIN"}

TASK [boot-iso : Force mount of a existing image] ******************************
Thursday X1 August X0X5  1X:13:50 +0000 (0:00:08.X78)       0:05:XX.557 ******* 
changed: [fXX-h11-000-rX30.rduX.XXXXXXXX.redhat.com -> mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com] => {"changed": true, "rc": 0, "stderr": "Warning: Permanently added 'mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com' (ECDSA) to the list of known hosts.\r\nShared connection to mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com closed.\r\n", "stderr_lines": ["Warning: Permanently added 'mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com' (ECDSA) to the list of known hosts.", "Shared connection to mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com closed."], "stdout": "Remote Image is now Configured\r\n", "stdout_lines": ["Remote Image is now Configured"]}

TASK [boot-iso : Force unmount of the existing image] **************************
Thursday X1 August X0X5  1X:13:5X +0000 (0:00:05.89X)       0:05:50.XX9 ******* 
changed: [fXX-h11-000-rX30.rduX.XXXXXXXX.redhat.com -> mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com] => {"changed": true, "rc": 0, "stderr": "Shared connection to mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com closed.\r\n", "stderr_lines": ["Shared connection to mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com closed."], "stdout": "Disable Remote File Started. Please check status using -s\r\noption to know Remote File Share is ENABLED or DISABLED.\r\n", "stdout_lines": ["Disable Remote File Started. Please check status using -s", "option to know Remote File Share is ENABLED or DISABLED."]}

I see the initial failure for the task `boot-iso : Dell - Eject any CD Virtual Media`, and then the two new rescue tasks were executed: `boot-iso : Force mount of a existing image` and `boot-iso : Force unmount of the existing image`. It seems to me they are working properly.
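The eject-plus-rescue flow described above could be structured roughly as an Ansible `block`/`rescue` pair; this is a hedged sketch, not the PR's exact tasks. The `racadm remoteimage` fallback matches the "Remote Image is now Configured" / "Disable Remote File Started" output in the logs, but the variables, URL, and delegation target are assumptions for illustration.

```yaml
# Sketch of the workaround: attempt the Redfish eject; if the iDRAC
# returns 500 (stuck virtual media), force a remote image mount and
# then unmount via racadm to clear the stale state.
- name: Dell - eject CD virtual media, with force mount/unmount rescue
  block:
    - name: Dell - Eject any CD Virtual Media
      ansible.builtin.uri:
        url: "https://{{ bmc_address }}/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia"
        method: POST
        body_format: json
        body: {}
        status_code: 204
        validate_certs: false
  rescue:
    - name: Force mount of a existing image
      ansible.builtin.shell: racadm remoteimage -c -l "{{ placeholder_iso_url }}"
      delegate_to: "{{ bmc_address }}"

    - name: Force unmount of the existing image
      ansible.builtin.shell: racadm remoteimage -d
      delegate_to: "{{ bmc_address }}"
```

This matches the log sequence in the comment above: the eject task fails with HTTP 500, then the two rescue tasks run and report the remote image configured and disabled.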

@josecastillolema (Member) commented

Then this PR lgtm! I think we are good to merge

Code context for the review thread below:

    status_code: 204
    return_content: yes

    # # Eject just the found image
Review comment (Member):

Should we delete this commented block?

Reply (Member Author):

Sure, I only left it in there in case we needed it, but I guess years later we still haven't used it.

@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from f7eec43 to 23b93e0 Compare September 5, 2025 12:45
@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from 23b93e0 to e3b8a5c Compare September 5, 2025 12:48