
Conversation

@akrzos (Member) commented Aug 6, 2025

No description provided.

openshift-ci bot commented Aug 6, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Aug 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign josecastillolema for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@akrzos (Member, Author) commented Aug 6, 2025

/test ?

openshift-ci bot commented Aug 6, 2025

@akrzos: The following commands are available to trigger required jobs:

/test deploy-compact
/test deploy-mno
/test deploy-mno-hybrid
/test deploy-sno
/test deploy-sno-self-sched

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@akrzos (Member, Author) commented Aug 6, 2025

/test deploy-sno-self-sched

@akrzos (Member, Author) commented Aug 8, 2025

Seems it failed on a timeout:

Process did not finish before Xh0m0s timeout

Would be nice if it didn't obscure what the timeout is.

@akrzos (Member, Author) commented Aug 8, 2025

/test deploy-sno-self-sched

@akrzos (Member, Author) commented Aug 11, 2025

cc @josecastillolema What is the prow timeout for the self-schedule job?

@akrzos (Member, Author) commented Aug 11, 2025

/test deploy-sno

1 similar comment
@akrzos (Member, Author) commented Aug 11, 2025

/test deploy-sno

@akrzos (Member, Author) commented Aug 11, 2025

/test deploy-mno

@josecastillolema (Member) commented

I'm back! Let's see:

 Resetting IDRAC of server f19-h03-000-rXX0.rduX.XXXXXXXX.redhat.com ...
Traceback (most recent call last):
  File "/usr/local/bin/badfish", line 7, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line 30X1, in main
    _host, result = loop.run_until_complete(
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line XXX, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line XXX9, in execute_badfish
    badfish = await badfish_factory(
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line X1, in badfish_factory
    await badfish.init()
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line 7X, in init
    self.system_resource = await self.find_systems_resource()
  File "/usr/local/lib/python3.9/site-packages/badfish/main.py", line XX1, in find_systems_resource
    data = json.loads(raw.strip())
  File "/usr/local/lib/python3.9/json/__init__.py", line 3XX, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.9/json/decoder.py", line 3XX, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) 

@akrzos I think the default timeout per step is 2 hours

@josecastillolema (Member) commented

/test deploy-sno

@josecastillolema (Member) commented

Again:

 ✓ IDRAC for server f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com is ready
IDRAC reset and readiness check completed
Clearing job queue ...
Clear job queue of server f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com ...
- ERROR    - Failed to communicate with mgmt-f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com 

Maybe we should flip the order: first clear the job queue, then reset the iDRAC? Or with the iDRAC reset maybe we don't need to clear the job queue at all? Wdyt @mcornea?

@mcornea (Collaborator) commented Aug 12, 2025

> Again:
>
>  ✓ IDRAC for server f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com is ready
> IDRAC reset and readiness check completed
> Clearing job queue ...
> Clear job queue of server f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com ...
> - ERROR    - Failed to communicate with mgmt-f19-h0X-000-rXX0.rduX.XXXXXXXX.redhat.com
>
> Maybe we should flip the order? First clear job queue and then IDRAC reset? Or with the IDRAC reset maybe we don't need to clear the job at all? Wdyt @mcornea ?

The error in this specific case looks like a race condition: there are no retries waiting for the iDRAC to become ready, which means the reboot hadn't even started when the ready check ran. I sent openshift/release#68061 to try to address this issue.
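The retry fix described above could be sketched as an Ansible task along these lines; this is a minimal illustration, assuming the readiness check polls the Redfish manager endpoint with the `uri` module. The hostnames, credential variables, and retry counts are assumptions, not the values used by the actual openshift/release step.

```yaml
# Hypothetical readiness check with retries after an iDRAC reset.
# bmc_address, bmc_user, and bmc_password are illustrative variables.
- name: Wait for iDRAC to become ready after reset
  ansible.builtin.uri:
    url: "https://{{ bmc_address }}/redfish/v1/Managers/iDRAC.Embedded.1"
    user: "{{ bmc_user }}"
    password: "{{ bmc_password }}"
    method: GET
    force_basic_auth: true
    validate_certs: false
    status_code: 200
  register: idrac_ready
  until: idrac_ready.status == 200
  retries: 30          # keep polling instead of failing on the first check
  delay: 10
```

With a loop like this, the job queue clear would only run once the iDRAC has actually come back, avoiding the "Failed to communicate" race seen in the logs.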

@mcornea (Collaborator) commented Aug 13, 2025

/test deploy-sno

openshift-ci bot commented Aug 13, 2025

@akrzos: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/deploy-sno-self-sched | 3969d7d | link | true | /test deploy-sno-self-sched |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@josecastillolema (Member) commented

fatal: [f19-h01-000-rXX0.rduX.XXXXXXXX.redhat.com]: FAILED! => {"msg": "The conditional check 'cluster.json.hosts | length == inventory_nodes | length' failed. The error was: error while evaluating conditional (cluster.json.hosts | length == inventory_nodes | length): 'dict object' has no attribute 'json'. 'dict object' has no attribute 'json'"}

Any ideas?

@mcornea (Collaborator) commented Aug 13, 2025

> fatal: [f19-h01-000-rXX0.rduX.XXXXXXXX.redhat.com]: FAILED! => {"msg": "The conditional check 'cluster.json.hosts | length == inventory_nodes | length' failed. The error was: error while evaluating conditional (cluster.json.hosts | length == inventory_nodes | length): 'dict object' has no attribute 'json'. 'dict object' has no attribute 'json'"}
>
> Any ideas?

I'd say this means the URL didn't return JSON at the time the task was executed. Let's try to re-run to confirm it's a race, and then we can make the task more robust, e.g. by extending the `until` condition with `cluster.json is defined`.
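The hardened wait condition suggested above could look roughly like this; a sketch only, since the PR's actual task file is not shown here. The URL variable and retry counts are assumptions, while the host-count comparison mirrors the failing conditional from the error message.

```yaml
# Hypothetical version of the failing task: only compare host counts
# once the response has actually parsed as JSON.
- name: Wait for all inventory nodes to register with the cluster
  ansible.builtin.uri:
    url: "{{ cluster_api_url }}"   # illustrative variable name
    return_content: true
  register: cluster
  until: cluster.json is defined and (cluster.json.hosts | length == inventory_nodes | length)
  retries: 60
  delay: 10
```

Because `until` is evaluated on every attempt, a non-JSON response simply triggers another retry instead of raising the `'dict object' has no attribute 'json'` error.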

@mcornea (Collaborator) commented Aug 13, 2025

/test deploy-sno

@mcornea (Collaborator) commented Aug 13, 2025

> > fatal: [f19-h01-000-rXX0.rduX.XXXXXXXX.redhat.com]: FAILED! => {"msg": "The conditional check 'cluster.json.hosts | length == inventory_nodes | length' failed. The error was: error while evaluating conditional (cluster.json.hosts | length == inventory_nodes | length): 'dict object' has no attribute 'json'. 'dict object' has no attribute 'json'"}
> >
> > Any ideas?
>
> I'd say this means the URL didn't return a json at the time the task was executed. Let try to re-run to confirm it's a race and we can make the task more robust, e.g extending the until condition with cluster.json is defined

Looks like the issue didn't reproduce this time. I sent #681 to extend the wait condition.

@josecastillolema (Member) commented

/test deploy-mno

@josecastillolema (Member) commented Aug 13, 2025

I was thinking: in order to test both paths of this PR, once we get both the "normal" deploy-sno and deploy-mno jobs working (the happy path), we could just mount a virtual media image in the jetlag CI cluster. Wdyt @akrzos?

@akrzos akrzos marked this pull request as ready for review August 14, 2025 12:27
@openshift-ci openshift-ci bot requested a review from rsevilla87 August 14, 2025 12:27
@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from 3969d7d to 1a768ed Compare August 14, 2025 13:05
@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from 1a768ed to 838ac45 Compare August 15, 2025 12:41
@akrzos (Member, Author) commented Aug 18, 2025

@josecastillolema any further in testing and validating this?

@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from 838ac45 to 90c1c71 Compare August 18, 2025 13:25
@josecastillolema (Member) commented

Testing it from: openshift/release#67885

@akrzos akrzos force-pushed the fix_stuck_virtual_media branch 2 times, most recently from d7eb2f9 to f7eec43 Compare August 21, 2025 12:32
@akrzos (Member, Author) commented Sep 4, 2025

Hi @josecastillolema What is the status of your testing of this PR?

@josecastillolema (Member) commented

Yeah, I was waiting for you to come back. This was the latest run; from the logs, are you able to tell if the workaround implemented in this PR worked properly?

The deploy failed for an unrelated reason on a later task.

@akrzos (Member, Author) commented Sep 4, 2025

> Yeah was waiting for you to come back. This was the latest run, from the logs are you able to tell if the workaround implemented in this PR worked properly?
>
> The deploy failed because of an unrelated reason on a latter task.

In the future, feel free to not wait for me to determine if the patch ran or not.

I looked at the output and searched for the newly added tasks and found this:

TASK [boot-iso : Dell - Eject any CD Virtual Media] ****************************
Thursday X1 August X0X5  1X:13:XX +0000 (0:00:15.X81)       0:05:35.878 ******* 
fatal: [fXX-h11-000-rX30.rduX.XXXXXXXX.redhat.com]: FAILED! => {"accept_ranges": "bytes", "cache_control": "no-cache", "changed": false, "connection": "close", "content": "{\"error\":{\"@Message.ExtendedInfo\":[{\"Message\":\"No Virtual Media devices are currently connected.\",\"MessageArgs\":[],\"[email protected]\":0,\"MessageId\":\"IDRAC.1.X.VRM0009\",\"RelatedProperties\":[],\"[email protected]\":0,\"Resolution\":\"No response action is required.\",\"Severity\":\"Critical\"},{\"Message\":\"The request failed due to an internal service error.  The service is still operational.\",\"MessageArgs\":[],\"[email protected]\":0,\"MessageId\":\"Base.1.X.InternalError\",\"RelatedProperties\":[],\"[email protected]\":0,\"Resolution\":\"Resubmit the request.  If the problem persists, consider resetting the service.\",\"Severity\":\"Critical\"}],\"code\":\"Base.1.X.GeneralError\",\"message\":\"A general error has occurred. See ExtendedInfo for more information\"}}\n", "content_length": "77X", "content_type": "application/json;odata.metadata=minimal;charset=utf-8", "date": "Thu, X1 Aug X0X5 19:13:50 GMT", "elapsed": 8, "json": {"error": {"@Message.ExtendedInfo": [{"Message": "No Virtual Media devices are currently connected.", "MessageArgs": [], "[email protected]": 0, "MessageId": "IDRAC.1.X.VRM0009", "RelatedProperties": [], "[email protected]": 0, "Resolution": "No response action is required.", "Severity": "Critical"}, {"Message": "The request failed due to an internal service error.  The service is still operational.", "MessageArgs": [], "[email protected]": 0, "MessageId": "Base.1.X.InternalError", "RelatedProperties": [], "[email protected]": 0, "Resolution": "Resubmit the request.  If the problem persists, consider resetting the service.", "Severity": "Critical"}], "code": "Base.1.X.GeneralError", "message": "A general error has occurred. 
See ExtendedInfo for more information"}}, "msg": "Status code was 500 and not [X0X]: HTTP Error 500: Internal Server Error", "odata_version": "X.0", "redirected": false, "server": "iDRAC/8", "status": 500, "strict_transport_security": "max-age=X307X000", "url": "https://mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia", "vary": "Accept-Encoding", "x_frame_options": "SAMEORIGIN"}

TASK [boot-iso : Force mount of a existing image] ******************************
Thursday X1 August X0X5  1X:13:50 +0000 (0:00:08.X78)       0:05:XX.557 ******* 
changed: [fXX-h11-000-rX30.rduX.XXXXXXXX.redhat.com -> mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com] => {"changed": true, "rc": 0, "stderr": "Warning: Permanently added 'mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com' (ECDSA) to the list of known hosts.\r\nShared connection to mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com closed.\r\n", "stderr_lines": ["Warning: Permanently added 'mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com' (ECDSA) to the list of known hosts.", "Shared connection to mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com closed."], "stdout": "Remote Image is now Configured\r\n", "stdout_lines": ["Remote Image is now Configured"]}

TASK [boot-iso : Force unmount of the existing image] **************************
Thursday X1 August X0X5  1X:13:5X +0000 (0:00:05.89X)       0:05:50.XX9 ******* 
changed: [fXX-h11-000-rX30.rduX.XXXXXXXX.redhat.com -> mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com] => {"changed": true, "rc": 0, "stderr": "Shared connection to mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com closed.\r\n", "stderr_lines": ["Shared connection to mgmt-fXX-h1X-000-rX30.rduX.XXXXXXXX.redhat.com closed."], "stdout": "Disable Remote File Started. Please check status using -s\r\noption to know Remote File Share is ENABLED or DISABLED.\r\n", "stdout_lines": ["Disable Remote File Started. Please check status using -s", "option to know Remote File Share is ENABLED or DISABLED."]}

I see the initial failure for the task `boot-iso : Dell - Eject any CD Virtual Media`, and then the two new rescue tasks were executed: `boot-iso : Force mount of a existing image` and `boot-iso : Force unmount of the existing image`. It seems to me they are working properly.
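The eject-plus-rescue flow described above could be structured roughly as an Ansible `block`/`rescue` pair; this is a hedged sketch, not the PR's exact tasks. The `racadm remoteimage` fallback matches the "Remote Image is now Configured" / "Disable Remote File Started" output in the logs, but the variables, URL, and delegation target are assumptions for illustration.

```yaml
# Sketch of the workaround: attempt the Redfish eject; if the iDRAC
# returns 500 (stuck virtual media), force a remote image mount and
# then unmount via racadm to clear the stale state.
- name: Dell - eject CD virtual media, with force mount/unmount rescue
  block:
    - name: Dell - Eject any CD Virtual Media
      ansible.builtin.uri:
        url: "https://{{ bmc_address }}/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia"
        method: POST
        body_format: json
        body: {}
        status_code: 204
        validate_certs: false
  rescue:
    - name: Force mount of a existing image
      ansible.builtin.shell: racadm remoteimage -c -l "{{ placeholder_iso_url }}"
      delegate_to: "{{ bmc_address }}"

    - name: Force unmount of the existing image
      ansible.builtin.shell: racadm remoteimage -d
      delegate_to: "{{ bmc_address }}"
```

This matches the log sequence in the comment above: the eject task fails with HTTP 500, then the two rescue tasks run and report the remote image configured and disabled.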

@josecastillolema (Member) commented

Then this PR lgtm! I think we are good to merge

Code context for the review thread below:

    status_code: 204
    return_content: yes

    # # Eject just the found image
Review comment (Member):

Should we delete this commented block?

Reply (Member Author):

Sure, I only left it in there in case we needed it, but I guess years later we still haven't used it.

@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from f7eec43 to 23b93e0 Compare September 5, 2025 12:45
@akrzos akrzos force-pushed the fix_stuck_virtual_media branch from 23b93e0 to e3b8a5c Compare September 5, 2025 12:48