
Conversation

@mdbooth
Contributor

@mdbooth mdbooth commented Jul 4, 2025

This makes a number of substantial changes to run-resourcewatch, but does not change the output format, or the behaviour of any existing commands.

  • run-resourcewatch: re-implement watch loop

This re-implements both how we observe resources and the write loop. We no longer use informers; instead we use an explicit list and watch. We also de-duplicate resources and reconstruct missed deletes.
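A minimal sketch of what missed-delete reconstruction can look like after a relist: anything previously observed that no longer appears in the fresh list must have been deleted while the watch was down. The `objectKey` type and `missedDeletes` helper below are illustrative assumptions, not the PR's actual code.

```go
package resourcewatch

// objectKey identifies an observed resource; a real implementation would also
// carry group/version/kind.
type objectKey struct {
	Namespace string
	Name      string
}

// missedDeletes returns the keys we knew about that are absent from the latest
// list, i.e. deletes whose watch events we never saw.
func missedDeletes(known map[objectKey]struct{}, listed []objectKey) []objectKey {
	current := make(map[objectKey]struct{}, len(listed))
	for _, k := range listed {
		current[k] = struct{}{}
	}
	var deleted []objectKey
	for k := range known {
		if _, ok := current[k]; !ok {
			deleted = append(deleted, k)
		}
	}
	return deleted
}
```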

Each resource now has its own goroutine, but all resource observations are pushed to a single channel. This means that writes to the git repository are now serialised. This is a natural fit, as git does not support concurrent writers in any case.

We buffer between resource observations and writing to git by making the resource channel very large, which means any write backlog is buffered in memory. However, with the performance improvements to git writing in this PR, the writer seems able to keep up. Hopefully this will not result in excessive memory usage in practice.
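A rough sketch of that fan-in shape, with hypothetical names (`observation`, `collect`, and the channel capacity) standing in for the PR's actual types:

```go
package resourcewatch

import "sync"

type observation struct {
	resource string
	payload  []byte
}

// collect starts one watcher goroutine per resource, funnels every observation
// into a single large buffered channel, and drains that channel from exactly
// one writer so git never sees concurrent writes.
func collect(resources []string, watch func(string, chan<- observation), write func(observation)) {
	obsCh := make(chan observation, 100000) // large buffer absorbs any write backlog

	var wg sync.WaitGroup
	for _, r := range resources {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			watch(r, obsCh) // each resource is observed independently
		}(r)
	}

	// Close the channel once every watcher has finished.
	go func() {
		wg.Wait()
		close(obsCh)
	}()

	// Single consumer: all writes to the git repository are serialised here.
	for obs := range obsCh {
		write(obs)
	}
}
```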

  • run-resourcewatch: Add --to-json and --from-json

These are additional, mutually exclusive command line options.

--to-json causes run-resourcewatch to write resource observations as a stream of json objects to a single file.

--from-json causes run-resourcewatch to read resource observations from a file instead of from a cluster.

These would allow the git repository to be created as a post-processing step if required. Hopefully that will not be necessary, but these options have also been invaluable in testing.
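A hedged sketch of the streaming idea behind these options, using one JSON object per observation with encoding/json; the field names here are assumptions, not the PR's actual schema.

```go
package resourcewatch

import (
	"encoding/json"
	"io"
	"time"
)

// jsonObservation is an illustrative on-disk record; the real format may differ.
type jsonObservation struct {
	Time     time.Time       `json:"time"`
	Action   string          `json:"action"` // e.g. added / modified / deleted
	Resource json.RawMessage `json:"resource"`
}

// writeStream appends each observation to w as a JSON object (--to-json).
func writeStream(w io.Writer, obs <-chan jsonObservation) error {
	enc := json.NewEncoder(w)
	for o := range obs {
		if err := enc.Encode(o); err != nil {
			return err
		}
	}
	return nil
}

// readStream replays observations from r instead of a cluster (--from-json).
func readStream(r io.Reader, out chan<- jsonObservation) error {
	defer close(out)
	dec := json.NewDecoder(r)
	for {
		var o jsonObservation
		if err := dec.Decode(&o); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		out <- o
	}
}
```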

  • run-resourcewatch: Add a gitstorage benchmark

This adds a simple git writing benchmark test and some captured data from an AWS cluster installation.

Note that this adds a 14M gzip-compressed json file to the repo as test data.
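For illustration only, a self-contained benchmark of roughly this shape, committing files to a throwaway repository via plain git commands; the PR's real benchmark replays the captured AWS data instead.

```go
package gitstorage_test

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"testing"
)

// BenchmarkCommit measures how quickly single-file commits can be made. It is
// a shape sketch only, not the benchmark added by this PR.
func BenchmarkCommit(b *testing.B) {
	dir := b.TempDir()
	run := func(args ...string) {
		cmd := exec.Command("git", args...)
		cmd.Dir = dir
		if out, err := cmd.CombinedOutput(); err != nil {
			b.Fatalf("git %v: %v\n%s", args, err, out)
		}
	}
	run("init", "-q")
	run("config", "user.email", "bench@example.com")
	run("config", "user.name", "bench")

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		path := filepath.Join(dir, "resource.yaml")
		if err := os.WriteFile(path, []byte(fmt.Sprintf("observation %d\n", i)), 0o644); err != nil {
			b.Fatal(err)
		}
		run("add", "resource.yaml")
		run("commit", "-q", "-m", fmt.Sprintf("observation %d", i))
	}
}
```

Something like `go test -bench=. -benchtime=30s` drives a benchmark of this kind for the 30-second runs mentioned below.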

  • run-resourcewatch: Remove goroutines from gitstorage
  • run-resourcewatch: Fix delete of non-observed resource
  • run-resourcewatch: Don't run git commands with bash
  • run-resourcewatch: Remove command polling

Together, the above changes yielded more than a 10x performance improvement when writing to git, as measured by the benchmark test run for 30 seconds, at least on my workstation. The majority of the gains come from the first commit, which completely serialises gitstorage; that change alone yielded a 7x improvement.

An AWS cluster installation takes approximately 30-40 minutes, and, on my workstation at least, run-resourcewatch can write the full dataset from json to git in 7 minutes. I believe it should be able to keep up without requiring a post-processing step.

  • run-resourcewatch: Use observation timestamp as commit date

This change uses the time at which the observation was made as the commit date, rather than the time of the commit. This means that commit dates are still potentially meaningful even if we had some buffering due to slow writes, or even if we wrote to json and post-processed the output into a git repo.
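A minimal sketch of how that can be done when git is driven via os/exec, using git's GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables; the `commitAt` helper and its arguments are illustrative assumptions.

```go
package resourcewatch

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// commitAt records a commit whose author and committer dates are the time the
// observation was made, not the time the commit happens to be written.
func commitAt(repoDir, message string, observedAt time.Time) error {
	stamp := observedAt.Format(time.RFC3339) // git accepts ISO 8601 dates
	cmd := exec.Command("git", "commit", "-q", "--allow-empty", "-m", message)
	cmd.Dir = repoDir
	cmd.Env = append(os.Environ(),
		"GIT_AUTHOR_DATE="+stamp,
		"GIT_COMMITTER_DATE="+stamp,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("git commit: %w: %s", err, out)
	}
	return nil
}
```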

  • run-resourcewatch: Disable git automatic gc
  • run-resourcewatch: Re-add a retry loop for git commands

Together these resolve the issue of git commands failing due to 'index.lock'. It seems this was caused when a git command triggered automatic garbage collection, which leaves a background process running that holds a lock, causing subsequent commands to fail until the gc completes. We now disable auto gc during collection, and instead run a single gc before exiting.

I re-added a retry loop only because I thought I saw an error even after disabling auto gc. However, I have not been able to reproduce this after many attempts, so this may have been a hallucination.

In general, we no longer expect local git commands to fail.
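A rough sketch of that gc handling, again assuming git is driven via os/exec; `runGit` and the surrounding helpers are illustrative only.

```go
package resourcewatch

import (
	"fmt"
	"os/exec"
)

func runGit(repoDir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = repoDir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("git %v: %w: %s", args, err, out)
	}
	return nil
}

// disableAutoGC stops routine git commands from spawning a background gc whose
// lock makes subsequent commands fail (the 'index.lock' errors described above).
func disableAutoGC(repoDir string) error {
	return runGit(repoDir, "config", "gc.auto", "0")
}

// finalGC packs the repository once, after collection has finished.
func finalGC(repoDir string) error {
	return runGit(repoDir, "gc")
}
```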

@openshift-ci openshift-ci bot requested review from p0lyn0mial and sjenning July 4, 2025 22:00
@mdbooth
Contributor Author

mdbooth commented Jul 4, 2025

/cc @deads2k

@openshift-ci openshift-ci bot requested a review from deads2k July 4, 2025 22:01
@mdbooth mdbooth changed the title from "Improve performance of run-resourcewatch" to "NO-JIRA: Improve performance of run-resourcewatch" Jul 7, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 7, 2025
@openshift-ci-robot

@mdbooth: This pull request explicitly references no jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mdbooth
Contributor Author

mdbooth commented Jul 7, 2025

I can't reproduce those lint errors locally, and they're not obviously related to this PR. Related to the Go bump, perhaps? Does anybody know what's going on here?

@openshift-trt

openshift-trt bot commented Jul 7, 2025

Job Failure Risk Analysis for sha: f9867d9

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-azure-ovn-upgrade IncompleteTests
Tests for this run (0) are below the historical average (2506): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-local-gateway IncompleteTests
Tests for this run (20) are below the historical average (1671): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-openstack-serial IncompleteTests
Tests for this run (101) are below the historical average (1096): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@mdbooth
Contributor Author

mdbooth commented Jul 7, 2025

Bumping go to 1.24 separately: #29965

@mdbooth
Contributor Author

mdbooth commented Jul 8, 2025

The Go 1.24 bump looks like a can of worms. I've updated the benchmark test to use only 1.23 features.

@mdbooth
Contributor Author

mdbooth commented Jul 8, 2025

Results of running the gitstorage benchmark with -benchtime=30s on my workstation (iterations, then ns/op):

Before optimisation:

     358         223620723 ns/op

After optimisation:

    3026          15785690 ns/op

@mdbooth mdbooth force-pushed the resourcewatch branch 2 times, most recently from 25f0996 to d0124f1 July 8, 2025 14:37
@openshift-trt

openshift-trt bot commented Jul 8, 2025

Job Failure Risk Analysis for sha: d0124f1

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-gcp-disruptive Medium
[bz-Etcd] clusteroperator/etcd should not change condition/Available
Potential external regression detected for High Risk Test analysis
---
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn IncompleteTests
Tests for this run (22) are below the historical average (1641): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-local-gateway IncompleteTests
Tests for this run (22) are below the historical average (1633): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-kube-apiserver-rollout IncompleteTests
Tests for this run (12) are below the historical average (932): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-1of2 IncompleteTests
Tests for this run (22) are below the historical average (962): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6-2of2 IncompleteTests
Tests for this run (21) are below the historical average (958): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-openstack-serial IncompleteTests
Tests for this run (14) are below the historical average (1164): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@neisw
Contributor

neisw commented Jul 9, 2025

/test verify-deps

@neisw
Contributor

neisw commented Jul 9, 2025

Looks like verify-deps failures: `go.mod content is incorrect - did you run go mod tidy?`

// into a Git repository. Each change is stored as separate commit which means a full history of the
// resource lifecycle is preserved.
func NewGitStorage(path string) (*GitStorage, error) {
// If the repo does not exists, do git init
Contributor

I wasn't gonna comment on the nit, but since there is a verify-deps issue anyway: exists -> exist

Contributor Author

Looks like this typo has been around since the code was added in 2020, but I'll fix it now we've seen it 👍

Contributor

Thanks, just popped out due to getting moved around I guess. No issue if you just want to hit the go mod tidy.

mdbooth added 10 commits July 14, 2025 12:34
We replace the use of informers with an explicit list and watch. We also
implement:
* event de-duplication
* missed delete detection

These allow:
* writing to a json file instead of a git repository
* reading from a json file instead of an api-server

Includes a refactor which splits up functionality to enable writing the
benchmark.

There's no reason for any of these commands to fail locally, and if they
did there's no reason they would succeed if retried.

By default, any git command can kick off a gc process which will
continue to run after the original command completes. If this happens,
all subsequent git commands will fail until the gc completes.

run-resourcewatch should no longer race with itself, but this adds
robustness in case another process (like an impatient manual tester) is
running git commands in the same repo.
@neisw
Contributor

neisw commented Jul 14, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 14, 2025
@openshift-ci
Contributor

openshift-ci bot commented Jul 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, neisw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 14, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 6a5d795 and 2 for PR HEAD 7526f99 in total

1 similar comment
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 6a5d795 and 2 for PR HEAD 7526f99 in total

@openshift-ci
Contributor

openshift-ci bot commented Jul 15, 2025

@mdbooth: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-upgrade 7526f99 link false /test e2e-aws-ovn-upgrade
ci/prow/e2e-aws-ovn-serial-publicnet-1of2 7526f99 link false /test e2e-aws-ovn-serial-publicnet-1of2
ci/prow/e2e-hypershift-conformance 7526f99 link false /test e2e-hypershift-conformance
ci/prow/e2e-aws-ovn-cgroupsv2 7526f99 link false /test e2e-aws-ovn-cgroupsv2
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 7526f99 link false /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/e2e-aws-ovn-etcd-scaling 7526f99 link false /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-gcp-disruptive 7526f99 link false /test e2e-gcp-disruptive
ci/prow/e2e-aws-ovn 7526f99 link false /test e2e-aws-ovn
ci/prow/e2e-aws-csi 7526f99 link false /test e2e-aws-csi
ci/prow/e2e-agnostic-ovn-cmd 7526f99 link false /test e2e-agnostic-ovn-cmd
ci/prow/e2e-gcp-ovn-etcd-scaling 7526f99 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-vsphere-ovn-etcd-scaling 7526f99 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-azure-ovn-etcd-scaling 7526f99 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/okd-e2e-gcp 7526f99 link false /test okd-e2e-gcp
ci/prow/e2e-aws-ovn-single-node-upgrade 7526f99 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-single-node-serial 7526f99 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn-single-node 7526f99 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-gcp-fips-serial-1of2 7526f99 link false /test e2e-gcp-fips-serial-1of2
ci/prow/e2e-gcp-fips-serial-2of2 7526f99 link false /test e2e-gcp-fips-serial-2of2
ci/prow/e2e-aws-proxy 7526f99 link false /test e2e-aws-proxy
ci/prow/e2e-azure-ovn-upgrade 7526f99 link false /test e2e-azure-ovn-upgrade
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 7526f99 link false /test e2e-vsphere-ovn-dualstack-primaryv6
ci/prow/e2e-aws-ovn-serial-publicnet-2of2 7526f99 link false /test e2e-aws-ovn-serial-publicnet-2of2
ci/prow/e2e-aws 7526f99 link false /test e2e-aws
ci/prow/okd-scos-e2e-aws-ovn 7526f99 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-disruptive 7526f99 link false /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt

openshift-trt bot commented Jul 15, 2025

Job Failure Risk Analysis for sha: 7526f99

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws IncompleteTests
Tests for this run (103) are below the historical average (2464): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-disruptive IncompleteTests
Tests for this run (103) are below the historical average (1163): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn IncompleteTests
Tests for this run (103) are below the historical average (2347): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-edge-zones IncompleteTests
Tests for this run (23) are below the historical average (2570): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-etcd-scaling IncompleteTests
Tests for this run (102) are below the historical average (1454): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-serial IncompleteTests
Tests for this run (103) are below the historical average (1407): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-proxy IncompleteTests
Tests for this run (104) are below the historical average (2546): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-azure-ovn-etcd-scaling Low
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
This test has passed 70.87% of 3556 runs on release 4.20 [Overall] in the last week.

Open Bugs
[CI] e2e-openstack-ovn-etcd-scaling job permanent fails at many openshift-test tests
pull-ci-openshift-origin-main-e2e-vsphere-ovn-etcd-scaling Low
[sig-api-machinery] disruption/cache-openshift-api apiserver/openshift-apiserver connection/new should be available throughout the test
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:vsphere SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[sig-api-machinery] disruption/openshift-api apiserver/openshift-apiserver connection/new should be available throughout the test
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:vsphere SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[sig-api-machinery] disruption/cache-oauth-api apiserver/oauth-apiserver connection/new should be available throughout the test
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:vsphere SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[sig-api-machinery] disruption/kube-api apiserver/kube-apiserver connection/new should be available throughout the test
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:vsphere SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
Showing 4 of 8 test results

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 6a5d795 and 2 for PR HEAD 7526f99 in total

@openshift-merge-bot openshift-merge-bot bot merged commit a048b28 into openshift:main Jul 16, 2025
32 of 58 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.20.0-202507161216.p0.ga048b28.assembly.stream.el9.
All builds following this will include this PR.
