- 
                Notifications
    You must be signed in to change notification settings 
- Fork 4.8k
NO-JIRA: Improve performance of run-resourcewatch #29955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /cc @deads2k | 
| @mdbooth: This pull request explicitly references no jira issue. In response to this: 
 Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. | 
| I can't reproduce those lint errors locally, and they're not obviously related to this PR. Related to the Go bump, perhaps? Does anybody know what's going on here? | 
| Job Failure Risk Analysis for sha: f9867d9 
 | 
| Bumping go to 1.24 separately: #29965 | 
| The Go 1.24 bump looks like a can of worms. I've updated the benchmark test to use only 1.23 features. | 
| Results of running the gitstore benchmark with  Before optimisation: After optimisation:  | 
25f0996    to
    d0124f1      
    Compare
  
    | Job Failure Risk Analysis for sha: d0124f1 
 | 
| /test verify-deps | 
| Looks like verify-deps failures  | 
        
          
                pkg/resourcewatch/git/git_store.go
              
                Outdated
          
        
      | // into a Git repository. Each change is stored as separate commit which means a full history of the | ||
| // resource lifecycle is preserved. | ||
| func NewGitStorage(path string) (*GitStorage, error) { | ||
| // If the repo does not exists, do git init | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't gonna comment on the nit but since there is a verify-deps issue exists -> exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this typo has been around since the code was added in 2020, but I'll fix it now we've seen it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just popped out due to getting moved around I guess.  No issue if you just want to hit the go mod tidy.
We replace the use of informers with an explicit list and watch. We also implement: * event de-duplication * missed delete detection
These allow: * writing to a json file instead of a git repository * reading from a json file instead of an api-server
Includes a refactor which splits up functionality to enable writing the benchmark.
There's no reason for any of these commands to fail locally, and if they did there's no reason they would succeed if retried.
By default, any git command can kick off a gc process which will continue to run after the original command completes. If this happens, all subsequent git commands will fail until the gc completes.
run-resourcewatch should no longer race with itself, but this adds robustness in case another process (like an impatient manual tester) is running git commands in the same repo.
| /lgtm | 
| [APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mdbooth, neisw The full list of commands accepted by this bot can be found here. The pull request process is described here 
Needs approval from an approver in each of these files:
 
 Approvers can indicate their approval by writing  | 
    
      
        1 similar comment
      
    
  
    | @mdbooth: The following tests failed, say  
 Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. | 
| Job Failure Risk Analysis for sha: 7526f99 
 | 
a048b28
      into
      
  
    openshift:main
  
    | [ART PR BUILD NOTIFIER] Distgit: openshift-enterprise-tests | 
This makes a number of substantial changes to run-resourcewatch, but does not change the output format, or the behaviour of any existing commands.
This re-implements how we observe resources and the write loop. We no longer use informers, but instead use explicit list and watch. We also de-duplicate resources and reconstruct missed deletes.
Each resource now has its own goroutine, but all resource observations are pushed to a single channel. This means that writes to the git repository are now serialised. This is a natural fit, as git does not support concurrent writers in any case.
We buffer between resource observations and writing to git by making the resource channel very large. This means that any write backlog will be buffered in memory. However, with the performance improvements to git writing in this PR it seems to be able to keep up. Hopefully this will not result in excessive memory usage in practise.
These are additional, mutually exclusive command line options.
--to-json causes run-resourcewatch to write resource observations as a stream of json objects to a single file
--from-json causes run-resourcewatch to read resource observations from a file instead of from a cluster.
These would allow the git repository to be created as a post-processing step if required. Hopefully this will not be required, but these options have also been invaluable in testing.
This adds a simple git writing benchmark test and some captured data from an AWS cluster installation.
Note that this adds a 14M gzip compressed json file to the repo as test data.
Together the above changes yielded more than a 10x performance improvement when writing to git as measured by the benchmark test executed for 30 seconds, at least on my workstation. The majority of the performance gains come from the first commit which completely serialises gitstorage. This yielded a 7x performance improvement on its own.
An AWS cluster installation takes approximately 30-40 minutes, and on my workstation at least run-resourcewatch can write the full dataset from json to git in 7 minutes. I believe it should be able to keep up without requiring a post-processing step.
This change uses the time at which the observation was made as the commit date rather than the time of the commit. This means that commit date are still potentially meaningful even if we had some buffering due to slow writes, or even if we wrote to json and post-processed the output to a git repo.
Together these resolve the issue of git commands failing due to 'index.lock'. It seems that this was caused when a git command triggered automatic garbage collection, which leaves a background process running holding a lock, causing subsequent commands to fail until the gc completes. We now disable auto gc during collection, and instead run a single gc before exiting.
I re-added a retry loop only because I thought I saw an error even after disabling auto gc. However I have not been able to reproduce this after many attempts, so this may have been a hallucination.
In general, we no longer expect local git commands to fail.