fix: osv data source memory consumption #4956
base: main
Conversation
This reverts commit 28eb7ac.
@terriko I believe all major memory fixes have been implemented in this PR for the OSV data source. cve-bin-tool now consumes around 3 GB at peak during database updates with all data sources enabled. Additionally, I think we could improve memory performance in other data sources that read content from disk by using generators rather than keeping the content of all files in RAM (see the sketch below). Unlike OSV, those sources wouldn't need a full rewrite, so we could potentially reduce memory consumption even further with small changes.
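A minimal sketch of the generator idea, purely illustrative and not code from this PR: parse one downloaded JSON file at a time instead of loading every file into a list, so only a single file's content lives in memory per iteration.

```python
import json
from collections.abc import Iterator
from pathlib import Path


def iter_records(directory: Path) -> Iterator[dict]:
    """Yield parsed records one file at a time instead of building a list."""
    for path in directory.glob("*.json"):
        with path.open(encoding="utf-8") as f:
            # Only this one file's content is held in memory at a time.
            yield json.load(f)
```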
This is very exciting, thank you for working on it. Getting rid of gsutil entirely would be amazing; it has a really large effect on our dependencies and has been an ongoing hassle for me as a result. Of course, I have no idea whether the new library is just as bad. I've enabled tests to run, but a quick heads up: because we're changing dependencies, I need to run the licenses through legal before this can be merged, so it's likely to languish for a while until I get around to doing that paperwork.
Hi @fil1n, this is indeed really interesting, and I would like to test it.
Hi @ffontaine, I'm hoping to resolve the conflicts tomorrow. As for your second question: I was just looking to get some feedback on my code, so there might still be a few small changes coming, and I'll definitely rename the file.
Sorry, I was out of the country and had some computer issues. I've approved the tests to run now!
Tests are failing on: I'm not able to reproduce it locally, and this error seems unrelated to this PR, so I'll update the branch.
@ffontaine Seems like the issue is still around. I will check it out this weekend.
@ffontaine I think I've found the reason: gsutil depends on httplib2, but the new library does not, so httplib2 is no longer installed. I've updated requirements.txt.
@terriko Could you please approve the tests to run again? |
```python
product = package.get("name")
vendor = "unknown"  # OSV Schema does not provide vendor names for packages
```
```python
if product.startswith("github.com/"):
```
Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High) on `github.com/`
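One way to address this class of alert, sketched here as an assumption rather than the PR's actual fix: since `product` holds a Go module path such as `github.com/owner/repo`, comparing the first path component exactly avoids the raw prefix/substring pattern that CodeQL's sanitization query flags. The helper name is hypothetical.

```python
def is_github_module(product: str) -> bool:
    # Compare the leading path component exactly instead of using a raw
    # prefix check, the pattern the CodeQL query objects to.
    host, _, _ = product.partition("/")
    return host == "github.com"
```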
@ffontaine https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-modified.meta is blocked by Cloudflare, so I think we can consider testing successful.
@ffontaine Do you need any help with testing?
alex-ter left a comment:
Here's an initial pass without checking the logic itself just yet. I plan to review the logic in detail and run some tests as the next step, though it may take a bit for me to find the time.
```python
# Copyright (C) 2022 Intel Corporation
# SPDX-License-Identifier: GPL-3.0-or-later
```
Don't remove these lines. There's a lot of new code here, and I guess you've simply replaced the old OSV file with the new one you originally had in the PR, which didn't have this header. But there seems to be at least some common code (e.g., the piece starting with `# Ensure the CVSS vector is valid`), so (a) the old copyright is still valid and (b) the SPDX identifier needs to be present in any case.
You can of course add your own copyright line if you wish, but the golden rule is to avoid removing the old ones.
```python
blobs = self._client.list_blobs(self.bucket_name)
for blob in blobs:
    if blob.name.endswith("all.zip"):
```
This deviates from the previous approach based on the ecosystems.txt file, and it looks somewhat wasteful, since it downloads all blob names and then filters most of them out. Why?
The OSV official docs on ecosystem naming suggest that ecosystems.txt is the way to go. It will automatically take care of the likes of Debian:10 being filtered out on lines 59+ (see the sketch below).
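A hedged sketch of the ecosystems.txt approach the comment refers to; the URL is taken from the OSV docs and should be treated as an assumption here, as should the function name.

```python
import requests

# Index of ecosystem names published alongside the OSV data (assumed URL).
ECOSYSTEMS_URL = "https://osv-vulnerabilities.storage.googleapis.com/ecosystems.txt"


def list_base_ecosystems() -> list[str]:
    text = requests.get(ECOSYSTEMS_URL, timeout=30).text
    # Entries such as "Debian:10" are release-specific variants of a base
    # ecosystem ("Debian"); keeping only the base name filters them out.
    return sorted({line.split(":")[0].strip() for line in text.splitlines() if line.strip()})
```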
```python
    LOGGER.warning(f"OSV: Error while extracting {file}.")
finally:
    os.remove(file)
    await asyncio.sleep(0.5)
```
What is this for? Worth adding a comment to make sure everyone knows and is able to reason about it in 6 months or a year from now.
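One way to satisfy this request, with the rationale stated purely as an assumption (the thread does not explain the original intent), would be a comment along these lines:

```python
# Assumed rationale, to be confirmed by the author: briefly yield to the
# event loop between file removals so other update tasks are not starved;
# the 0.5 s value is inherited from the original code.
await asyncio.sleep(0.5)
```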
| "unknown" # OSV Schema does not provide vendor names for packages | ||
| ) | ||
| severity: dict | None # type: ignore | ||
| if severity and "CVSS_V3" in [x["type"] for x in severity]: |
Apparently there are sometimes no `type` keys in the data; check out #5240 and port the logic added there (it will need to be merged sooner or later anyway). A sketch of such defensive parsing follows.
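A minimal sketch of the more defensive parsing being requested, assuming `severity` is the OSV severity array; entries without a `type` key are skipped instead of raising KeyError. This is an illustration, not the exact logic from #5240.

```python
vector = None
if severity:
    for entry in severity:
        # .get() tolerates entries that carry no "type" key at all.
        if entry.get("type") == "CVSS_V3":
            vector = entry.get("score")
            break
```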
| "ID": cve_id, | ||
| "severity": severity if vector is not None else "unknown", | ||
| "description": content.get("summary", "unknown"), | ||
| "score": score if vector is not None else "unknown", # noqa |
(nit) Why `noqa`? An explanation would be helpful, to make sure this is not something the project coding guidelines actually require. The same applies to other similar ones.
```diff
 if "OSV" not in disabled_sources:
-    source_osv = osv_source.OSV_Source(incremental_update=incremental_db_update)
+    source_osv = osv_source.OSVDataSource()
```
(nit) The new name does not follow the format all the other source classes use. I guess this is an artifact of it initially existing alongside the original implementation, but now that there's only one, I see no reason for deviating from the convention.
```
python,urllib3
google,gsutil
google,google-cloud-storage
jcgregorio_not_in_db,httplib2
```
This should be either
`httplib2_project,httplib2`
or
`httplib2,httplib2`
as there are CVEs against it already and corresponding vendor entries (see #1403 for details). However, the fun part is that it's 50/50: two places call it httplib2_project (cvedetails.com, NVD) and another two call it httplib2 (EUVD, cve.org), so pick whichever you prefer.
Fixes #4710.
Previously, the get_ecosystem_incremental method used gsutil to fetch information about new files in each ecosystem. This method ran concurrently for every item in the ecosystem list (around 426 items) while other data sources were also updating, which triggered the OOM killer. gsutil has therefore been replaced with the google-cloud-storage library, which allows iterating over blob metadata without fetching everything at once; a sketch of the approach follows.
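A minimal sketch of the streaming approach described above, assuming anonymous access to the public osv-vulnerabilities bucket; the function name is illustrative, not the PR's actual code.

```python
from google.cloud import storage


def iter_osv_archives():
    # Anonymous client: the OSV bucket is publicly readable.
    client = storage.Client.create_anonymous_client()
    # list_blobs returns a lazy iterator, so blob metadata is paged in on
    # demand instead of being materialized in memory all at once.
    for blob in client.list_blobs("osv-vulnerabilities"):
        if blob.name.endswith("all.zip"):
            yield blob


for blob in iter_osv_archives():
    # Each archive can then be downloaded and processed individually, e.g.:
    # blob.download_to_filename(f"/tmp/{blob.name.replace('/', '_')}")
    ...
```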