Skip to content

Conversation

@fil1n
Copy link
Contributor

@fil1n fil1n commented Mar 19, 2025

Fixes #4710,

Currently, the get_ecosystem_incremental method uses gsutil to fetch info about new files in the ecosystem. However, this method is executed concurrently for each item in the ecosystem list (there are around 426 items), while other data sources are also updating, which led to the launch of the OOM Killer. This is why gsutil was replaced with the google-cloud-storage library, which allows iterating over the data without fetching it fully.

@fil1n
Copy link
Contributor Author

fil1n commented Mar 23, 2025

@terriko I believe all major memory fixes have been implemented in this PR for the OSV data source. Now the cve-bin-tool consumes around 3 GB at peak during database updates with all data sources enabled. Additionally, I think we could improve memory performance in other data sources that read content from disk by using generators, rather than keeping the content of all files in RAM. There's no need to fully rewrite sources like OSV, so we could potentially reduce memory consumption even further with small changes.

@terriko
Copy link
Contributor

terriko commented Mar 24, 2025

This is very exciting, thank you for working on it. getting rid of gsutil entirely would be amazing; it has a really large effect on our dependencies and causes me kind of an ongoing hassle as a result. Of course, I have no idea if the new library is just as bad.

I've enabled tests to run, but quick heads up that because we're changing dependencies, I need to run licenses through legal before this can be merged, so it's likely to languish for a while until I get around to doing that paperwork.

@terriko terriko added the dependencies Pull requests that update a dependency file label Mar 24, 2025
@ffontaine
Copy link
Contributor

Hi @fil1n, this is indeed really interesting, and I would like to test it.
Could you rebase this PR to fix the conflict in test_source_osv.py?
Moreover, why a new_osv_source.py is created instead of updating existing osv_source.py?

@fil1n
Copy link
Contributor Author

fil1n commented Jul 20, 2025

Hi @ffontaine , I’m hoping to resolve the conflicts tomorrow. As for your second question – I was just looking to get some feedback on my code, so there might still be a few small changes coming, and I’ll definitely rename the file.

@terriko
Copy link
Contributor

terriko commented Jul 31, 2025

Sorry, was out of the country and had some computer issues. I've approved the tests to run now!

@ffontaine
Copy link
Contributor

Tests are failing on:

FAILED test/test_package_list_parser.py::TestPackageListParser::test_parse_list_requirements - AssertionError: assert 1 == 2
 +  where 1 = len(defaultdict(<class 'dict'>, {ProductInfo(vendor='python*', product='requests', version='2.25.1', purl='/usr/local/lib/python/requests'): {'default': {'remarks': <Remarks.NewFound: 1>, 'comments': '', 'severity': ''}, 'paths': {''}}}))

I'm not able to reproduce it locally and this error seems unrelated to this PR, so I'll update the branch.

@fil1n
Copy link
Contributor Author

fil1n commented Aug 14, 2025

@ffontaine seems like the issue is still around. I will check it out this weekend.

@fil1n
Copy link
Contributor Author

fil1n commented Aug 29, 2025

@ffontaine I guess I have found the reason. gsutil depends on httplib2, however, new lib does not, hence httplib2 is not installed. I've updated the requirements.txt.

@fil1n
Copy link
Contributor Author

fil1n commented Aug 29, 2025

@terriko Could you please approve the tests to run again?

product = package.get("name")
vendor = "unknown" # OSV Schema does not provide vendor names for packages

if product.startswith("github.com/"):

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string
github.com/
may be at an arbitrary position in the sanitized URL.
@fil1n
Copy link
Contributor Author

fil1n commented Aug 29, 2025

@ffontaine https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-modified.meta is blocked by cloudflare, so I think we can consider testing as successful.

@fil1n
Copy link
Contributor Author

fil1n commented Sep 6, 2025

@ffontaine Do you need any help with testing?

Copy link
Contributor

@alex-ter alex-ter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an initial pass without checking the logic itself just yet. I plan to review the logic in detail and run some tests as the next step, though it may take a bit for me to find the time.

Comment on lines -1 to -2
# Copyright (C) 2022 Intel Corporation
# SPDX-License-Identifier: GPL-3.0-or-later
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't remove these lines. It looks like there's a lot of new code and I guess you've simply replaced the old OSV file with the new one that you originally had in the PR, and the latter file didn't have this header, but there seems to be at least some common code (e.g., piece starting with # Ensure the CVSS vector is valid), so (a) the old copyright is still valid and (b) the SPDX identifier needs to be present in any case.

You can of course add your own copyright line if you wish, but the golden rule is to avoid removing the old ones.

Comment on lines +51 to +53
blobs = self._client.list_blobs(self.bucket_name)
for blob in blobs:
if blob.name.endswith("all.zip"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deviates from the previously used approach using the ecosystems.txt file and looks somewhat wasteful for it downloads all blob names and then filters out only some. Why?

The OSV official docs on ecosystem naming seem to suggest the ecosystems.txt is the way to go. It will automatically take care of the likes of Debian:10 being filtered out on lines 59+.

LOGGER.warning(f"OSV: Error while extracting {file}.")
finally:
os.remove(file)
await asyncio.sleep(0.5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? Worth adding a comment to make sure everyone knows and is able to reason about it in 6 months or a year from now.

"unknown" # OSV Schema does not provide vendor names for packages
)
severity: dict | None # type: ignore
if severity and "CVSS_V3" in [x["type"] for x in severity]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently sometimes there're no type keys in the data, check out #5240 and port the logic added there (will need to be merged sooner or later anyway).

"ID": cve_id,
"severity": severity if vector is not None else "unknown",
"description": content.get("summary", "unknown"),
"score": score if vector is not None else "unknown", # noqa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Why noqa? An explanation would be helpful to make sure this is not something the project coding guidelines actually require. Same applies to other similar ones.


if "OSV" not in disabled_sources:
source_osv = osv_source.OSV_Source(incremental_update=incremental_db_update)
source_osv = osv_source.OSVDataSource()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) The new name does not follow the format all the other source classes use. I guess this is an artifact of it initially existing alongside the original implementation, but now that there's only one, I see no reason for deviating from the convention.

python,urllib3
google,gsutil
google,google-cloud-storage
jcgregorio_not_in_db,httplib2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be either

httplib2_project,httplib2

or

httplib2,httplib2

as there are CVEs against it already and corresponding vendor entries (see #1403 for details), however the fun part is that it's 50/50 - two places call it httplib2_project (cvedetails.com, NVD) and another two call it httplib2 (EUVD, cve.org), so I guess pick either one you prefer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting CI dependencies Pull requests that update a dependency file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: out of memory during OSV database load

4 participants