fix: osv data source memory consumption #4956

fil1n · 2025-03-19T21:42:25Z

Currently, the get_ecosystem_incremental method uses gsutil to fetch information about new files in an ecosystem. This method runs concurrently for every item in the ecosystem list (around 426 items) while other data sources are also updating, which drove memory usage high enough to trigger the OOM killer. This PR therefore replaces gsutil with the google-cloud-storage library, which allows iterating over the data without fetching it all at once.
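To illustrate the streaming idea (a rough sketch, not the code in this PR): `Client.list_blobs` in google-cloud-storage returns a paginated iterator that fetches results on demand, so a generator can filter blobs as they arrive instead of holding the whole listing in memory. The helper name `iter_updated_blob_names`, the `since` cutoff, and the sample data are hypothetical; the function only assumes each item exposes `name` and `updated`, as `Blob` objects do.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Iterable, Iterator, Optional, Protocol


class BlobLike(Protocol):
    """Minimal shape of a storage blob: only the attributes used below."""

    name: str
    updated: Optional[datetime]


def iter_updated_blob_names(
    blobs: Iterable[BlobLike], since: datetime
) -> Iterator[str]:
    """Yield names of blobs updated after `since`, one at a time.

    Hypothetical helper: nothing is collected into a list here, so memory
    use does not grow with the number of blobs in the listing.
    """
    for blob in blobs:
        if blob.updated is not None and blob.updated > since:
            yield blob.name


# Assumed real-world usage (needs google-cloud-storage and network access):
#   from google.cloud import storage
#   client = storage.Client.create_anonymous_client()
#   blobs = client.list_blobs("osv-vulnerabilities", prefix="PyPI/")
#   for name in iter_updated_blob_names(blobs, since=last_update): ...

# Offline demonstration with stand-in objects instead of real blobs.
utc = timezone.utc
cutoff = datetime(2025, 3, 1, tzinfo=utc)
sample = [
    SimpleNamespace(name="PyPI/old.json", updated=datetime(2025, 2, 1, tzinfo=utc)),
    SimpleNamespace(name="PyPI/new.json", updated=datetime(2025, 3, 10, tzinfo=utc)),
    SimpleNamespace(name="PyPI/unknown.json", updated=None),
]
recent = list(iter_updated_blob_names(sample, cutoff))
```

The bucket name and prefix in the commented usage are assumptions about the OSV export layout, not values taken from this PR.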

This reverts commit 28eb7ac.

cve_bin_tool/data_sources/new_osv_source.py

fil1n · 2025-03-23T20:20:30Z

@terriko I believe all major memory fixes for the OSV data source have been implemented in this PR. cve-bin-tool now peaks at around 3 GB during database updates with all data sources enabled. Additionally, I think we could improve memory performance in other data sources that read content from disk by using generators rather than keeping the content of all files in RAM. There's no need to rewrite those sources completely, as was done for OSV, so small changes could potentially reduce memory consumption even further.
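A minimal sketch of the generator approach mentioned above, assuming a data source that keeps its records as one JSON document per file in a cache directory. The function name `iter_json_records` and the file layout are hypothetical and not taken from any existing cve-bin-tool source.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_json_records(directory: Path) -> Iterator[dict]:
    """Yield one parsed JSON document at a time from `directory`.

    Only a single file's content is held in memory at any moment,
    unlike building a list of every parsed file up front.
    """
    for path in sorted(directory.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


# Small demonstration: write two files, then stream them back.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a", "b"):
        Path(tmp, f"{name}.json").write_text(
            json.dumps({"id": name}), encoding="utf-8"
        )
    ids = [record["id"] for record in iter_json_records(Path(tmp))]
```

A consumer that inserts each record into the database as it is yielded never needs more than one file's worth of data in RAM.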

terriko · 2025-03-24T18:53:02Z

This is very exciting, thank you for working on it. Getting rid of gsutil entirely would be amazing; it has a really large effect on our dependencies and causes me kind of an ongoing hassle as a result. Of course, I have no idea if the new library is just as bad.

I've enabled tests to run, but quick heads up that because we're changing dependencies, I need to run licenses through legal before this can be merged, so it's likely to languish for a while until I get around to doing that paperwork.

ffontaine · 2025-07-17T19:36:51Z

Hi @fil1n, this is indeed really interesting, and I would like to test it.
Could you rebase this PR to fix the conflict in test_source_osv.py?
Also, why was a new new_osv_source.py created instead of updating the existing osv_source.py?

fil1n · 2025-07-20T18:08:11Z

Hi @ffontaine, I’m hoping to resolve the conflicts tomorrow. As for your second question – I was just looking to get some feedback on my code, so there might still be a few small changes coming, and I’ll definitely rename the file.

terriko · 2025-07-31T23:35:41Z

Sorry, was out of the country and had some computer issues. I've approved the tests to run now!

ffontaine · 2025-08-14T08:47:34Z

Tests are failing on:

FAILED test/test_package_list_parser.py::TestPackageListParser::test_parse_list_requirements - AssertionError: assert 1 == 2
 +  where 1 = len(defaultdict(<class 'dict'>, {ProductInfo(vendor='python*', product='requests', version='2.25.1', purl='/usr/local/lib/python/requests'): {'default': {'remarks': <Remarks.NewFound: 1>, 'comments': '', 'severity': ''}, 'paths': {''}}}))

I'm not able to reproduce it locally and this error seems unrelated to this PR, so I'll update the branch.

fil1n · 2025-08-14T09:10:20Z

@ffontaine seems like the issue is still around. I will check it out this weekend.

fil1n · 2025-08-29T14:03:57Z

@ffontaine I think I have found the cause. gsutil depends on httplib2 but the new library does not, so httplib2 was no longer installed. I've updated requirements.txt.

fil1n · 2025-08-29T17:02:18Z

@terriko Could you please approve the tests to run again?

cve_bin_tool/data_sources/osv_source.py

+            product = package.get("name")
+            vendor = "unknown"  # OSV Schema does not provide vendor names for packages
+
+            if product.startswith("github.com/"):


fil1n · 2025-08-29T18:33:34Z

@ffontaine https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-modified.meta is blocked by Cloudflare, so I think we can consider the tests successful.

fil1n · 2025-09-06T12:46:53Z

@ffontaine Do you need any help with testing?

alex-ter

Here's an initial pass without checking the logic itself just yet. I plan to review the logic in detail and run some tests as the next step, though it may take a bit for me to find the time.

alex-ter · 2025-11-10T19:00:03Z

cve_bin_tool/data_sources/osv_source.py

-# Copyright (C) 2022 Intel Corporation
-# SPDX-License-Identifier: GPL-3.0-or-later


Don't remove these lines. It looks like there's a lot of new code, and I guess you've simply replaced the old OSV file with the new one you originally had in the PR, which didn't have this header. However, there seems to be at least some common code (e.g., the piece starting with # Ensure the CVSS vector is valid), so (a) the old copyright is still valid and (b) the SPDX identifier needs to be present in any case.

You can of course add your own copyright line if you wish, but the golden rule is to avoid removing the old ones.

alex-ter · 2025-11-10T19:10:32Z

cve_bin_tool/data_sources/osv_source.py

+        blobs = self._client.list_blobs(self.bucket_name)
+        for blob in blobs:
+            if blob.name.endswith("all.zip"):


This deviates from the previous approach, which used the ecosystems.txt file, and looks somewhat wasteful because it lists every blob name and then keeps only some of them. Why?

The official OSV docs on ecosystem naming seem to suggest that ecosystems.txt is the way to go. It would also automatically take care of entries like Debian:10, which are currently filtered out on lines 59+.
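The ecosystems.txt approach could look roughly like this. It is a sketch under two assumptions: that the file holds one ecosystem name per line, and that each ecosystem's archive is stored at `<ecosystem>/all.zip`. The function name `iter_all_zip_paths` and the sample listing are made up for illustration.

```python
from typing import Iterator


def iter_all_zip_paths(ecosystems_txt: str) -> Iterator[str]:
    """Yield the `<ecosystem>/all.zip` object path for each listed ecosystem.

    `ecosystems_txt` is the decoded content of ecosystems.txt, assumed
    to contain one ecosystem name per line.
    """
    for line in ecosystems_txt.splitlines():
        ecosystem = line.strip()
        if ecosystem:  # skip blank lines
            yield f"{ecosystem}/all.zip"


# Demonstration on an invented listing, not the real file content.
sample_listing = "PyPI\nnpm\n\nGo\n"
paths = list(iter_all_zip_paths(sample_listing))
```

With this shape, only one small text file has to be fetched before the archives themselves, instead of enumerating every object in the bucket.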

alex-ter · 2025-11-10T19:13:45Z

cve_bin_tool/data_sources/osv_source.py

+                LOGGER.warning(f"OSV: Error while extracting {file}.")
+            finally:
+                os.remove(file)
+                await asyncio.sleep(0.5)


What is this for? Worth adding a comment to make sure everyone knows and is able to reason about it in 6 months or a year from now.

alex-ter · 2025-11-10T19:16:01Z

cve_bin_tool/data_sources/osv_source.py

-                    "unknown"  # OSV Schema does not provide vendor names for packages
-                )
+        severity: dict | None  # type: ignore
+        if severity and "CVSS_V3" in [x["type"] for x in severity]:


Apparently there are sometimes no type keys in the data; check out #5240 and port the logic added there (it will need to be merged sooner or later anyway).
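For illustration only, one defensive way to write that check so entries without a type key are skipped rather than raising KeyError. This is not the logic from #5240, which is not reproduced here; the function name `find_cvss_v3_vector` is hypothetical, and the entry shape (`type` and `score` keys) follows the OSV schema's severity field.

```python
from typing import Any, Optional


def find_cvss_v3_vector(severity: Optional[list[Any]]) -> Optional[str]:
    """Return the score string of the first CVSS_V3 entry, or None.

    Entries that are not dicts, or that lack a "type" key, are skipped
    instead of raising KeyError as x["type"] would.
    """
    for entry in severity or []:
        if isinstance(entry, dict) and entry.get("type") == "CVSS_V3":
            return entry.get("score")
    return None


# Demonstration: the first entry has no "type" key and is ignored.
records = [
    {"score": "entry-without-type"},
    {"type": "CVSS_V3", "score": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"},
]
vector = find_cvss_v3_vector(records)
```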

alex-ter · 2025-11-10T19:19:50Z

cve_bin_tool/data_sources/osv_source.py

+            "ID": cve_id,
+            "severity": severity if vector is not None else "unknown",
+            "description": content.get("summary", "unknown"),
+            "score": score if vector is not None else "unknown",  # noqa


(nit) Why noqa? An explanation would be helpful to make sure this is not something the project coding guidelines actually require. Same applies to other similar ones.

alex-ter · 2025-11-10T19:26:41Z

cve_bin_tool/cli.py


    if "OSV" not in disabled_sources:
-        source_osv = osv_source.OSV_Source(incremental_update=incremental_db_update)
+        source_osv = osv_source.OSVDataSource()


(nit) The new name does not follow the format all the other source classes use. I guess this is an artifact of it initially existing alongside the original implementation, but now that there's only one, I see no reason for deviating from the convention.

alex-ter · 2025-11-10T19:43:04Z

requirements.csv

 python,urllib3
-google,gsutil
+google,google-cloud-storage
+jcgregorio_not_in_db,httplib2


This should be either

httplib2_project,httplib2

or

httplib2,httplib2

as there are CVEs against it already and corresponding vendor entries (see #1403 for details), however the fun part is that it's 50/50 - two places call it httplib2_project (cvedetails.com, NVD) and another two call it httplib2 (EUVD, cve.org), so I guess pick either one you prefer.

fil1n added 19 commits March 12, 2025 21:29

feat: added new OSV data source class initial implementation

1a2eff3

feat: added file fetching to OSV datasource

c146c45

feat: added zip extraction for osv

0b5a52f

feat: added data formatting method

f076096

chore: added google-cloud-storage to requirements

ec2cad4

feat: replaced old OSV source

803ce56

feat: minor logging and naming improvements

f9e7b95

feat: decompressed small files in memory

d3a1e92

feat: minor code improvements

a240bda

test: adapted tests for new data source

3b403b4

refactor: moved json parsing to separate function

7aeca1b

docs: added some comments

77dd8e5

refactor: useless code line removed

38fd375

docs: docstring changed

8fc8aa7

fix: ignored None values

3f29e1b

Merge branch 'main' into new_osv_datasource

2f9b2a0

chore: updated requirements

28eb7ac

Revert "chore: updated requirements"

79a3d81

This reverts commit 28eb7ac.

chore: updated

256009e

github-advanced-security bot found potential problems Mar 20, 2025

cve_bin_tool/data_sources/new_osv_source.py: fixed

fil1n added 2 commits March 23, 2025 23:56

refactor: removed unused field

e43d745

Merge branch 'main' into new_osv_datasource

1c27571

terriko added the dependencies Pull requests that update a dependency file label Mar 24, 2025

ffontaine mentioned this pull request Jul 15, 2025

test: re-enable or replace test_update when possible #5218

Open

fil1n added 2 commits July 21, 2025 20:28

Merge branch 'main' into new_osv_datasource

eedfe4f

fix: DEFAULT_LOCATION import

afac134

terriko added the awaiting CI label Jul 31, 2025

Merge branch 'main' into new_osv_datasource

e958858

fil1n added 2 commits August 29, 2025 18:29

refactor: removed old osv data source

30028b3

fix: added httplib2 to requirements

2a69c73

github-advanced-security bot found potential problems Aug 29, 2025


alex-ter mentioned this pull request Oct 26, 2025

ImportError: sys.meta_path is None, Python is likely shutting down #5403

Open

alex-ter suggested changes Nov 10, 2025
