
Conversation

@Saswatsusmoy commented May 11, 2025

closes #1682

feat: Implement SBOM Collection and Analysis

**This PR introduces a new feature to OpenDigger for collecting and analyzing Software Bill of Materials (SBOM) data from top-ranked OpenRank projects.**

This enhancement allows OpenDigger to provide valuable insights into the dependency landscape of the open source ecosystem.

### Key Changes:

1.  **New Cron Task (`src/cron/tasks/sbomCollection.ts`):**
    *   A new daily cron task has been created to fetch SBOM data using GitHub's SBOM API endpoint (`/repos/{owner}/{repo}/dependency-graph/sbom`); a minimal sketch of this step follows the list.
    *   Initially, the task focuses on the top 500 repositories by OpenRank to gather data from the most influential projects.
    *   Collected data is stored in a new ClickHouse table.

2.  **New ClickHouse Table (`gh_repo_sbom`):**
    *   A dedicated table `gh_repo_sbom` has been designed to store the SBOM data, including the raw SBOM, repository information, update timestamps, dependency counts, package managers, and direct dependencies.

3.  **New SBOM Metrics Module (`src/metrics/sbom.ts`):**
    *   A new metrics module provides functions to analyze the collected SBOM data:
        *   `repoDependencies`: Retrieves the list of dependencies for a specific repository.
        *   `repoDependenciesByPackageManager`: Groups a repository's dependencies by their package manager.
        *   `commonDependencies`: Identifies the most frequently used dependencies across the analyzed repositories.
        *   `packageDependents`: Finds repositories that depend on a given package.
        *   `dependencyGraph`: Provides data to help visualize a repository's dependency relationships (currently direct dependencies only).

4.  **Metrics Index Update (`src/metrics/index.ts`):**
    *   The main metrics index file has been updated to export the new SBOM metric functions, making them accessible through the OpenDigger API.

5.  **Documentation Update (`README.md`):**
    *   The project's `README.md` has been updated to include information about the new SBOM analysis capabilities, outlining its benefits and features.
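
For reviewers, here is a minimal sketch of the collection step. It is illustrative only: the endpoint and the `gh_repo_sbom` table are the real ones, but the helper names and the record shape below are placeholders rather than the final code.

```typescript
// Illustrative sketch of the fetch step in src/cron/tasks/sbomCollection.ts.
// The record shape and helper names are placeholders, not the final code.
import fetch from 'node-fetch';

interface SbomRecord {
  owner: string;
  repo: string;
  rawSbom: string;        // raw SPDX document, stored as JSON text
  dependencyCount: number;
  updatedAt: string;
}

async function fetchSbom(owner: string, repo: string, token: string): Promise<SbomRecord | null> {
  // GitHub's dependency-graph SBOM endpoint returns an SPDX document.
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/dependency-graph/sbom`,
    {
      headers: {
        Accept: 'application/vnd.github+json',
        Authorization: `Bearer ${token}`,
      },
    },
  );
  if (!res.ok) return null; // e.g. 404 when the dependency graph is disabled

  const body: any = await res.json();
  const packages: any[] = body.sbom?.packages ?? [];
  return {
    owner,
    repo,
    rawSbom: JSON.stringify(body.sbom),
    // The SPDX package list includes the repository itself, hence the -1.
    dependencyCount: Math.max(packages.length - 1, 0),
    updatedAt: body.sbom?.creationInfo?.created ?? new Date().toISOString(),
  };
}

// Each non-null record is then inserted into the `gh_repo_sbom` table
// (ClickHouse insertion omitted here).
```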

### Benefits:

*   **Ecosystem Insights:** Provides a better understanding of how dependencies are interconnected within the open source world.
*   **Criticality Analysis:** Helps identify libraries and packages that are critical to a large number of projects.
*   **Trend Analysis:** Enables tracking of dependency usage patterns and trends over time.
*   **Potential Risk Assessment:** Lays the groundwork for future security vulnerability monitoring within dependencies.

### How to Test:

1.  Ensure ClickHouse is running and accessible.
2.  Configure GitHub API tokens in the local configuration.
3.  The cron task `sbomCollection` can be enabled and run immediately by modifying the `task.enable` and `task.immediate` arrays in the local configuration and running `npm run cron`.
4.  The new SBOM metrics can be tested with a script similar to the other metric tests, or via a Jupyter Notebook, by calling functions like `openDigger.repoDependencies({repo: 'owner/repo'})`; see the sketches below.
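
For step 3, the local configuration change would look roughly like the snippet below; only `task.enable` and `task.immediate` are named above, so the surrounding field names are assumptions about the config shape rather than verified code.

```typescript
// Hypothetical local config snippet; field names other than
// `task.enable` / `task.immediate` are assumed for illustration.
export default {
  github: {
    tokens: ['<your GitHub API token>'], // step 2
  },
  task: {
    enable: ['sbomCollection'],    // step 3: enable the new cron task
    immediate: ['sbomCollection'], // run it once right away via `npm run cron`
  },
};
```

And for step 4, a test script or notebook cell could exercise the new metrics like this (assuming an initialized `openDigger` client, as in the existing metric tests):

```typescript
// Illustrative step-4 check against the new SBOM metrics.
const deps = await openDigger.repoDependencies({ repo: 'X-lab2017/open-digger' });
console.log(`found ${deps.length} dependencies`);
```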

Saswatsusmoy added 2 commits May 12, 2025 02:32
closes X-lab2017#1682
feat: Implement SBOM Collection and Analysis

@frank-zsy (Contributor) commented:

Thanks for the PR, and now I see why you raised an issue about local testing.

Currently, `npm run test` runs the test scripts for statistical metrics and label data; this relies on a public read-only ClickHouse instance with a small amount of sample data in the `events` table.

But cron tasks like this one are quite hard to test using public resources, since we cannot expose a writable ClickHouse instance publicly. Is there any best practice for supporting local tests in this scenario?

@Saswatsusmoy (Author) replied:

Maybe instead of using a public resource and exposing a writable ClickHouse instance, we can create dedicated test databases that can be built in a local ClickHouse instance. This might help for cron tasks like this one.

@frank-zsy (Contributor) commented:

Actually, we have a ClickHouse Docker image and some sample datasets. But since the official ClickHouse Docker image declares the data directory as a volume, it is not feasible to use the official image to build an image with the dataset baked in, which makes the cost of setting up a local instance quite high for developers. Do you have any ideas to make this happen? For example, so that a developer can just pull and start a Docker container to get a local instance?

@Saswatsusmoy (Author) replied:

OK, so I was looking into it and found a solution, though I don't know how feasible it is.

We can create a Docker image with pre-loaded datasets, despite the volume directive, by using ClickHouse's initialization scripts feature: we create SQL scripts that load the sample dataset and place them in `docker-entrypoint-initdb.d`.

The initialization scripts run on the container's first startup, while the data directory is still empty; the data is imported on that first run and persisted in the volume thereafter. Devs can then just `docker pull` an image built with the samples to get a local instance.
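
For concreteness, the idea is roughly the sketch below; the base image tag and script names are placeholders, and whether this actually plays well with the official image's volume declaration is exactly the part that needs testing.

```dockerfile
# Sketch: bake initialization scripts (not the data itself) into an image
# derived from the official one. Scripts in /docker-entrypoint-initdb.d run
# on the first startup, while the data directory is still empty; the
# imported data then persists in the volume.
FROM clickhouse/clickhouse-server:latest
COPY create_tables.sql import_sample_data.sql /docker-entrypoint-initdb.d/
```

A developer would then only need a `docker pull` and `docker run` to get a pre-populated local instance.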

@frank-zsy (Contributor) commented:

Sure, and I recall that we actually tried this approach in the first place, a few years ago, when we started to provide sample data and a ClickHouse image. The problem is that we could not control the import speed of the ClickHouse instance, so the initialization process consumed a lot of memory in a short time. For most of the sample datasets we provided, the data import took more than 8 GB of memory, and this could not be reduced by throttling the import speed. Since many students' laptops could not spare 8 GB of memory for a container instance, we gave up and let students download and import the data themselves instead.

I cannot remember the details, so I think you can try it locally to see whether the initialization process can be done within 4 GB of memory.

@Saswatsusmoy (Author) replied:

OK. Can you open an issue for this, so that even if I don't get to a solution, someone might in the future?

Development

Successfully merging this pull request may close these issues:

[Feature] Collecting dependencies data for top GitHub projects based on GitHub API