feat: Dependency Graph Analytics #1699
base: master
closes X-lab2017#1682

feat: Implement SBOM Collection and Analysis

**This PR introduces a new feature to OpenDigger for collecting and analyzing Software Bill of Materials (SBOM) data from top-level OpenRank projects.**

This enhancement allows OpenDigger to provide valuable insights into the dependency landscape of the open source ecosystem.

### Key Changes:

1. **New Cron Task (`src/cron/tasks/sbomCollection.ts`):**
   * A new daily cron task has been created to fetch SBOM data using GitHub's SBOM API endpoint (`/repos/{owner}/{repo}/dependency-graph/sbom`).
   * Initially, the task focuses on the top 500 repositories by OpenRank to gather data from the most influential projects.
   * Collected data is stored in a new ClickHouse table.
2. **New ClickHouse Table (`gh_repo_sbom`):**
   * A dedicated table `gh_repo_sbom` has been designed to store the SBOM data, including the raw SBOM, repository information, update timestamps, dependency counts, package managers, and direct dependencies.
3. **New SBOM Metrics Module (`src/metrics/sbom.ts`):**
   * A new metrics module provides functions to analyze the collected SBOM data:
     * `repoDependencies`: Retrieves the list of dependencies for a specific repository.
     * `repoDependenciesByPackageManager`: Groups dependencies by their package manager for a repository.
     * `commonDependencies`: Identifies the most frequently used dependencies across the analyzed repositories.
     * `packageDependents`: Finds repositories that depend on a given package.
     * `dependencyGraph`: Provides data to help visualize the dependency relationships for a repository (currently direct dependencies).
4. **Metrics Index Update (`src/metrics/index.ts`):**
   * The main metrics index file has been updated to export the new SBOM metric functions, making them accessible through the OpenDigger API.
5. **Documentation Update (`README.md`):**
   * The project's `README.md` has been updated to include information about the new SBOM analysis capabilities, outlining its benefits and features.

### Benefits:

* **Ecosystem Insights:** Provides a better understanding of how dependencies are interconnected within the open source world.
* **Criticality Analysis:** Helps identify libraries and packages that are critical to a large number of projects.
* **Trend Analysis:** Enables tracking of dependency usage patterns and trends over time.
* **Potential Risk Assessment:** Lays the groundwork for future security vulnerability monitoring within dependencies.

### How to Test:

1. Ensure ClickHouse is running and accessible.
2. Configure GitHub API tokens in the local configuration.
3. The cron task `sbomCollection` can be enabled and run immediately by modifying the `task.enable` and `task.immediate` arrays in the local configuration and running `npm run cron`.
4. The new SBOM metrics can be tested using a script similar to other metric tests or via a Jupyter Notebook, calling functions like `openDigger.repoDependencies({repo: 'owner/repo'})`.
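To make the data flow described above more concrete, here is a minimal sketch of how an SBOM response could be summarized per repository and aggregated across repositories. It is not the code in this PR. It assumes the endpoint returns an SPDX document under a `sbom` key whose `packages` carry a purl in `externalRefs`. The names `SbomSummary`, `packageManagerOf`, and `summarizeSbom` are hypothetical, and this `commonDependencies` is only an in-memory analogue of the metric of the same name, which in the PR reads from ClickHouse.

```typescript
// Minimal subset of the SPDX document assumed to be returned under the
// `sbom` key by GET /repos/{owner}/{repo}/dependency-graph/sbom.
interface SpdxExternalRef {
  referenceCategory: string;
  referenceType: string;
  referenceLocator: string;
}

interface SpdxPackage {
  name: string;
  versionInfo?: string;
  externalRefs?: SpdxExternalRef[];
}

interface SbomResponse {
  sbom: { packages: SpdxPackage[] };
}

// Hypothetical summary shape; the real gh_repo_sbom columns may differ.
interface SbomSummary {
  dependencyCount: number;
  byPackageManager: Record<string, string[]>;
}

// Extract the package manager from a purl such as "pkg:npm/lodash@4.17.21".
// Packages without a purl reference are reported as "unknown".
function packageManagerOf(pkg: SpdxPackage): string {
  const purl = pkg.externalRefs?.find((r) => r.referenceType === "purl");
  const match = purl ? /^pkg:([^/]+)\//.exec(purl.referenceLocator) : null;
  return match?.[1] ?? "unknown";
}

// Count every entry in `packages` (which may include an entry for the
// repository itself) and group package names by package manager.
function summarizeSbom(res: SbomResponse): SbomSummary {
  const byPackageManager: Record<string, string[]> = {};
  for (const pkg of res.sbom.packages) {
    const pm = packageManagerOf(pkg);
    const bucket = byPackageManager[pm] ?? [];
    bucket.push(pkg.name);
    byPackageManager[pm] = bucket;
  }
  return { dependencyCount: res.sbom.packages.length, byPackageManager };
}

// Count in how many SBOMs each package name appears, most frequent first.
function commonDependencies(
  all: SbomResponse[],
  limit = 10
): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const res of all) {
    // De-duplicate names within one repository before counting.
    const names = new Set(res.sbom.packages.map((p) => p.name));
    names.forEach((name) => counts.set(name, (counts.get(name) ?? 0) + 1));
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}
```

Keeping the summarizing logic as pure functions like these would let it be unit-tested without a GitHub token or a ClickHouse instance; only the fetch and insert steps of the cron task would then need external services.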
Thanks for the PR. I understand why you raised an issue about local testing. Currently, use [...] But cron tasks like this are quite hard to test using public resources, since we cannot expose a writable ClickHouse instance publicly. Is there a best practice for supporting local tests in this scenario?
Maybe instead of using public resources and exposing a writable ClickHouse instance, we could create dedicated test databases that can be built in a local ClickHouse instance. That might help with cron tasks like this.
Actually, we have a ClickHouse Docker image and some sample datasets. However, the official ClickHouse Docker image declares the data directory as a volume, so it is not feasible to build an image on top of it with the dataset baked in, which makes the cost of setting up a local instance quite high for developers. Do you have any ideas to make this happen, so that a developer can just pull and start a Docker container to get a local instance?
OK, I looked into it and found a possible solution, though I don't know how feasible it is. We can create a Docker image with preloaded datasets despite the volume directive by using ClickHouse's initialization scripts feature. We have to create a SQL script that loads the sample dataset and place it in [...]. The initialization scripts run at container startup, so the data is initialized on the first run and persisted in the volume thereafter. Developers can then pull the image with the samples to get a local instance.
Sure. I recall that we actually tried this approach a few years ago, when we first started to provide sample data and a ClickHouse image. The problem is that I could not control the import speed of the ClickHouse instance, so the initialization process consumed a lot of memory in a short time. For most of the sample datasets we provided, the import process took more than 8 GB of memory, and that could not be reduced by throttling the import. Since many students' laptops could not spare 8 GB of memory for a container instance, we gave up and let students download and import the data themselves instead. I cannot remember the details; you could try it locally to see whether the initialization process can be done within 4 GB of memory.
OK. Can you open an issue for this, so that even if I don't get to a solution, someone else might in the future?