-
Notifications
You must be signed in to change notification settings - Fork 2
CDS Extractor Rewrite Phase 2 : Improve Performance and Precision #195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
CDS Extractor Rewrite Phase 2 : Improve Performance and Precision #195
Conversation
This commit: - Implements an initial version of a project-aware CDS parser. - Creates a dedicated "cds" package at "extractors/cds/tools/src/cds". - Converts existing unit tests to use the new path for functions related to parsing and/or compiling .cds files.
This commit: - fixes a typo in a comment, as identified in a previous PR ( advanced-security#188 ); - updates the logic of the CDS extractor's `findPackageJsonDirs` function; - fixes a regression in the CDS extractor where a "project directory" was not properly recognized when its path was the same as the "source root" directory for the CDS extractor scan; - adds unit tests to cover edge cases idendified for the `findPackageJsonDirs` function.
Renames the entrypoint to the CDS extractor script and refactors its arguments in order to support using different "run modes" for the extractor, including: - "autobuild" : work-in-progress, just a stub right now; - "debug-parser" : using for debugging CDS project & file parsing; - "index-files" : legacy mode, useful for backwards compatibility; Updates the usage (help) message for the script to represent the required arguments for each of the currently planned run modes. Adds support for the "debug-parser" run mode, which debugs to a file under the `extractors/cds/tools/out/debug/` directory. Useful for in-progress rewrite of the CDS extractor to be more performant when running and more useful in terms of yielding a CodeQL database that allows for high-precision query results for CDS projects/queries.
Adds extended unit tests for the "parser" component of the CDS extractor, using the CDS projects nested under this repository's `javascript/frameworks/cap/test/queries` directory as testing targets and reference points for test cases.
Adds more extensive unit tests of CDS extractor code related to the use of the `cds` compiler. Adds unit tests for CDS extractor functions in "projectMapping.ts".
Fixes the setup of the CDS extractor environment to ensure that the codeql CLI can be reliably found and to avoid duplicate runs of the CDS parser's graph building process for "debug-parser" versus other run modes.
Cleans up DEBUG logging and improves existing CDS extractor logging in order to provide more useful indications of the CDS compiler version used to compile a given `*.cds.json` file.
Initial attempt to use the `cds compile` CLI command in a way that allows for de-duplication of individual `.cds` files that are already included by another `.cds` file in the project.
This commit: - Implements an initial version of a project-aware CDS parser. - Creates a dedicated "cds" package at "extractors/cds/tools/src/cds". - Converts existing unit tests to use the new path for functions related to parsing and/or compiling .cds files.
Renames the entrypoint to the CDS extractor script and refactors its arguments in order to support using different "run modes" for the extractor, including: - "autobuild" : work-in-progress, just a stub right now; - "debug-parser" : using for debugging CDS project & file parsing; - "index-files" : legacy mode, useful for backwards compatibility; Updates the usage (help) message for the script to represent the required arguments for each of the currently planned run modes. Adds support for the "debug-parser" run mode, which debugs to a file under the `extractors/cds/tools/out/debug/` directory. Useful for in-progress rewrite of the CDS extractor to be more performant when running and more useful in terms of yielding a CodeQL database that allows for high-precision query results for CDS projects/queries.
Adds extended unit tests for the "parser" component of the CDS extractor, using the CDS projects nested under this repository's `javascript/frameworks/cap/test/queries` directory as testing targets and reference points for test cases.
Adds more extensive unit tests of CDS extractor code related to the use of the `cds` compiler. Adds unit tests for CDS extractor functions in "projectMapping.ts".
Fixes the setup of the CDS extractor environment to ensure that the codeql CLI can be reliably found and to avoid duplicate runs of the CDS parser's graph building process for "debug-parser" versus other run modes.
Cleans up DEBUG logging and improves existing CDS extractor logging in order to provide more useful indications of the CDS compiler version used to compile a given `*.cds.json` file.
Initial attempt to use the `cds compile` CLI command in a way that allows for de-duplication of individual `.cds` files that are already included by another `.cds` file in the project.
…/codeql-sap-js into data-douser/cds-ts-rewrite-2
Updates the mermaid flowchart for the CDS extractor in order to reflect recent changes to how the CDS extractor actually works.
Fixes detection of .cds file in CDS projects by ensuring that "node_modules" subdirectories are explicitly ignored and "srv" and "db" subdirectories are explicitly included. Migrates some logic from cds-extractor.ts (entrypoint) script to testable functions under extractors/cds/tools/src/ directory. Adds and improves unit tests related to code changes from this commit.
Removes an unintended change in CDS compile (to .cds.json) behavior due to the (mis)use of the "--parse" command. Fixes a regression in the expected query results in at least one case: `javascript/frameworks/cap/src/sensitive-exposure/SensitiveExposure.ql`
Refactors cds extractor `src/cds/compiler` and `src/cds/parser` packages for improved maintainability. Simplifies the main logic of the CDS extractor such that we always build a graph that maps CDS projects to their imports / dependencies, which is part of the longer process of deprecating the "index-files" run mode of the CDS extractor (in favor of autobuild, eventually). Attempts to fix CDS file and project parsing for test projects such as: `javascript/frameworks/cap/test/queries/loginjection/log-injection-without-protocol-none`
Fixes a regression where the project base directory was being used to set the `cwd` of the process spawned for running the CDS compiler for "project-aware" compilation. Adds unit tests to ensure the `cwd` is always set to the value of the `sourceRoot` directory. Further refactoring of the `cds/compiler` and `cds/parser` packages within the source code of the CDS extractor. This commit is expected to actually cause more problems with existing queries, despite fixing the relative-file-path problem / regression. Some changes to existing CodeQL queries and/or expected results may be required as, at this point, the JSON data generated by the CDS compiler (via the CDS extractor) seems valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the first half of the flow for CDS extraction, and left some comments.
|
||
for (const dir of cdsDirs) { | ||
// Skip if we've already processed this directory or a parent | ||
if (isDirectoryProcessed(dir, processedDirectories)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to confirm that there isn't a possibility for nested CDS projects.
} | ||
|
||
const projectDirs: string[] = []; | ||
const processedDirectories = new Set<string>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need both projectDirs
and processedDirs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not, but when I tried to change it I caused a regression.
On accident, I confirmed that project-aware compilation works well on its own, but it doesn't mix well at all.
Assuming that a model.cds.json
file is created by compiling all of the .cds
files in a given project, the presence of any other .cds.json
files in the same project can lead to FP results in cases where annotations are defined separately from their associated services.
I plan to circle back to this conversation after I have fixed that regression in my local env.
This commit: - Fixes project-aware `cds compile` commands for projects with multiple directories and/or `.cds` files, including fixes for generating the `.cds.json` files that had been missing for such projects. - Attempts to make the CDS extractor more consistent, robust, and simple by using the project-graph-based approach as more of a common data reference (type) for all CDS extractor "run modes". - Attempts to make the CDS extractor more robust in terms of prioritizing the compilation / generation of `.cds.json` files. - Addresses some comments from initial peer review of PR advanced-security#195 for the "advanced-security/codeql-sap-js" repository.
Updates the `getImplementation` predicate of the `CdlService` class within the `CDL.qll` library in order to ensure that CDS-related CodeQL queries are still able to find the `UserDefinedApplicationService` that implements a given `CdlElement`, even when using "project aware" CDS compilation that results in a single `.cds.json` file being created for a given CDS project rather than a 1:1 relationship between each extracted `.cds` file and its `.cds.json` representation.
Attempts to resolve an "Indirect uncontrolled command line" code scanning alert from the recent additon of the `testCdsCommand` function. Uses the `quote` function from the `shell-quote` library to "quote" the offending CDS extractor script argument before using the arg / string within the `testCdsCommand` function.
Addresses a code scanning alert related to the recent introduction of the `Math.random()` function when creating a session ID for CDS extractor logging. Replaces `Math.random()` with `Data.now()` in an effort to remove any perception that the "session ID" is in any way used in a "security context".
Remove "run modes" in an attempt to simplify the overall CDS extractor.
Fixes a regression in the CDS extractor's processing of monorepos.
Updates the `extractors/cds/tools/.gitignore` file to explititly include the `dist/` and `node_modules/` subdirectories in order to support pre-build of the CDS extractor JS (compiled from TS) code.
Improves the efficiency and logging of package dependency installation tasks in the "packageManager" package of the CDS extractor source code. Responds to peer feedback on related PR advanced-security#195. Extends unit test coverage for the `src/packageManager` code.
@lcartey I believe that this version of the CDS extractor accounts for the case where an annotation is written in a separate |
Removes the `dist/` and `node_modules/` directories from git tracking for the CDS extractor. This might be temporary, but for now is needed for PR advanced-security#195 readability.
Cleanup remaining, known problems with the CDS extractor rewrite for PR advanced-security#195 , including: - Fixes a bug that was introduced to the `index-files.sh` script by the previous commit; - Removes dead `src/**/*.ts` code, where found; - Replaces hardcoded, system-local paths in CDS project files; - Improves the organization, logic, and testing of `src/cds/compiler` code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed some dead code and areas which could be simplified.
This commit: - removes unused functions from CDS extractor "src/cds/**/*.ts"; - removes logging of CDS extractor memeory usage; - renames some CDS extractor logging functions for consistency; - addresses PR comments for the `src/cds/parser/functions.ts` file.
Upates the `compileProjectLevel` function of the CDS extractor to just use the project directory as the first argument to the `cds compile` command, which simplifies the code while accounting for a wide variety of project (directory and file) structures.
This reverts commit 119aa29.
} | ||
} | ||
|
||
// Fourth pass: determine expected output files for each project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should calculate the output files in determineCdsFilesToCompile
- the code will be shorter and easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the latest version should be better. I'll keep this conversation open for now, though.
project.expectedOutputFiles, | ||
projectDir, | ||
true, | ||
10, // Higher priority for project-level compilation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of the priority system? Do we really need to prioritise project compilation over file compilation?
This PR implements the planned "Phase 2" of the full rewrite of the CDS extractor, focusing on improving performance and precision. It introduces significant changes to the CodeQL CDS extractor, including a major refactor of the extraction process, updates to scripts, and improvements to configuration and debugging. Throughout this multi-phase rewrite process, the approach has been documented in the
extractors/cds/tools/autobuild.md
file.This changes of this PR do not fully implement the "autobuild" run mode for the CDS extractor, but it gets reasonably close. New "run modes" were added to the renamed
cds-extractor.ts
script (formerlyindex-files.ts
), and the arguments to the script have been update to allow for run modes such asindex-files
(legacy),debug-parser
(new), andautobuild
(planned / WIP).While staying within the limitations of the
index-files
approach, this changes in the PR are an attempt to integrate parsing and compiling of.cds
files in a manner that is "project aware", meaning that we try to only compile the top-level.cds
files in an effort to avoid duplication of both compilation work and indexed.cds.json
files.Key Changes:
New Features and Functionality:
cds-extractor.ts
: Added a new script to handle CDS file processing, including project dependency graph building, environment setup, and integration with CodeQL tools. This script replaces the previousindex-files.js
script.cds-extractor.ts
script with different "run mode" values, includingautobuild
,debug-parser
, andindex-files
.cds-extractor.ts
script has been rewritten to include features like project dependency graph building, project-aware compilation, and diagnostic handling for CDS files. This enables more efficient and context-aware processing of CDS files.debug-parser
run mode of thecds-extractor
(node) script.Script Updates:
index-files.cmd
) Updates: Updated references fromindex-files.js
tocds-extractor.js
, added_run_mode
parameter, and adjusted logging and execution commands to align with the new script. [1] [2] [3]index-files.sh
) Updates: Similar updates as the batch script, including parameter additions and script name changes for consistency. [1] [2] [3] [4]Configuration Improvements:
eslint.config.mjs
): Refactored imports for better readability, updated rules and plugin configurations, and added comments to clarify TypeScript and JavaScript-specific settings.Miscellaneous:
.gitignore
Update: Added an entry to ignoredebug/
files created during debugging of the CDS extractor.Documentation Updates:
README.md
now reflects the newcds-extractor
workflow, replacing outdated references toindex-files
with the new process and steps for project-aware compilation. [1] [2]