Skip to content

CDS Extractor Rewrite Phase 2 : Improve Performance and Precision #195

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 52 commits into
base: main
Choose a base branch
from

Conversation

data-douser
Copy link
Collaborator

@data-douser data-douser commented May 18, 2025

This PR implements the planned "Phase 2" of the full rewrite of the CDS extractor, focusing on improving performance and precision. It introduces significant changes to the CodeQL CDS extractor, including a major refactor of the extraction process, updates to scripts, and improvements to configuration and debugging. Throughout this multi-phase rewrite process, the approach has been documented in the extractors/cds/tools/autobuild.md file.

This changes of this PR do not fully implement the "autobuild" run mode for the CDS extractor, but it gets reasonably close. New "run modes" were added to the renamed cds-extractor.ts script (formerly index-files.ts), and the arguments to the script have been update to allow for run modes such as index-files (legacy), debug-parser (new), and autobuild (planned / WIP).

While staying within the limitations of the index-files approach, this changes in the PR are an attempt to integrate parsing and compiling of .cds files in a manner that is "project aware", meaning that we try to only compile the top-level .cds files in an effort to avoid duplication of both compilation work and indexed .cds.json files.

Key Changes:

New Features and Functionality:

  • Introduction of cds-extractor.ts: Added a new script to handle CDS file processing, including project dependency graph building, environment setup, and integration with CodeQL tools. This script replaces the previous index-files.js script.
  • Run Modes: Adds support for running the cds-extractor.ts script with different "run mode" values, including autobuild, debug-parser, and index-files.
  • New Project-Aware Dependency Handling: The cds-extractor.ts script has been rewritten to include features like project dependency graph building, project-aware compilation, and diagnostic handling for CDS files. This enables more efficient and context-aware processing of CDS files.
  • Debugging: Enhanced debugging of project dependency graph (post-parsing, pre-compilation) now supported via the debug-parser run mode of the cds-extractor (node) script.

Script Updates:

  • Batch Script (index-files.cmd) Updates: Updated references from index-files.js to cds-extractor.js, added _run_mode parameter, and adjusted logging and execution commands to align with the new script. [1] [2] [3]
  • Shell Script (index-files.sh) Updates: Similar updates as the batch script, including parameter additions and script name changes for consistency. [1] [2] [3] [4]

Configuration Improvements:

  • ESLint Configuration (eslint.config.mjs): Refactored imports for better readability, updated rules and plugin configurations, and added comments to clarify TypeScript and JavaScript-specific settings.

Miscellaneous:

  • .gitignore Update: Added an entry to ignore debug/ files created during debugging of the CDS extractor.

Documentation Updates:

  • Updated Workflow Diagrams: The README.md now reflects the new cds-extractor workflow, replacing outdated references to index-files with the new process and steps for project-aware compilation. [1] [2]

This commit:

- Implements an initial version of a project-aware CDS parser.
- Creates a dedicated "cds" package at "extractors/cds/tools/src/cds".
- Converts existing unit tests to use the new path for functions
  related to parsing and/or compiling .cds files.
This commit:

- fixes a typo in a comment, as identified in a previous PR ( advanced-security#188 );

- updates the logic of the CDS extractor's `findPackageJsonDirs` function;

- fixes a regression in the CDS extractor where a "project directory"
  was not properly recognized when its path was the same as the
  "source root" directory for the CDS extractor scan;

- adds unit tests to cover edge cases idendified for the
  `findPackageJsonDirs` function.
Renames the entrypoint to the CDS extractor script and refactors its
arguments in order to support using different "run modes" for the
extractor, including:

  - "autobuild" : work-in-progress, just a stub right now;
  - "debug-parser" : using for debugging CDS project & file parsing;
  - "index-files" : legacy mode, useful for backwards compatibility;

Updates the usage (help) message for the script to represent the
required arguments for each of the currently planned run modes.

Adds support for the "debug-parser" run mode, which debugs to a file
under the `extractors/cds/tools/out/debug/` directory. Useful for
in-progress rewrite of the CDS extractor to be more performant when
running and more useful in terms of yielding a CodeQL database that
allows for high-precision query results for CDS projects/queries.
Adds extended unit tests for the "parser" component of the CDS
extractor, using the CDS projects nested under this repository's
`javascript/frameworks/cap/test/queries` directory as testing
targets and reference points for test cases.
Adds more extensive unit tests of CDS extractor code related to the
use of the `cds` compiler.

Adds unit tests for CDS extractor functions in "projectMapping.ts".
Fixes the setup of the CDS extractor environment to ensure that the
codeql CLI can be reliably found and to avoid duplicate runs of
the CDS parser's graph building process for "debug-parser" versus
other run modes.
Cleans up DEBUG logging and improves existing CDS extractor logging
in order to provide more useful indications of the CDS compiler
version used to compile a given `*.cds.json` file.
Initial attempt to use the `cds compile` CLI command in a way that
allows for de-duplication of individual `.cds` files that are already
included by another `.cds` file in the project.
This commit:

- Implements an initial version of a project-aware CDS parser.
- Creates a dedicated "cds" package at "extractors/cds/tools/src/cds".
- Converts existing unit tests to use the new path for functions
  related to parsing and/or compiling .cds files.
Renames the entrypoint to the CDS extractor script and refactors its
arguments in order to support using different "run modes" for the
extractor, including:

  - "autobuild" : work-in-progress, just a stub right now;
  - "debug-parser" : using for debugging CDS project & file parsing;
  - "index-files" : legacy mode, useful for backwards compatibility;

Updates the usage (help) message for the script to represent the
required arguments for each of the currently planned run modes.

Adds support for the "debug-parser" run mode, which debugs to a file
under the `extractors/cds/tools/out/debug/` directory. Useful for
in-progress rewrite of the CDS extractor to be more performant when
running and more useful in terms of yielding a CodeQL database that
allows for high-precision query results for CDS projects/queries.
Adds extended unit tests for the "parser" component of the CDS
extractor, using the CDS projects nested under this repository's
`javascript/frameworks/cap/test/queries` directory as testing
targets and reference points for test cases.
Adds more extensive unit tests of CDS extractor code related to the
use of the `cds` compiler.

Adds unit tests for CDS extractor functions in "projectMapping.ts".
Fixes the setup of the CDS extractor environment to ensure that the
codeql CLI can be reliably found and to avoid duplicate runs of
the CDS parser's graph building process for "debug-parser" versus
other run modes.
Cleans up DEBUG logging and improves existing CDS extractor logging
in order to provide more useful indications of the CDS compiler
version used to compile a given `*.cds.json` file.
Initial attempt to use the `cds compile` CLI command in a way that
allows for de-duplication of individual `.cds` files that are already
included by another `.cds` file in the project.
…/codeql-sap-js into data-douser/cds-ts-rewrite-2
Updates the mermaid flowchart for the CDS extractor in order to
reflect recent changes to how the CDS extractor actually works.
Copilot

This comment was marked as outdated.

data-douser and others added 7 commits June 7, 2025 18:26
Fixes detection of .cds file in CDS projects by ensuring that
"node_modules" subdirectories are explicitly ignored and "srv" and "db"
subdirectories are explicitly included.

Migrates some logic from cds-extractor.ts (entrypoint) script to
testable functions under extractors/cds/tools/src/ directory.

Adds and improves unit tests related to code changes from this commit.
Removes an unintended change in CDS compile (to .cds.json) behavior
due to the (mis)use of the "--parse" command.

Fixes a regression in the expected query results in at least one case:

`javascript/frameworks/cap/src/sensitive-exposure/SensitiveExposure.ql`
Refactors cds extractor `src/cds/compiler` and `src/cds/parser`
packages for improved maintainability.

Simplifies the main logic of the CDS extractor such that we always
build a graph that maps CDS projects to their imports / dependencies,
which is part of the longer process of deprecating the "index-files"
run mode of the CDS extractor (in favor of autobuild, eventually).

Attempts to fix CDS file and project parsing for test projects such as:
`javascript/frameworks/cap/test/queries/loginjection/log-injection-without-protocol-none`
Fixes a regression where the project base directory was being used
to set the `cwd` of the process spawned for running the CDS compiler
for "project-aware" compilation. Adds unit tests to ensure the `cwd`
is always set to the value of the `sourceRoot` directory.

Further refactoring of the `cds/compiler` and `cds/parser` packages
within the source code of the CDS extractor.

This commit is expected to actually cause more problems with existing
queries, despite fixing the relative-file-path problem / regression.
Some changes to existing CodeQL queries and/or expected results may be
required as, at this point, the JSON data generated by the CDS compiler
(via the CDS extractor) seems valid.
@data-douser data-douser requested a review from Copilot June 11, 2025 15:57
Copy link
Contributor

@lcartey lcartey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the first half of the flow for CDS extraction, and left some comments.


for (const dir of cdsDirs) {
// Skip if we've already processed this directory or a parent
if (isDirectoryProcessed(dir, processedDirectories)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to confirm that there isn't a possibility for nested CDS projects.

}

const projectDirs: string[] = [];
const processedDirectories = new Set<string>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both projectDirs and processedDirs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, but when I tried to change it I caused a regression.

On accident, I confirmed that project-aware compilation works well on its own, but it doesn't mix well at all.
Assuming that a model.cds.json file is created by compiling all of the .cds files in a given project, the presence of any other .cds.json files in the same project can lead to FP results in cases where annotations are defined separately from their associated services.

I plan to circle back to this conversation after I have fixed that regression in my local env.

This commit:

- Fixes project-aware `cds compile` commands for projects with multiple
  directories and/or `.cds` files, including fixes for generating the
  `.cds.json` files that had been missing for such projects.

- Attempts to make the CDS extractor more consistent, robust, and simple
  by using the project-graph-based approach as more of a common data
  reference (type) for all CDS extractor "run modes".

- Attempts to make the CDS extractor more robust in terms of
  prioritizing the compilation / generation of `.cds.json` files.

- Addresses some comments from initial peer review of PR advanced-security#195 for the
  "advanced-security/codeql-sap-js" repository.
Updates the `getImplementation` predicate of the `CdlService` class
within the `CDL.qll` library in order to ensure that CDS-related CodeQL
queries are still able to find the `UserDefinedApplicationService`
that implements a given `CdlElement`, even when using "project aware"
CDS compilation that results in a single `.cds.json` file being created
for a given CDS project rather than a 1:1 relationship between each
extracted `.cds` file and its `.cds.json` representation.
Attempts to resolve an "Indirect uncontrolled command line" code
scanning alert from the recent additon of the `testCdsCommand` function.

Uses the `quote` function from the `shell-quote` library to "quote" the
offending CDS extractor script argument before using the arg / string
within the `testCdsCommand` function.
Addresses a code scanning alert related to the recent introduction
of the `Math.random()` function when creating a session ID for CDS
extractor logging. Replaces `Math.random()` with `Data.now()` in an
effort to remove any perception that the "session ID" is in any way
used in a "security context".
Remove "run modes" in an attempt to simplify the overall CDS extractor.
Fixes a regression in the CDS extractor's processing of monorepos.
Updates the `extractors/cds/tools/.gitignore` file to explititly include
the `dist/` and `node_modules/` subdirectories in order to support
pre-build of the CDS extractor JS (compiled from TS) code.
Improves the efficiency and logging of package dependency installation
tasks in the "packageManager" package of the CDS extractor source code.

Responds to peer feedback on related PR advanced-security#195.

Extends unit test coverage for the `src/packageManager` code.
@data-douser data-douser requested a review from lcartey July 1, 2025 03:12
@data-douser
Copy link
Collaborator Author

@lcartey I believe that this version of the CDS extractor accounts for the case where an annotation is written in a separate .cds file. Because the latest approach uses a single cds compile command to compile all of the .cds files for a given CAP project, the js/cap-entity-exposed-without-authentication query should no longer return FP results related to the TestService in the cloud-cap-samples/bookshop project.

Removes the `dist/` and `node_modules/` directories from git tracking
for the CDS extractor. This might be temporary, but for now is needed
for PR advanced-security#195 readability.
Cleanup remaining, known problems with the CDS extractor rewrite
for PR advanced-security#195 , including:

- Fixes a bug that was introduced to the `index-files.sh` script by the
  previous commit;
- Removes dead `src/**/*.ts` code, where found;
- Replaces hardcoded, system-local paths in CDS project files;
- Improves the organization, logic, and testing of `src/cds/compiler`
  code.
Copy link
Contributor

@lcartey lcartey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed some dead code and areas which could be simplified.

This commit:

- removes unused functions from CDS extractor "src/cds/**/*.ts";
- removes logging of CDS extractor memeory usage;
- renames some CDS extractor logging functions for consistency;
- addresses PR comments for the `src/cds/parser/functions.ts` file.
Upates the `compileProjectLevel` function of the CDS extractor to just
use the project directory as the first argument to the `cds compile`
command, which simplifies the code while accounting for a wide variety
of project (directory and file) structures.
}
}

// Fourth pass: determine expected output files for each project
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should calculate the output files in determineCdsFilesToCompile - the code will be shorter and easier to understand.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the latest version should be better. I'll keep this conversation open for now, though.

project.expectedOutputFiles,
projectDir,
true,
10, // Higher priority for project-level compilation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the priority system? Do we really need to prioritise project compilation over file compilation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request javascript Pull requests that update javascript code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants