Prototype extended data #660

spoorcc · 2025-02-01T22:56:55Z

This is a prototype of enriching the metadata (.dfetch_data.yml) with data about each individual fetched file.

files:
- <path>|<SHA-1 hash>|<file permissions>
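
As an illustration of the entry layout above, here is a minimal sketch of reading and writing one entry. `FileEntry`, `format_entry` and `parse_entry` are hypothetical names for this example, not the prototype's actual code. Splitting from the right is an assumption made here so that a `|` inside a file path does not break parsing.

```python
from typing import NamedTuple


class FileEntry(NamedTuple):
    """One tracked file; a hypothetical stand-in for the prototype's FileInfo."""

    path: str
    sha1: str
    mode: str


def format_entry(entry: FileEntry) -> str:
    """Serialize an entry as '<path>|<SHA-1 hash>|<file permissions>'."""
    return f"{entry.path}|{entry.sha1}|{entry.mode}"


def parse_entry(line: str) -> FileEntry:
    """Parse an entry, splitting from the right so a '|' in the path survives."""
    path, sha1, mode = line.rsplit("|", 2)
    return FileEntry(path, sha1, mode)
```

A hash and a permission string never contain `|`, so only the path can be ambiguous, which is why the split starts at the right-hand side.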

This could help solve a few issues:

- Non destructive file update in root and sub directories #616: non-destructive file updates --> only tracked files are deleted
- Use dfetch for tracking upstream template #334: using dfetch for template tracking --> only tracked files are deleted
- Hash directory also hashes untracked files #350: hashing untracked files --> only tracked files are hashed
- Diff: metadata file part of patch #267: metadata part of patch --> only tracked files are hashed
- Line-endings not configurable #90: line endings --> hash can be implemented with normalized line endings
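
To make "only tracked files are deleted" concrete, here is a minimal sketch, assuming the tracked relative paths have already been read from the metadata. `remove_tracked_files` is a hypothetical helper, not the prototype's code, and it does not validate the paths or prune directories that become empty.

```python
import os
from typing import Iterable, List


def remove_tracked_files(root: str, tracked_paths: Iterable[str]) -> List[str]:
    """Delete only the files listed in the metadata; return the paths removed."""
    removed = []
    for rel_path in tracked_paths:
        full_path = os.path.join(root, rel_path)
        # Files that are tracked but already gone are skipped silently.
        if os.path.isfile(full_path):
            os.remove(full_path)
            removed.append(rel_path)
    return removed
```

Anything in `root` that is not listed, such as a file the user added locally, is left untouched.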

Next to that, it may assist in more enhancements later on:

Report what exact file changed
Also check changed file permissions
Multiple single files in same dir? May need to bring multiple projects in a single metadata file.
dfetch diff only creates a patch for fetched files.
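
The first two enhancements could look roughly like the sketch below, which reports per tracked file whether it is missing, modified, or only changed in permissions. `classify_changes` and its tuple layout are assumptions for this example. For brevity it hashes the raw bytes, whereas the prototype hashes with normalized line endings.

```python
import hashlib
import os
import stat
from typing import Dict, Iterable, Tuple


def classify_changes(
    root: str, entries: Iterable[Tuple[str, str, str]]
) -> Dict[str, str]:
    """Map each changed tracked path to 'missing', 'modified' or 'permissions'.

    Each entry is a (relative path, expected SHA-1, expected mode) tuple.
    """
    changes = {}
    for rel_path, expected_sha1, expected_mode in entries:
        full_path = os.path.join(root, rel_path)
        if not os.path.isfile(full_path):
            changes[rel_path] = "missing"
            continue
        with open(full_path, "rb") as handle:
            actual_sha1 = hashlib.sha1(handle.read()).hexdigest()
        # Permission bits only, as zero-padded octal (for example '644').
        actual_mode = format(stat.S_IMODE(os.stat(full_path).st_mode), "03o")
        if actual_sha1 != expected_sha1:
            changes[rel_path] = "modified"
        elif actual_mode != expected_mode:
            changes[rel_path] = "permissions"
    return changes
```

Unchanged files do not appear in the result, so an empty dictionary means the tracked files match the metadata.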

A disadvantage would be that the metadata file becomes longer and dfetch becomes slower.
This branch sort of works, but it is a proof of concept. Some things are still required:

normalized path separators in file paths.
optimize / reduce data in metadata.
determining what files were fetched and what files were existing.
showing error if overwriting a file during an update and no --force was provided.
how to handle overlapping projects
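
For the first item, one possible approach is to always store forward slashes, so the metadata looks the same whether it was written on Windows or on Linux. This is a sketch under that assumption. `normalize_separators` is a hypothetical helper, and it would also rewrite a literal backslash in a POSIX file name.

```python
from pathlib import PureWindowsPath


def normalize_separators(path: str) -> str:
    """Return the relative path with forward slashes as separators."""
    # PureWindowsPath accepts both '\\' and '/' as separators on any platform.
    return PureWindowsPath(path).as_posix()
```

With this, `tests\CppUTestExt\TestMockSupport.cpp` and `tests/CppUTestExt/TestMockSupport.cpp` are stored as the same string.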

Note

Since this may have significant impact on users, I would especially like to get the metadata layout right.
Backwards compatibility is of course a must, so as not to annoy current users.
I'm looking for any feedback, positive or negative 😉 (@jgeudens @sach-edna @deminngi).

Is a bigger and changed manifest & potential performance impact worth the additional possibilities?
Should dfetch make it possible to skip adding the extra info?
Do you see any problems I'm overlooking?
Should dfetch track more outside the content and permissions?

Here is a part of the beginning of the metadata file from an example project:

# This is a generated file by dfetch. Don't edit this, but edit the manifest.
# For more info see https://dfetch.rtfd.io/en/latest/getting_started.html
dfetch:
  remote_url: https://github.com/cpputest/cpputest.git
  branch: master
  revision: ''
  last_fetch: 01/02/2025, 21:45:26
  tag: v3.4
  hash: ade6fb38b21bee516ffc657068c7058d
  patch: ''
  files:
  - docs/WalkThrough_VS21010.docx|b6d355e565db026333573739df74b499dd6dae90|666
  - makeVc6.bat|92fbc98a4d82c40aca5e0833a98daef552b37712|666
  - cpputest.pc.in|01c412b209793b6e9ea9d591341ee549ce06da05|666
  - tests/TestHarness_cTestCFile.c|e6f64813ba90a04b13bbe7bc2bd30c4c21bde449|666
  - tests/AllTests.vcproj|549a9949a25ff5ec0bd074df485f42e8090cdeef|666
  - tests/MemoryLeakDetectorTest.cpp|9c23efb92a5ccb04650aad7dbd45409bab290f1d|666
  - tests/CommandLineArgumentsTest.cpp|5acb1877eee2b7f55ac8871d6e40d4351709e9ac|666
  - tests/AllocLetTestFreeTest.cpp|ad22965b1decd3609bf227a708a65185b55d2cf2|666
  - tests/AllocLetTestFree.c|970be9610f9a7bd23b5ab45b8fa82b01f08bf419|666
  - tests/JUnitOutputTest.cpp|9f8528dbd8060709c5da79cb327a64323c8aabc9|666
  - tests/TestMemoryAllocatorTest.cpp|5c59b8eb2d57bdec178b1b30ef8586534930be24|666
  - tests/AllocationInCppFile.h|88e6bf900873c193d2acd0042bf40910805f90ab|666
  - tests/UtestTest.cpp|88ae650bf6f1f579cbbee206bf2e8505cf9cf795|666
  - tests/TestOutputTest.cpp|ff410d97e5db0f422732d66e276da36bebdd0f89|666
  - tests/TestHarness_cTest.cpp|456e0a6f32429540b5f3f998b9fba0dbe85580e1|666
  - tests/CppUTestExt/TestMockExpectedFunctionsList.cpp|064a721cda26f5b95ea180b5587e4b1fd547137d|666
  - tests/CppUTestExt/TestMemoryReportAllocator.cpp|a0892be7ab4c81ed50a7f50f8c10fa29066b8cd1|666
  - tests/CppUTestExt/TestMemoryReporterPlugin.cpp|fd09b5388139446301a2de109f87dc2738ad4f5a|666
  - tests/CppUTestExt/TestMockCheatSheet.cpp|c663bdbe20f2b9df17dda636e79f17dafd1487d3|666
  - tests/CppUTestExt/TestMockSupport.cpp|dc65526c4e4c70a588b13da583480958a8349aec|666

ben-edna · 2025-03-07T16:04:34Z

/korbit-review

korbit-ai

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.

Category	Issue	Fix Detected
	Missing File Existence Check
	Undocumented Regex Pattern
	Incorrect return type annotation for files property
	Complex Boolean Logic
	Missing Error Handling in Directory Traversal
	Missing Type Hints
	Directory Traversal Vulnerability
	Unexplained Error Suppression
	Redundant file operations

Files scanned

File Path	Reviewed
dfetch/util/util.py	✅
dfetch/project/metadata.py	✅
dfetch/project/vcs.py	✅

korbit-ai · 2025-03-07T16:12:35Z

dfetch/project/vcs.py

+            full_path = os.path.join(self.local_path, file.path)
+            if hash_file_normalized(full_path).hexdigest() != file.hash:


Missing File Existence Check

What is the issue?

The code doesn't check if the file exists before attempting to hash it, which could cause crashes.

Why this matters

If a file was deleted but still exists in the metadata, this will raise an unhandled exception when trying to hash a non-existent file.

Suggested change ∙ Feature Preview

Add file existence check before hashing:

full_path = os.path.join(self.local_path, file.path)
if not os.path.exists(full_path):
    logger.debug(f"File {full_path} no longer exists!")
    return True
if hash_file_normalized(full_path).hexdigest() != file.hash:

korbit-ai · 2025-03-07T16:12:36Z

dfetch/util/util.py

+    digest = hashlib.sha1(usedforsecurity=False)
+
+    if os.path.isfile(file_path):
+        normalize_re = re.compile(b"\r\n|\r")


Undocumented Regex Pattern

What is the issue?

Regular expression pattern is defined inside the function without explanation of what it matches.

Why this matters

Complex regex patterns without documentation or clear variable names make the code harder to understand and maintain.

Suggested change ∙ Feature Preview

# Define at module level with clear name
LINE_ENDING_PATTERN = re.compile(b"\r\n|\r")  # Matches Windows (CRLF) and old Mac (CR) line endings
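
To show what the pattern is for, here is a minimal sketch of a line-ending-insensitive SHA-1. `sha1_normalized` is a hypothetical helper working on in-memory bytes, and it is not the prototype's `hash_file_normalized`, whose full body is not shown here. Note that applying it to binary files would also rewrite any CR bytes they contain.

```python
import hashlib
import re

# Matches Windows (CRLF) and old Mac (CR) line endings.
LINE_ENDING_PATTERN = re.compile(b"\r\n|\r")


def sha1_normalized(data: bytes) -> str:
    """SHA-1 hex digest of data with CRLF and CR line endings rewritten to LF."""
    return hashlib.sha1(LINE_ENDING_PATTERN.sub(b"\n", data)).hexdigest()
```

Normalizing the whole content at once avoids a subtle issue with chunked reads, where a CRLF pair split across two chunks would be replaced by two LF bytes.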

korbit-ai · 2025-03-07T16:12:36Z

dfetch/project/metadata.py

+    @property
+    def files(self) -> Iterable[FileInfo]:
+        """File info as stored in the metadata."""
+        return self._files


Incorrect return type annotation for files property

What is the issue?

The files property returns Optional[Iterable[FileInfo]] but is annotated to return Iterable[FileInfo]

Why this matters

This type mismatch could cause runtime errors when consumers expect a non-None return value but receive None

Suggested change ∙ Feature Preview

Update the return type annotation to match the actual return type:

@property
def files(self) -> Optional[Iterable[FileInfo]]:
    """File info as stored in the metadata."""
    return self._files

korbit-ai · 2025-03-07T16:12:36Z

dfetch/project/vcs.py

+        if not file_info:
+            return bool(on_disk_hash) and on_disk_hash != hash_directory(
+                self.local_path, skiplist=[self.__metadata.FILENAME]
+            )


Complex Boolean Logic

What is the issue?

Complex boolean expression with unclear fallback logic mixing multiple conditions.

Why this matters

The nested conditions and boolean operations make it difficult to understand the flow and intention of the code at a glance.

Suggested change ∙ Feature Preview

Split into more explicit conditions:

if not file_info:
    if not on_disk_hash:
        return False
    current_hash = hash_directory(self.local_path, skiplist=[self.__metadata.FILENAME])
    return on_disk_hash != current_hash

korbit-ai · 2025-03-07T16:12:36Z

dfetch/util/util.py

+def recursive_listdir(directory):
+    """List all entries in the current directory."""
+    entries = os.listdir(directory)


Missing Error Handling in Directory Traversal

What is the issue?

The function recursive_listdir() doesn't handle potential permission errors or broken symlinks when accessing directories.

Why this matters

If the function encounters a directory without read permissions or a broken symlink, it will raise an unhandled OSError/PermissionError, causing the entire directory traversal to fail.

Suggested change ∙ Feature Preview

Add error handling to gracefully skip inaccessible directories:

def recursive_listdir(directory):
    """List all entries in the current directory."""
    try:
        entries = os.listdir(directory)
        for entry in entries:
            full_path = os.path.join(directory, entry)
            try:
                if os.path.isdir(full_path):
                    yield from recursive_listdir(full_path)
                else:
                    yield full_path
            except (OSError, PermissionError):
                continue
    except (OSError, PermissionError):
        return

korbit-ai · 2025-03-07T16:12:36Z

dfetch/util/util.py

@@ -102,6 +103,21 @@
    ]


+def recursive_listdir(directory):


Missing Type Hints

What is the issue?

Missing type hints for both the parameter and return type in the recursive_listdir function.

Why this matters

Type hints help with code understanding, IDE support, and static type checking. Their absence makes it harder to understand what the function expects and returns without diving into the implementation.

Suggested change ∙ Feature Preview

def recursive_listdir(directory: str) -> Generator[str, None, None]:
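
For comparison, a fully typed variant can be built on `os.walk` from the standard library. This is only a sketch. `iter_files` is a hypothetical name, and its behaviour differs from the prototype's `recursive_listdir` in that symlinked directories are not entered.

```python
import os
from typing import Iterator


def iter_files(directory: str) -> Iterator[str]:
    """Yield the path of every file below directory, including subdirectories."""
    # os.walk does not follow directory symlinks by default and, with
    # onerror left unset, skips directories it cannot list.
    for current_dir, _subdirs, filenames in os.walk(directory):
        for filename in filenames:
            yield os.path.join(current_dir, filename)
```

`Iterator[str]` is a shorter annotation for a generator that only yields values.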

korbit-ai · 2025-03-07T16:12:36Z

dfetch/util/util.py

+    entries = os.listdir(directory)
+
+    for entry in entries:
+        full_path = os.path.join(directory, entry)


Directory Traversal Vulnerability

What is the issue?

The recursive_listdir function is vulnerable to directory traversal attacks if the input directory path is not validated.

Why this matters

Without path validation, malicious input could potentially access files outside the intended directory tree through symbolic links or relative paths.

Suggested change ∙ Feature Preview

Add path validation and resolve symbolic links:

directory = os.path.abspath(directory)
if not os.path.realpath(directory).startswith(os.path.realpath(safe_root)):
    raise ValueError("Access denied: Directory outside allowed path")
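
One caveat with a plain string-prefix test: a sibling directory such as `/srv/project_evil` also starts with `/srv/project`. The sketch below compares whole path components with `os.path.commonpath` instead. `is_within` is a hypothetical helper, and the root it receives corresponds to the `safe_root` placeholder above.

```python
import os


def is_within(root: str, candidate: str) -> bool:
    """True if candidate resolves to root itself or to a path below it."""
    real_root = os.path.realpath(root)
    real_candidate = os.path.realpath(candidate)
    try:
        return os.path.commonpath([real_root, real_candidate]) == real_root
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False
```

Because both paths go through `os.path.realpath` first, symbolic links and `..` segments are resolved before the comparison.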

korbit-ai · 2025-03-07T16:12:36Z

dfetch/project/vcs.py

+            with suppress(TypeError):
+                metadata_files = Metadata.from_file(self.__metadata.path).files


Unexplained Error Suppression

What is the issue?

Silent error suppression without explaining why TypeError is expected or can be safely ignored.

Why this matters

Code maintainers will have to dig through the codebase to understand why this error is suppressed, making the code harder to understand and maintain.

Suggested change ∙ Feature Preview

Add a comment explaining the rationale:

# Suppress TypeError when metadata file is invalid or has old format without 'files' field
with suppress(TypeError):
    metadata_files = Metadata.from_file(self.__metadata.path).files

korbit-ai · 2025-03-07T16:12:36Z

dfetch/project/vcs.py

+            files_list = (
+                FileInfo(
+                    os.path.basename(self.local_path),
+                    hash_file_normalized(os.path.join(self.local_path)).hexdigest(),
+                    oct(os.stat(os.path.join(self.local_path)).st_mode)[-3:],
+                ),
+            )


Redundant file operations

What is the issue?

Redundant os.path.join calls and file stat operations

Why this matters

Multiple system calls to the same file wastes I/O operations which impacts performance, especially when dealing with many files

Suggested change ∙ Feature Preview

Cache the joined path and file stat results:

full_path = os.path.join(self.local_path)
stat_result = os.stat(full_path)
files_list = (
    FileInfo(
        os.path.basename(self.local_path),
        hash_file_normalized(full_path).hexdigest(),
        oct(stat_result.st_mode)[-3:],
    ),
)

sach-edna · 2025-03-11T08:00:09Z

According to the YAML specification, | is used for block scalars:

"block scalars are delimited with indentation with optional modifiers to preserve (|) or fold (>) newlines."
So using it to delimit this data may cause some confusion for parsers.

spoorcc added 4 commits April 10, 2025 15:29

Store data per fetched file

37bab75

Add list of files as last entry in metadata (important info first)

d44017c

Check for local changes on per-file basis

9b2f2e9

Only remove fetched files

7db9b3e

ben-edna force-pushed the prototype-extended-data branch from 635cf97 to 7db9b3e Compare April 10, 2025 15:29
