spoorcc
Contributor

@spoorcc spoorcc commented Feb 1, 2025

This is a prototype of enriching the metadata (.dfetch_data.yml) with data about each individual fetched file.

files:
- <path>|<SHA-1 hash>|<file permissions>

This could help solve a few issues:

Next to that, it may assist in more enhancements later on:

  • Report what exact file changed
  • Also check changed file permissions
  • Multiple single files in the same directory? This may require storing multiple projects in a single metadata file.
  • dfetch diff only creates a patch for fetched files.
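
As a rough illustration of the first two points, per-file records would make it possible to say which file changed and whether it was the content or the permissions. This is only a sketch: the function name `diff_file_records` and the mapping layout (path to a hash/permissions pair) are assumptions for illustration, not dfetch's API.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: path -> (SHA-1 hex digest, permission string such as "644").
Records = Dict[str, Tuple[str, str]]


def diff_file_records(recorded: Records, current: Records) -> List[str]:
    """Describe per-file differences between recorded metadata and the disk state."""
    report = []
    for path, (old_hash, old_perms) in sorted(recorded.items()):
        if path not in current:
            report.append(f"{path}: missing")
            continue
        new_hash, new_perms = current[path]
        if new_hash != old_hash:
            report.append(f"{path}: content changed")
        if new_perms != old_perms:
            report.append(f"{path}: permissions changed ({old_perms} -> {new_perms})")
    return report
```

Files that exist on disk but are not in the recorded metadata are ignored here; deciding how to treat them is one of the open points of the prototype.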

A disadvantage would be that the metadata file becomes longer and dfetch becomes slower.
This branch sort of works, but it is a proof of concept. Some things are still required:

  • normalized path separators in file paths.
  • optimize / reduce data in metadata.
  • determining what files were fetched and what files were existing.
  • showing error if overwriting a file during an update and no --force was provided.
  • how to handle overlapping projects
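
For the first item, one possible approach is to always store forward slashes in the metadata. The sketch below is an assumption about how that could be done (the helper name `normalize_separators` is hypothetical). It treats every backslash as a separator, so it would also rewrite a POSIX filename that contains a literal backslash.

```python
from pathlib import PureWindowsPath


def normalize_separators(path: str) -> str:
    """Return the path with forward slashes, regardless of the platform that wrote it."""
    # PureWindowsPath accepts both "\\" and "/" as separators and never
    # touches the filesystem, so it works on any platform.
    return PureWindowsPath(path).as_posix()
```

With this, a path recorded on Windows and the same path recorded on Linux would produce the same metadata entry.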

Note

Since this may have a significant impact on users, I would especially like to get the metadata layout right.
Backwards compatibility is of course a must, so as not to annoy current users.
I'm looking for any feedback, positive or negative 😉 (@jgeudens @sach-edna @deminngi).

  • Are a bigger, changed manifest and a potential performance impact worth the additional possibilities?
  • Should dfetch make it possible to skip adding the extra info?
  • Do you see any problems I'm overlooking?
  • Should dfetch track more than the content and permissions?

Here is the beginning of the metadata file from an example project:

# This is a generated file by dfetch. Don't edit this, but edit the manifest.
# For more info see https://dfetch.rtfd.io/en/latest/getting_started.html
dfetch:
  remote_url: https://github.com/cpputest/cpputest.git
  branch: master
  revision: ''
  last_fetch: 01/02/2025, 21:45:26
  tag: v3.4
  hash: ade6fb38b21bee516ffc657068c7058d
  patch: ''
  files:
  - docs/WalkThrough_VS21010.docx|b6d355e565db026333573739df74b499dd6dae90|666
  - makeVc6.bat|92fbc98a4d82c40aca5e0833a98daef552b37712|666
  - cpputest.pc.in|01c412b209793b6e9ea9d591341ee549ce06da05|666
  - tests/TestHarness_cTestCFile.c|e6f64813ba90a04b13bbe7bc2bd30c4c21bde449|666
  - tests/AllTests.vcproj|549a9949a25ff5ec0bd074df485f42e8090cdeef|666
  - tests/MemoryLeakDetectorTest.cpp|9c23efb92a5ccb04650aad7dbd45409bab290f1d|666
  - tests/CommandLineArgumentsTest.cpp|5acb1877eee2b7f55ac8871d6e40d4351709e9ac|666
  - tests/AllocLetTestFreeTest.cpp|ad22965b1decd3609bf227a708a65185b55d2cf2|666
  - tests/AllocLetTestFree.c|970be9610f9a7bd23b5ab45b8fa82b01f08bf419|666
  - tests/JUnitOutputTest.cpp|9f8528dbd8060709c5da79cb327a64323c8aabc9|666
  - tests/TestMemoryAllocatorTest.cpp|5c59b8eb2d57bdec178b1b30ef8586534930be24|666
  - tests/AllocationInCppFile.h|88e6bf900873c193d2acd0042bf40910805f90ab|666
  - tests/UtestTest.cpp|88ae650bf6f1f579cbbee206bf2e8505cf9cf795|666
  - tests/TestOutputTest.cpp|ff410d97e5db0f422732d66e276da36bebdd0f89|666
  - tests/TestHarness_cTest.cpp|456e0a6f32429540b5f3f998b9fba0dbe85580e1|666
  - tests/CppUTestExt/TestMockExpectedFunctionsList.cpp|064a721cda26f5b95ea180b5587e4b1fd547137d|666
  - tests/CppUTestExt/TestMemoryReportAllocator.cpp|a0892be7ab4c81ed50a7f50f8c10fa29066b8cd1|666
  - tests/CppUTestExt/TestMemoryReporterPlugin.cpp|fd09b5388139446301a2de109f87dc2738ad4f5a|666
  - tests/CppUTestExt/TestMockCheatSheet.cpp|c663bdbe20f2b9df17dda636e79f17dafd1487d3|666
  - tests/CppUTestExt/TestMockSupport.cpp|dc65526c4e4c70a588b13da583480958a8349aec|666
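
To make the layout concrete, here is a minimal sketch of how one entry of the `files:` list could be read back. The names `FileEntry` and `parse_file_entry` are hypothetical and are not taken from the branch.

```python
from typing import NamedTuple


class FileEntry(NamedTuple):
    """One line of the prototype's `files:` list (field names are illustrative)."""

    path: str
    hash: str
    permissions: str


def parse_file_entry(entry: str) -> FileEntry:
    """Split '<path>|<SHA-1 hash>|<file permissions>' into its three fields."""
    # Split from the right so a "|" inside the path does not shift the fields.
    path, sha1, permissions = entry.rsplit("|", 2)
    return FileEntry(path, sha1, permissions)
```

Splitting from the right relies on the hash and permission fields never containing the separator, which holds for hex digests and octal digits.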

@ben-edna
Contributor

ben-edna commented Mar 7, 2025

/korbit-review

@korbit-ai korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category        Issue
Error Handling  Missing File Existence Check
Readability     Undocumented Regex Pattern
Functionality   Incorrect return type annotation for files property
Readability     Complex Boolean Logic
Error Handling  Missing Error Handling in Directory Traversal
Readability     Missing Type Hints
Security        Directory Traversal Vulnerability
Readability     Unexplained Error Suppression
Performance     Redundant file operations

Files scanned:
  • dfetch/util/util.py
  • dfetch/project/metadata.py
  • dfetch/project/vcs.py

Comment on lines +392 to +402
full_path = os.path.join(self.local_path, file.path)
if hash_file_normalized(full_path).hexdigest() != file.hash:

Missing File Existence Check (category: Error Handling)

What is the issue?

The code doesn't check if the file exists before attempting to hash it, which could cause crashes.

Why this matters

If a file was deleted but still exists in the metadata, this will raise an unhandled exception when trying to hash a non-existent file.

Suggested change ∙ Feature Preview

Add file existence check before hashing:

full_path = os.path.join(self.local_path, file.path)
if not os.path.exists(full_path):
    logger.debug(f"File {full_path} no longer exists!")
    return True
if hash_file_normalized(full_path).hexdigest() != file.hash:

digest = hashlib.sha1(usedforsecurity=False)

if os.path.isfile(file_path):
    normalize_re = re.compile(b"\r\n|\r")

Undocumented Regex Pattern (category: Readability)

What is the issue?

Regular expression pattern is defined inside the function without explanation of what it matches.

Why this matters

Complex regex patterns without documentation or clear variable names make the code harder to understand and maintain.

Suggested change ∙ Feature Preview
# Define at module level with clear name
LINE_ENDING_PATTERN = re.compile(b"\r\n|\r")  # Matches Windows (CRLF) and old Mac (CR) line endings
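
For context, a minimal sketch of what such a pattern is used for: hashing content so that the same file gives the same digest whatever its line endings. The full body of the PR's `hash_file_normalized` is not shown here, so the helper below (`sha1_normalized`, a hypothetical name) works on bytes that have already been read and is only an approximation.

```python
import hashlib
import re

# Matches Windows (CRLF) and old Mac (CR) line endings.
LINE_ENDING_PATTERN = re.compile(b"\r\n|\r")


def sha1_normalized(data: bytes) -> str:
    """Return the SHA-1 hex digest of data with all line endings converted to LF."""
    digest = hashlib.sha1(usedforsecurity=False)  # keyword requires Python 3.9+
    digest.update(LINE_ENDING_PATTERN.sub(b"\n", data))
    return digest.hexdigest()
```

Note that this rewrites CR bytes in any input, including binary files, so two different binary files could in principle hash to the same value.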

Comment on lines +132 to +133
@property
def files(self) -> Iterable[FileInfo]:
    """File info as stored in the metadata."""
    return self._files

Incorrect return type annotation for files property (category: Functionality)

What is the issue?

The files property returns Optional[Iterable[FileInfo]] but is annotated to return Iterable[FileInfo]

Why this matters

This type mismatch could cause runtime errors when consumers expect a non-None return value but receive None

Suggested change ∙ Feature Preview

Update the return type annotation to match the actual return type:

@property
def files(self) -> Optional[Iterable[FileInfo]]:
    """File info as stored in the metadata."""
    return self._files

Comment on lines +386 to +398
if not file_info:
    return bool(on_disk_hash) and on_disk_hash != hash_directory(
        self.local_path, skiplist=[self.__metadata.FILENAME]
    )

Complex Boolean Logic (category: Readability)

What is the issue?

Complex boolean expression with unclear fallback logic mixing multiple conditions.

Why this matters

The nested conditions and boolean operations make it difficult to understand the flow and intention of the code at a glance.

Suggested change ∙ Feature Preview

Split into more explicit conditions:

if not file_info:
    if not on_disk_hash:
        return False
    current_hash = hash_directory(self.local_path, skiplist=[self.__metadata.FILENAME])
    return on_disk_hash != current_hash

Comment on lines +106 to +110
def recursive_listdir(directory):
    """List all entries in the current directory."""
    entries = os.listdir(directory)

Missing Error Handling in Directory Traversal (category: Error Handling)

What is the issue?

The function recursive_listdir() doesn't handle potential permission errors or broken symlinks when accessing directories.

Why this matters

If the function encounters a directory without read permissions or a broken symlink, it will raise an unhandled OSError/PermissionError, causing the entire directory traversal to fail.

Suggested change ∙ Feature Preview

Add error handling to gracefully skip inaccessible directories:

def recursive_listdir(directory):
    """List all entries in the current directory."""
    try:
        entries = os.listdir(directory)
        
        for entry in entries:
            full_path = os.path.join(directory, entry)
            
            try:
                if os.path.isdir(full_path):
                    yield from recursive_listdir(full_path)
                else:
                    yield full_path
            except (OSError, PermissionError):
                continue
    except (OSError, PermissionError):
        return
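
An alternative worth considering is the standard library's `os.walk`, which already skips directories it cannot list unless an `onerror` callback is given. The sketch below is only an illustration (`walk_files` is a hypothetical name) and not a drop-in replacement: unlike the `os.path.isdir` check above, `os.walk` does not descend into symlinked directories by default.

```python
import os
from typing import Iterator


def walk_files(directory: str) -> Iterator[str]:
    """Yield the path of every file below directory, as reported by os.walk."""
    # os.walk ignores errors from listing a directory unless an onerror
    # callback is supplied, so unreadable directories are silently skipped.
    for root, _dirs, files in os.walk(directory):
        for name in files:
            yield os.path.join(root, name)
```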

@@ -102,6 +103,21 @@
]


def recursive_listdir(directory):

Missing Type Hints (category: Readability)

What is the issue?

Missing type hints for both the parameter and return type in the recursive_listdir function.

Why this matters

Type hints help with code understanding, IDE support, and static type checking. Their absence makes it harder to understand what the function expects and returns without diving into the implementation.

Suggested change ∙ Feature Preview
def recursive_listdir(directory: str) -> Generator[str, None, None]:

Comment on lines +108 to +113
entries = os.listdir(directory)

for entry in entries:
    full_path = os.path.join(directory, entry)

Directory Traversal Vulnerability (category: Security)

What is the issue?

The recursive_listdir function is vulnerable to directory traversal attacks if the input directory path is not validated.

Why this matters

Without path validation, malicious input could potentially access files outside the intended directory tree through symbolic links or relative paths.

Suggested change ∙ Feature Preview

Add path validation and resolve symbolic links:

directory = os.path.abspath(directory)
if not os.path.realpath(directory).startswith(os.path.realpath(safe_root)):
    raise ValueError("Access denied: Directory outside allowed path")
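
One caveat with a plain `startswith` comparison is that a sibling directory such as `project-evil` shares the string prefix of `project` and would pass the check. A possible alternative, sketched under the assumption that a `safe_root` is available to the caller (the helper name `is_within` is hypothetical), compares whole path components with `os.path.commonpath`:

```python
import os


def is_within(directory: str, safe_root: str) -> bool:
    """Return True if directory resolves to safe_root or a location below it."""
    root = os.path.realpath(safe_root)
    target = os.path.realpath(directory)
    try:
        # commonpath compares whole path components, not characters.
        return os.path.commonpath([root, target]) == root
    except ValueError:
        # Raised on Windows when the paths are on different drives.
        return False
```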

Comment on lines +114 to +122
with suppress(TypeError):
    metadata_files = Metadata.from_file(self.__metadata.path).files

Unexplained Error Suppression (category: Readability)

What is the issue?

Silent error suppression without explaining why TypeError is expected or can be safely ignored.

Why this matters

Code maintainers will have to dig through the codebase to understand why this error is suppressed, making the code harder to understand and maintain.

Suggested change ∙ Feature Preview

Add a comment explaining the rationale:

# Suppress TypeError when metadata file is invalid or has old format without 'files' field
with suppress(TypeError):
    metadata_files = Metadata.from_file(self.__metadata.path).files

Comment on lines +145 to +159
files_list = (
    FileInfo(
        os.path.basename(self.local_path),
        hash_file_normalized(os.path.join(self.local_path)).hexdigest(),
        oct(os.stat(os.path.join(self.local_path)).st_mode)[-3:],
    ),
)

Redundant file operations (category: Performance)

What is the issue?

Redundant os.path.join calls and file stat operations

Why this matters

Multiple system calls to the same file wastes I/O operations which impacts performance, especially when dealing with many files

Suggested change ∙ Feature Preview

Cache the joined path and file stat results:

full_path = os.path.join(self.local_path)
stat_result = os.stat(full_path)
files_list = (
    FileInfo(
        os.path.basename(self.local_path),
        hash_file_normalized(full_path).hexdigest(),
        oct(stat_result.st_mode)[-3:],
    ),
)
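
As a side note, the permission field could also be derived with an explicit bit mask instead of slicing the `oct()` string. A small sketch (the helper name `permission_string` is hypothetical) that takes the `st_mode` of an `os.stat` result:

```python
def permission_string(st_mode: int) -> str:
    """Format the owner/group/other permission bits as three octal digits."""
    # Mask out the file-type, setuid, setgid and sticky bits, then zero-pad.
    return format(st_mode & 0o777, "03o")
```

For ordinary file modes this gives the same three digits as `oct(st_mode)[-3:]`, while making it explicit which bits are kept.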

@sach-edna
Contributor

According to the YAML specification, | is used for block scalars:

"block scalars are delimited with indentation with optional modifiers to preserve (|) or fold (>) newlines."

So using it to delimit this data may cause some confusion for parsers.

@ben-edna ben-edna force-pushed the prototype-extended-data branch from 635cf97 to 7db9b3e Compare April 10, 2025 15:29