-
Notifications
You must be signed in to change notification settings - Fork 4
Prototype extended data #660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
/korbit-review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
Category | Issue | Fix Detected |
---|---|---|
Missing File Existence Check ▹ view | ||
Undocumented Regex Pattern ▹ view | ||
Incorrect return type annotation for files property ▹ view | ||
Complex Boolean Logic ▹ view | ||
Missing Error Handling in Directory Traversal ▹ view | ||
Missing Type Hints ▹ view | ||
Directory Traversal Vulnerability ▹ view | ||
Unexplained Error Suppression ▹ view | ||
Redundant file operations ▹ view |
Files scanned
File Path | Reviewed |
---|---|
dfetch/util/util.py | ✅ |
dfetch/project/metadata.py | ✅ |
dfetch/project/vcs.py | ✅ |
Explore our documentation to understand the languages and file types we support and the files we ignore.
Need a new review? Comment
/korbit-review
on this PR and I'll review your latest changes.Korbit Guide: Usage and Customization
Interacting with Korbit
- You can manually ask Korbit to review your PR using the
/korbit-review
command in a comment at the root of your PR.- You can ask Korbit to generate a new PR description using the
/korbit-generate-pr-description
command in any comment on your PR.- Too many Korbit comments? I can resolve all my comment threads if you use the
/korbit-resolve
command in any comment on your PR.- On any given comment that Korbit raises on your pull request, you can have a discussion with Korbit by replying to the comment.
- Help train Korbit to improve your reviews by giving a 👍 or 👎 on the comments Korbit posts.
Customizing Korbit
- Check out our docs on how you can make Korbit work best for you and your team.
- Customize Korbit for your organization through the Korbit Console.
Current Korbit Configuration
General Settings
Setting Value Review Schedule Automatic excluding drafts Max Issue Count 10 Automatic PR Descriptions ✅ Issue Categories
Category Enabled Documentation ✅ Logging ✅ Error Handling ✅ Readability ✅ Design ✅ Performance ✅ Security ✅ Functionality ✅ Feedback and Support
full_path = os.path.join(self.local_path, file.path) | ||
if hash_file_normalized(full_path).hexdigest() != file.hash: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing File Existence Check 
Tell me more
What is the issue?
The code doesn't check if the file exists before attempting to hash it, which could cause crashes.
Why this matters
If a file was deleted but still exists in the metadata, this will raise an unhandled exception when trying to hash a non-existent file.
Suggested change ∙ Feature Preview
Add file existence check before hashing:
full_path = os.path.join(self.local_path, file.path)
if not os.path.exists(full_path):
logger.debug(f"File {full_path} no longer exists!")
return True
if hash_file_normalized(full_path).hexdigest() != file.hash:
💬 Looking for more details? Reply to this comment to chat with Korbit.
digest = hashlib.sha1(usedforsecurity=False) | ||
|
||
if os.path.isfile(file_path): | ||
normalize_re = re.compile(b"\r\n|\r") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undocumented Regex Pattern 
Tell me more
What is the issue?
Regular expression pattern is defined inside the function without explanation of what it matches.
Why this matters
Complex regex patterns without documentation or clear variable names make the code harder to understand and maintain.
Suggested change ∙ Feature Preview
# Define at module level with clear name
LINE_ENDING_PATTERN = re.compile(b"\r\n|\r") # Matches Windows (CRLF) and old Mac (CR) line endings
💬 Looking for more details? Reply to this comment to chat with Korbit.
@property | ||
def files(self) -> Iterable[FileInfo]: | ||
"""File info as stored in the metadata.""" | ||
return self._files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect return type annotation for files property 
Tell me more
What is the issue?
The files property returns Optional[Iterable[FileInfo]] but is annotated to return Iterable[FileInfo]
Why this matters
This type mismatch could cause runtime errors when consumers expect a non-None return value but receive None
Suggested change ∙ Feature Preview
Update the return type annotation to match the actual return type:
@property
def files(self) -> Optional[Iterable[FileInfo]]:
"""File info as stored in the metadata."""
return self._files
💬 Looking for more details? Reply to this comment to chat with Korbit.
if not file_info: | ||
return bool(on_disk_hash) and on_disk_hash != hash_directory( | ||
self.local_path, skiplist=[self.__metadata.FILENAME] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complex Boolean Logic 
Tell me more
What is the issue?
Complex boolean expression with unclear fallback logic mixing multiple conditions.
Why this matters
The nested conditions and boolean operations make it difficult to understand the flow and intention of the code at a glance.
Suggested change ∙ Feature Preview
Split into more explicit conditions:
if not file_info:
if not on_disk_hash:
return False
current_hash = hash_directory(self.local_path, skiplist=[self.__metadata.FILENAME])
return on_disk_hash != current_hash
💬 Looking for more details? Reply to this comment to chat with Korbit.
def recursive_listdir(directory): | ||
"""List all entries in the current directory.""" | ||
entries = os.listdir(directory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing Error Handling in Directory Traversal 
Tell me more
What is the issue?
The function recursive_listdir() doesn't handle potential permission errors or broken symlinks when accessing directories.
Why this matters
If the function encounters a directory without read permissions or a broken symlink, it will raise an unhandled OSError/PermissionError, causing the entire directory traversal to fail.
Suggested change ∙ Feature Preview
Add error handling to gracefully skip inaccessible directories:
def recursive_listdir(directory):
"""List all entries in the current directory."""
try:
entries = os.listdir(directory)
for entry in entries:
full_path = os.path.join(directory, entry)
try:
if os.path.isdir(full_path):
yield from recursive_listdir(full_path)
else:
yield full_path
except (OSError, PermissionError):
continue
except (OSError, PermissionError):
return
💬 Looking for more details? Reply to this comment to chat with Korbit.
@@ -102,6 +103,21 @@ | |||
] | |||
|
|||
|
|||
def recursive_listdir(directory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing Type Hints 
Tell me more
What is the issue?
Missing type hints for both the parameter and return type in the recursive_listdir function.
Why this matters
Type hints help with code understanding, IDE support, and static type checking. Their absence makes it harder to understand what the function expects and returns without diving into the implementation.
Suggested change ∙ Feature Preview
def recursive_listdir(directory: str) -> Generator[str, None, None]:
💬 Looking for more details? Reply to this comment to chat with Korbit.
entries = os.listdir(directory) | ||
|
||
for entry in entries: | ||
full_path = os.path.join(directory, entry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Directory Traversal Vulnerability 
Tell me more
What is the issue?
The recursive_listdir function is vulnerable to directory traversal attacks if the input directory path is not validated.
Why this matters
Without path validation, malicious input could potentially access files outside the intended directory tree through symbolic links or relative paths.
Suggested change ∙ Feature Preview
Add path validation and resolve symbolic links:
directory = os.path.abspath(directory)
if not os.path.realpath(directory).startswith(os.path.realpath(safe_root)):
raise ValueError("Access denied: Directory outside allowed path")
💬 Looking for more details? Reply to this comment to chat with Korbit.
with suppress(TypeError): | ||
metadata_files = Metadata.from_file(self.__metadata.path).files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unexplained Error Suppression 
Tell me more
What is the issue?
Silent error suppression without explaining why TypeError is expected or can be safely ignored.
Why this matters
Code maintainers will have to dig through the codebase to understand why this error is suppressed, making the code harder to understand and maintain.
Suggested change ∙ Feature Preview
Add a comment explaining the rationale:
# Suppress TypeError when metadata file is invalid or has old format without 'files' field
with suppress(TypeError):
metadata_files = Metadata.from_file(self.__metadata.path).files
💬 Looking for more details? Reply to this comment to chat with Korbit.
files_list = ( | ||
FileInfo( | ||
os.path.basename(self.local_path), | ||
hash_file_normalized(os.path.join(self.local_path)).hexdigest(), | ||
oct(os.stat(os.path.join(self.local_path)).st_mode)[-3:], | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant file operations 
Tell me more
What is the issue?
Redundant os.path.join calls and file stat operations
Why this matters
Multiple system calls to the same file wastes I/O operations which impacts performance, especially when dealing with many files
Suggested change ∙ Feature Preview
Cache the joined path and file stat results:
full_path = os.path.join(self.local_path)
stat_result = os.stat(full_path)
files_list = (
FileInfo(
os.path.basename(self.local_path),
hash_file_normalized(full_path).hexdigest(),
oct(stat_result.st_mode)[-3:],
),
)
💬 Looking for more details? Reply to this comment to chat with Korbit.
According to the Yaml specification
|
635cf97
to
7db9b3e
Compare
This is a prototype of enriching the metadata (
.dfetch_data.yml
) with data about each individual fetched file.This could help solve a few issues:
Next to that, it may assist in more enhancements later on:
dfetch diff
only creates a patch for fetched files.A disadvantage would be that the metadata file becomes longer and
dfetch
becomes slower.This branch sort of works, but it is proof-of-concept. Some things still required:
--force
was provided.Note
Since this may have significant impact on users, I would especially like to get the metadata layout right.
Backwards compatibility is off-course something that is a must to not annoy current users.
I'm looking for any feedback, positive or negative 😉 ( @jgeudens @sach-edna @deminngi
) .
dfetch
make it possible to skip adding the extra info?dfetch
track more outside the content and permissions?Here is a part of the beginning of the metadata file from an example project: