@spoorcc
Contributor

@spoorcc spoorcc commented Oct 18, 2025

Support for pre-commit hooks
Fixes #19

Description by Korbit AI

What change is being made?

Add a basic dfetch filter command that can list or pass through files to a command, integrate it into the CLI, and update supporting utilities, logging, and tooling configuration (pre-commit hooks, changelog, docs).

Why are these changes being made?

To provide a first-class file-filtering capability that can operate on manifest-scoped projects or stdin/args, and to wire it into the existing CLI and supporting utilities for robust usage and testing. This PR also updates tooling integration and documentation to reflect the new feature.


Summary by CodeRabbit

  • New Features

    • Added a new filter command to the CLI for file management.
  • Bug Fixes

    • Improved error handling in argument parsing to prevent fatal errors.
    • Enhanced logging behavior during command execution.
  • Documentation

    • Added filter command documentation to the manual.
    • Added integration tests and demonstration examples for filter functionality.

@spoorcc spoorcc marked this pull request as draft October 18, 2025 22:44

@korbit-ai korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
| Category | Issue |
| --- | --- |
| Error Handling | Incorrect exception type caught for argparse errors |
| Readability | Magic String Attribute Lookup |
| Logging | Command execution logged at DEBUG level |
| Performance | Naive string splitting for command parsing |
| Performance | Memory inefficient stdin processing |
| Performance | Inefficient O(n*m) project path lookup |
| Performance | Expensive path resolution per file |
| Design | Mixed Responsibilities in Entry Point |
| Readability | Unclear list variable names |
| Readability | Non-descriptive tuple return type |
Files scanned
File Path Reviewed
dfetch/log.py
dfetch/util/util.py
dfetch/main.py
dfetch/util/cmdline.py
dfetch/commands/filter.py

Comment on lines 71 to 72
if args.verbose or not getattr(args.func, "SILENT", False):
    logger.print_title()

Magic String Attribute Lookup category Readability

What is the issue?

The use of a magic string 'SILENT' as an attribute lookup makes the code's intent unclear without additional context.

Why this matters

Future maintainers will need to search for where SILENT is defined and understand its purpose. This creates cognitive overhead and potential maintenance issues.

Suggested change ∙ Feature Preview
# Define a constant at module level
SILENT_COMMAND_FLAG = 'SILENT'

# Use in the code
if args.verbose or not getattr(args.func, SILENT_COMMAND_FLAG, False):
    logger.print_title()

Comment on lines +78 to +79
if not isinstance(cmd, list):
    cmd = cmd.split(" ")

Naive string splitting for command parsing category Performance

What is the issue?

String splitting on single space fails for commands with multiple consecutive spaces or complex arguments.

Why this matters

This naive splitting approach will create empty strings in the command list when there are multiple spaces, potentially causing subprocess execution failures or incorrect argument parsing.

Suggested change ∙ Feature Preview

Use shlex.split() instead of str.split(" ") to properly handle shell-like command parsing with quoted arguments and multiple spaces:

import shlex

if not isinstance(cmd, list):
    cmd = shlex.split(cmd)
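
To make the difference concrete, here is a minimal sketch comparing the two approaches. The command string is made up for illustration and does not come from the PR.

```python
import shlex

# Hypothetical command line with a doubled space and a quoted argument.
cmd = 'codespell  --ignore-words-list "foo bar" README.md'

# Splitting on a single space keeps an empty token and breaks the quoted argument.
naive = cmd.split(" ")

# shlex.split applies POSIX shell quoting rules and collapses whitespace runs.
parsed = shlex.split(cmd)

print(naive)
print(parsed)
```

One caveat: shlex.split uses POSIX rules by default, so backslashes are treated as escape characters. That may matter if Windows-style paths are ever passed in as a single command string.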

Comment on lines 136 to 141
for project_path in project_paths:
    try:
        file.relative_to(project_path)
        return project_path
    except ValueError:
        continue

Inefficient O(n*m) project path lookup category Performance

What is the issue?

The file-in-project check performs an O(m) linear search through all project paths for each file, resulting in O(n*m) complexity, where n is the number of files and m is the number of projects.

Why this matters

With many files and projects, this nested loop does work proportional to the product of both counts, which will slow down filtering operations as the number of projects grows.

Suggested change ∙ Feature Preview

Pre-sort project paths by depth (deepest first) and use early termination, or consider using a trie-based structure for path prefix matching to reduce average case complexity.
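
As a simpler alternative to a trie, one possible sketch is to walk up the ancestors of each file and test each one for membership in the project set. The function name find_containing_project is hypothetical and not part of the PR, and the sketch assumes file and project paths have been normalised the same way (for example, both resolved). PurePosixPath is used only to keep the example platform-independent.

```python
from pathlib import PurePath, PurePosixPath
from typing import Optional


def find_containing_project(
    file: PurePath, project_paths: set[PurePath]
) -> Optional[PurePath]:
    """Return the deepest project path that contains ``file``, or None."""
    # Check the file itself, then each ancestor from deepest to shallowest.
    # Every check is a set membership test, so the cost per file depends on
    # the path depth rather than on the number of projects.
    for candidate in (file, *file.parents):
        if candidate in project_paths:
            return candidate
    return None


# Hypothetical project layout for illustration.
projects = {PurePosixPath("/repo/ext/libfoo"), PurePosixPath("/repo/ext/libbar")}
print(find_containing_project(PurePosixPath("/repo/ext/libfoo/src/a.c"), projects))
print(find_containing_project(PurePosixPath("/repo/src/main.c"), projects))
```

Because the walk starts at the file and moves upward, nested projects resolve to the innermost match without any pre-sorting.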


block_outside: list[str] = []

for path_or_arg in input_list:
    arg_abs_path = Path(pwd / path_or_arg.strip()).resolve()

Expensive path resolution per file category Performance

What is the issue?

Path resolution with resolve() is called for every input file, which involves expensive filesystem operations including symlink resolution and path canonicalization.

Why this matters

The resolve() method performs multiple filesystem syscalls per file, creating significant I/O overhead that scales linearly with the number of input files and can become a bottleneck for large file sets.

Suggested change ∙ Feature Preview

Cache resolved paths or use absolute path construction without full resolution when symlink handling isn't critical:

arg_abs_path = (pwd / path_or_arg.strip()).absolute()

Only call resolve() when necessary for symlink handling.
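
One thing to weigh before adopting this: absolute() does not collapse ".." components, which matters for a command that also blocks path traversal. The sketch below, assuming a POSIX system, contrasts absolute() with a purely lexical normalisation via os.path.normpath. The helper name lexical_abs_path is hypothetical and not part of the PR.

```python
import os
from pathlib import Path


def lexical_abs_path(pwd: Path, path_or_arg: str) -> Path:
    """Join and normalise without touching the filesystem.

    Symlinks are NOT resolved, so the result can differ from resolve()
    when a path component is a symlink.
    """
    return Path(os.path.normpath(pwd / path_or_arg.strip()))


pwd = Path("/repo/sub")

# absolute() leaves the '..' component in place.
print((pwd / "../ext/a.c").absolute())

# normpath collapses '..' lexically, with no filesystem access.
print(lexical_abs_path(pwd, "../ext/a.c"))
```

Whether the lexical form is acceptable depends on whether vendored projects can be reached through symlinks. If they can, resolve() remains the safer choice.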


help="Arguments to pass to the command",
)

def __call__(self, args: argparse.Namespace) -> None:

Mixed Responsibilities in Entry Point category Design

What is the issue?

The __call__ method mixes configuration, business logic, and output handling in a single method.

Why this matters

This violates the Single Responsibility Principle and makes the code less maintainable and harder to test individual components.

Suggested change ∙ Feature Preview

Split the __call__ method into separate methods for configuration, filtering, and output handling:

def __call__(self, args: argparse.Namespace) -> None:
    self._configure_logging(args)
    filtered_args = self._process_filtering(args)
    self._handle_output(args, filtered_args)

Comment on lines 105 to 106
block_inside: list[str] = []
block_outside: list[str] = []

Unclear list variable names category Readability

What is the issue?

The variable names 'block_inside' and 'block_outside' are not immediately clear about what they represent in the context of file filtering.

Why this matters

Unclear variable names force readers to trace through the code to understand their purpose, increasing cognitive load.

Suggested change ∙ Feature Preview
files_inside_projects: list[str] = []
files_outside_projects: list[str] = []

Comment on lines 101 to 103
def _filter_files(
    self, pwd: Path, topdir: Path, project_paths: set[Path], input_list: list[str]
) -> tuple[list[str], list[str]]:

Non-descriptive tuple return type category Readability

What is the issue?

The return type annotation using tuple[list[str], list[str]] is not descriptive enough to understand what the two lists represent.

Why this matters

Generic tuple return types make it difficult to understand the meaning of each component without looking at the implementation.

Suggested change ∙ Feature Preview
from typing import NamedTuple

class FilterResult(NamedTuple):
    files_inside_projects: list[str]
    files_outside_projects: list[str]

def _filter_files(
    self, pwd: Path, topdir: Path, project_paths: set[Path], input_list: list[str]
) -> FilterResult:
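
The suggestion stops at the signature, so here is a small self-contained sketch of how such a result type might be produced and consumed. The partition_files function is a hypothetical stand-in, not the PR's _filter_files, and its containment check is deliberately simplified.

```python
from pathlib import PurePath, PurePosixPath
from typing import NamedTuple


class FilterResult(NamedTuple):
    files_inside_projects: list[str]
    files_outside_projects: list[str]


def partition_files(
    project_paths: set[PurePath], input_list: list[str]
) -> FilterResult:
    """Split inputs into those under a project path and those that are not."""
    inside: list[str] = []
    outside: list[str] = []
    for item in input_list:
        path = PurePosixPath(item)
        # A file counts as inside if it is a project path or lies beneath one.
        if any(p == path or p in path.parents for p in project_paths):
            inside.append(item)
        else:
            outside.append(item)
    return FilterResult(inside, outside)


result = partition_files(
    {PurePosixPath("ext/libfoo")}, ["ext/libfoo/a.c", "src/main.c"]
)
# Fields are readable by name at the call site...
print(result.files_inside_projects)
# ...and existing tuple unpacking keeps working.
inside, outside = result
```

Because a NamedTuple is still a tuple, callers that already unpack two lists would not need to change.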

@coderabbitai

coderabbitai bot commented Nov 1, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

A new filter command is introduced to dfetch that evaluates whether files are under dfetch project control. The implementation includes command registration, logging enhancements, subprocess utilities, documentation updates, pre-commit hook integration, and feature tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Core filter command: dfetch/commands/filter.py, dfetch/commands/command.py, dfetch/__main__.py | New Filter command class with FilterType enum (BLOCK_ONLY_PATH_TRAVERSAL, BLOCK_IF_INSIDE, BLOCK_IF_OUTSIDE) enabling file filtering logic. Added silent() method to Command base class. Registered Filter in CLI parser with adjusted error handling. |
| Utility layer: dfetch/log.py, dfetch/util/cmdline.py, dfetch/util/util.py | Added set_level() logging function; new run_on_cmdline_uncaptured() for executing shell commands without output capture; updated in_directory() to accept Union[str, Path] and yield strings. |
| Changelog: CHANGELOG.rst | Added entry for new filter command (#19). |
| CI/Workflow: .github/workflows/run.yml, .pre-commit-config.yaml | Added dfetch filter steps to test-cygwin and example workflows; replaced isort, black, and codespell hooks to use dfetch with filter arguments. |
| Documentation: doc/manual.rst, doc/asciicasts/filter.cast, doc/generate-casts/* | Added Filter section to manual; created asciicast demonstration; added filter-demo.sh script and integrated into cast generation pipeline. |
| Tests & Features: features/filter-projects.feature, features/steps/generic_steps.py | New Gherkin feature file verifying filter command lists dfetch-controlled files; added stdout teeing context manager and abs_path pattern to test steps for output capture and normalization. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as dfetch filter
    participant Resolve as Argument Resolution
    participant Filter as File Filtering
    participant Exec as Command Execution

    User->>CLI: dfetch filter [--dfetched|-D] [cmd args...]
    CLI->>Resolve: _get_arguments() + _resolve_args()
    Resolve->>Resolve: Read stdin or CLI args
    Resolve->>Resolve: Expand to all non-.git files if empty
    Resolve-->>CLI: Map args to resolved Paths
    CLI->>Filter: _filter_files(topdir, project_paths, input_paths, block_strategy)
    Filter->>Filter: Determine blocklist based on FilterType<br/>(BLOCK_IF_INSIDE, BLOCK_IF_OUTSIDE, BLOCK_ONLY_PATH_TRAVERSAL)
    Filter-->>CLI: Filtered argument list
    alt cmd provided
        CLI->>Exec: run_on_cmdline_uncaptured(filtered_args)
        Exec-->>User: Execute command with filtered args
    else no cmd
        CLI-->>User: Print filtered args to stdout
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • dfetch/commands/filter.py: Substantial new module with multiple interdependent methods (file filtering logic, path containment checks, argument resolution), FilterType enum, and intricate control flow requiring careful validation of filtering strategies.
  • features/steps/generic_steps.py: Context manager and output capture refactoring affects test infrastructure; requires verification that stdout teeing and path normalization don't break existing tests.
  • .pre-commit-config.yaml: Hook restructuring redirects multiple tools through dfetch; confirm filter arguments correctly exclude/include intended file categories.
  • Integration points: Changes span core command logic, utilities, workflows, and tests; ensure consistency across layers.

Poem

A filter blooms in dfetch's garden fair, 🌿
Sorting files with logic and with care,
Pre-commit hooks now guard the repo's gate,
Keeping unwanted changes from their fate—
🐰 Your code's protected, spotless, and complete!

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The PR title is incomplete and vague, ending with 'as suggested in' without specifying what was suggested or referencing the issue number. | Complete the title to clearly reference the main change, e.g., 'Add basic filter command for pre-commit hook support' or include the issue number. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | The PR successfully implements a filter command to support pre-commit hooks by checking dfetch control status on files, meeting the requirements of issue #19. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: the filter command implementation, CLI integration, utilities, logging support, documentation, pre-commit config, and test infrastructure are all necessary for delivering the filter command feature. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@spoorcc
Contributor Author

spoorcc commented Nov 5, 2025

@coderabbitai review

@coderabbitai

coderabbitai bot commented Nov 5, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
dfetch/log.py (1)

60-62: Consider enhancing the docstring with parameter details.

The function wraps coloredlogs.set_level but doesn't document which level values are accepted. Consider adding an Args section that specifies valid levels (e.g., "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL").

Apply this diff to improve the documentation:

 def set_level(level: str) -> None:
-    """Set the level of the logger."""
+    """Set the level of the logger.
+    
+    Args:
+        level: The logging level (e.g., "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL").
+    """
     coloredlogs.set_level(level)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c1a84e and 022d7b7.

📒 Files selected for processing (15)
  • .github/workflows/run.yml (2 hunks)
  • .pre-commit-config.yaml (2 hunks)
  • CHANGELOG.rst (1 hunks)
  • dfetch/__main__.py (4 hunks)
  • dfetch/commands/command.py (1 hunks)
  • dfetch/commands/filter.py (1 hunks)
  • dfetch/log.py (1 hunks)
  • dfetch/util/cmdline.py (1 hunks)
  • dfetch/util/util.py (1 hunks)
  • doc/asciicasts/filter.cast (1 hunks)
  • doc/generate-casts/filter-demo.sh (1 hunks)
  • doc/generate-casts/generate-casts.sh (1 hunks)
  • doc/manual.rst (2 hunks)
  • features/filter-projects.feature (1 hunks)
  • features/steps/generic_steps.py (5 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
doc/generate-casts/filter-demo.sh

[warning] 9-9: Use 'pushd ... || exit' or 'pushd ... || return' in case pushd fails.

(SC2164)


[warning] 21-21: PROMPT_TIMEOUT appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 26-26: Use 'popd ... || exit' or 'popd ... || return' in case popd fails.

(SC2164)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: test (macos-latest, 3.13)
  • GitHub Check: test (windows-latest, 3.14)
  • GitHub Check: test (windows-latest, 3.13)
  • GitHub Check: test (macos-latest, 3.10)
  • GitHub Check: test (windows-latest, 3.9)
  • GitHub Check: test (windows-latest, 3.11)
  • GitHub Check: test (windows-latest, 3.12)
  • GitHub Check: test (windows-latest, 3.10)
  • GitHub Check: test (macos-latest, 3.14)
  • GitHub Check: test-cygwin
  • GitHub Check: DevContainer Build & Test
  • GitHub Check: build (macos-latest)
  • GitHub Check: build (windows-latest)
  • GitHub Check: build (ubuntu-latest)
  • GitHub Check: test
🔇 Additional comments (11)
CHANGELOG.rst (1)

18-18: LGTM!

The changelog entry correctly documents the new filter command and references the related issue.

dfetch/util/util.py (1)

66-73: LGTM!

The change to always yield str improves consistency in the API. The function now accepts both str and Path as input but consistently returns str, which aligns well with the new filter command's string-based path handling.

.pre-commit-config.yaml (2)

17-18: LGTM!

The integration of the new filter command into pre-commit hooks is well-structured and consistent. The --not-dfetched flag correctly filters out dfetch-managed files from being processed by isort, black, and codespell, which addresses the requirement from issue #19.

Also applies to: 24-25, 104-105


107-107: Clarify the commented exclude line.

The commented exclusion for the asciinema player file is unclear. Should this be uncommented, or is it left as a reference? If it's no longer needed due to the filter command handling exclusions, consider removing it entirely to avoid confusion.

features/steps/generic_steps.py (2)

34-56: LGTM!

The tee_stdout context manager is well-implemented. It correctly duplicates stdout to both the original stream and an in-memory buffer, which is essential for capturing output from the new filter command that writes directly to stdout.


67-78: LGTM!

The updated call_command function correctly uses the new tee_stdout context manager to capture both the traditional captured output and the raw stdout. This dual-capture approach properly supports testing commands that may write directly to stdout versus using the logging framework.

dfetch/commands/command.py (1)

34-43: LGTM!

The silent() method is a well-designed addition to the Command base class. It provides a clean mechanism for commands (like the new Filter command) to opt out of printing the dfetch title, which is appropriate for commands designed to be used in pipelines or scripts.

.github/workflows/run.yml (1)

46-46: LGTM!

The integration of the filter command into CI workflows demonstrates both usage modes: standalone (line 46) and with piped input from find (line 60). This validates that the command works correctly in automated environments.

Also applies to: 60-60

dfetch/util/cmdline.py (1)

78-83: Use shell-aware splitting to preserve complex commands.

Duplicating the cmd.split(" ") logic means quoted arguments or consecutive spaces still break (e.g., --flag="two words" will be split into two tokens that still carry the quote characters). Since this helper is new, please switch to shlex.split (and mirror the change in run_on_cmdline) so both helpers can execute real-world commands without mangling their argv.

-import logging
+import logging
+import shlex
@@
-    if not isinstance(cmd, list):
-        cmd = cmd.split(" ")
+    if not isinstance(cmd, list):
+        cmd = shlex.split(cmd)
@@
-    if not isinstance(cmd, list):
-        cmd = cmd.split(" ")
+    if not isinstance(cmd, list):
+        cmd = shlex.split(cmd)
dfetch/__main__.py (1)

71-72: Fix crash when dfetch is invoked without a subcommand.

parser.set_defaults(func=_help) means args.func can be the plain _help function, which has no silent() attribute. Calling args.func.silent() now raises AttributeError, so running plain dfetch (or any path that leaves the default handler in place) crashes before we can show the help text. Please guard this call (e.g., via getattr(args.func, "silent", lambda: False)()) so non-command invocations keep working.

-    if args.verbose or not args.func.silent():
+    silent_check = getattr(args.func, "silent", lambda: False)
+    if args.verbose or not silent_check():
         logger.print_title()
dfetch/commands/filter.py (1)

99-105: Allow combined dfetched/not-dfetched mode to work as advertised.

If a user supplies both --dfetched and --not-dfetched, they’re clearly asking to forward every file while still blocking path traversal. Today that combination falls into the second branch and drops the dfetched files instead, making the “allow both” use-case impossible and leaving FilterType.BLOCK_ONLY_PATH_TRAVERSAL effectively dead code. Reorder the conditions (or add an explicit check for both flags) so the combined mode yields BLOCK_ONLY_PATH_TRAVERSAL.

-        if args.dfetched and not args.not_dfetched:
-            block_type = FilterType.BLOCK_IF_OUTSIDE
-        elif args.not_dfetched:
-            block_type = FilterType.BLOCK_IF_INSIDE
-        else:
-            block_type = FilterType.BLOCK_ONLY_PATH_TRAVERSAL
+        if args.dfetched and args.not_dfetched:
+            block_type = FilterType.BLOCK_ONLY_PATH_TRAVERSAL
+        elif args.dfetched:
+            block_type = FilterType.BLOCK_IF_OUTSIDE
+        elif args.not_dfetched:
+            block_type = FilterType.BLOCK_IF_INSIDE
+        else:
+            block_type = FilterType.BLOCK_ONLY_PATH_TRAVERSAL
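
The four flag combinations can be tabulated with a short sketch. It assumes both flags are plain booleans on the parsed namespace, and the enum values are placeholders since only the member names appear in this review. The helper select_block_type is hypothetical and folds the "both" and "neither" cases into a single comparison.

```python
import argparse
from enum import Enum, auto


class FilterType(Enum):
    BLOCK_ONLY_PATH_TRAVERSAL = auto()
    BLOCK_IF_INSIDE = auto()
    BLOCK_IF_OUTSIDE = auto()


def select_block_type(args: argparse.Namespace) -> FilterType:
    # Both flags set, or neither: forward everything, block only traversal.
    if args.dfetched == args.not_dfetched:
        return FilterType.BLOCK_ONLY_PATH_TRAVERSAL
    # Only --dfetched: drop files outside dfetch-managed projects.
    if args.dfetched:
        return FilterType.BLOCK_IF_OUTSIDE
    # Only --not-dfetched: drop files inside dfetch-managed projects.
    return FilterType.BLOCK_IF_INSIDE


for dfetched in (False, True):
    for not_dfetched in (False, True):
        ns = argparse.Namespace(dfetched=dfetched, not_dfetched=not_dfetched)
        print(dfetched, not_dfetched, select_block_type(ns).name)
```

This yields the same mapping as the suggested diff while making the symmetry between the two flags explicit.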

Comment on lines +155 to +164

Filter
------
.. argparse::
   :module: dfetch.__main__
   :func: create_parser
   :prog: dfetch
   :path: filter

.. automodule:: dfetch.commands.filter

⚠️ Potential issue | 🟠 Major

Remove duplicate Filter section.

The Filter section is duplicated in the documentation. It appears at lines 118-130 (after Freeze) and again here at lines 155-164 (after Import). The second occurrence should be removed to avoid duplication and potential documentation build issues.

Apply this diff to remove the duplicate:

 .. automodule:: dfetch.commands.import_
-
-Filter
-------
-.. argparse::
-   :module: dfetch.__main__
-   :func: create_parser
-   :prog: dfetch
-   :path: filter
-
-.. automodule:: dfetch.commands.filter
🤖 Prompt for AI Agents
In doc/manual.rst around lines 155 to 164, the "Filter" section is a duplicate
of the earlier section (lines ~118-130); remove the entire duplicate block
(lines 155-164) so only the first "Filter" section remains and update any
surrounding spacing or TOC references if necessary to keep formatting
consistent.
