-
Notifications
You must be signed in to change notification settings - Fork 1
add binary file detection #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 18 18
Lines 2180 2185 +5
Branches 434 442 +8
=========================================
+ Hits 2180 2185 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
WalkthroughReplaces extension-based file filtering with a heuristic binary detector and asynchronous recursive traversal in McpTool; updates tests to mock binary reads and path behavior; bumps package version to 0.3.1; removes two commented coverage-report steps from CI workflow; expands .gitignore sqlite pattern and enhances ingestion log messages; removes Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Caller
participant McpTool
participant FS as FileSystem
Caller->>McpTool: fetchFromDirectory(contextDir)
activate McpTool
loop recursive directory walk
McpTool->>FS: read first 8KB (no encoding)
FS-->>McpTool: Buffer
alt binary detected (null byte or high non-printable ratio)
McpTool-->>McpTool: skip file
else non-binary
McpTool->>FS: readFile(file, "utf-8")
FS-->>McpTool: text content
McpTool-->>McpTool: append content block
end
end
McpTool-->>Caller: return ServerResult with collected docs
deactivate McpTool
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. ✨ Finishing Touches
🧪 Generate unit tests
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/src/core/mcp.test.ts (1)
66-71: Fix meta assertion: code now includes dir in ingest metainit() passes meta.dir (undefined when no contextDir). The exact-match assertion will fail. Allow extra keys or include dir: undefined.
- expect(ingestSpy).toHaveBeenCalledWith( - { id: 'source-1', meta: { name: 'Bar-MCP', url: 'http://example.com', file: undefined } }, + expect(ingestSpy).toHaveBeenCalledWith( + { id: 'source-1', meta: expect.objectContaining({ name: 'Bar-MCP', url: 'http://example.com', file: undefined }) }, ['c1', 'c2'], { mcpId: 'foo', toolId: 'bar' } );If you prefer strict equality, add
dir: undefinedto the expected meta instead.
🧹 Nitpick comments (4)
src/core/mcp.ts (3)
257-261: Call new isBinaryFile signatureIf you implement the refactor above, adjust the call to drop the readFileSync argument.
- const isBinary = await this.isBinaryFile(full, readFileSync); + const isBinary = await this.isBinaryFile(full);
243-268: Skip common heavy folders during walkTo avoid needless traversal and I/O (node_modules, .git, dist, build, coverage, etc.), add a small ignore set.
- const walk = async (dir: string) => { + const walk = async (dir: string) => { + const IGNORE = new Set(['node_modules', '.git', 'dist', 'build', 'coverage']); let entries: string[] = []; try { entries = readdirSync(dir); } catch { return; } for (const entry of entries) { + if (IGNORE.has(entry)) continue; const full = pathMod.join(dir, entry);
274-275: Consider debug logging on read failuresSilently swallowing per-file read errors can hide actionable issues (permissions, encoding). Emit a debug-level log with the path once per failure.
test/src/core/mcp.test.ts (1)
203-205: Move unmocking to afterEach for isolationTo avoid accidental leakage if future assertions throw before cleanup, consider unmocking in an afterEach hook.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
.github/workflows/ci.yml(0 hunks)package.json(1 hunks)src/core/mcp.ts(4 hunks)test/src/core/mcp.test.ts(2 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/ci.yml
🔇 Additional comments (4)
package.json (1)
3-3: Version bump looks goodNo other changes detected here.
src/core/mcp.ts (1)
338-345: LGTM: explicit ServerResult typingThe explicit type annotation is clear and safe; appending prompt later is fine.
test/src/core/mcp.test.ts (2)
116-136: Good binary-detection mockingClear separation between binary-probe reads (encoding: null) and text reads.
138-152: Path mock is adequate for the test scopeJoin/relative/dirname implementations are minimal and sufficient here.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/core/mcp.ts (1)
259-261: Fix signature mismatch: remove extra argument to isBinaryFile
Call site still passes readFileSync, but isBinaryFile now accepts only (filePath). This is a TS error.- const isBinary = await this.isBinaryFile(full, readFileSync); + const isBinary = await this.isBinaryFile(full);Run to ensure no other call sites pass extra args:
#!/bin/bash rg -nP -C1 'isBinaryFile\s*\([^,\n)]+,\s*' --type=ts
🧹 Nitpick comments (3)
src/core/mcp.ts (3)
211-227: Heuristic may misclassify valid UTF‑8 as binary
Counting bytes >126 as “non‑printable” penalizes multi‑byte UTF‑8 text (non‑ASCII languages). Consider a quick UTF‑8 sanity check (e.g., attempt decode and detect U+FFFD rate) before concluding “binary.” I can draft a lightweight check if you want.
239-242: Build paths with path.join for cross‑platform correctness
Avoid string concatenation for paths.- const dirToRead = `${baseDir}/${this.spec.contextDir}`; - const { readdirSync, statSync, readFileSync } = await import('node:fs'); - const pathMod = await import('node:path'); + const pathMod = await import('node:path'); + const dirToRead = pathMod.join(baseDir, this.spec.contextDir!); + const { readdirSync, statSync, readFileSync } = await import('node:fs');
254-257: Skip symlinks to avoid cycles and surprises
Following symlinks can create loops or escape the intended dir. Prefer lstatSync and skip symbolic links.- const st = statSync(full); + const st = lstatSync(full); + if (st.isSymbolicLink?.()) { + continue; + }Also update the import:
- const { readdirSync, statSync, readFileSync } = await import('node:fs'); + const { readdirSync, lstatSync, readFileSync } = await import('node:fs');
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/core/mcp.ts(4 hunks)
🔇 Additional comments (1)
src/core/mcp.ts (1)
339-346: LGTM on the ServerResult construction
Type annotation and payload wiring look correct.
| private async isBinaryFile(filePath: string): Promise<boolean> { | ||
| const { openSync, readSync, closeSync } = await import('node:fs'); | ||
| try { | ||
| // Read first 8KB only | ||
| const fd = openSync(filePath, 'r'); | ||
| const buffer = Buffer.allocUnsafe(8192); | ||
| const bytesRead = readSync(fd, buffer, 0, 8192, 0); | ||
| closeSync(fd); | ||
| const chunk = buffer.subarray(0, bytesRead); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close the file descriptor on all paths (use finally)
If readSync throws, closeSync isn’t reached, leaking the fd. Wrap closeSync in a finally and guard it.
- private async isBinaryFile(filePath: string): Promise<boolean> {
- const { openSync, readSync, closeSync } = await import('node:fs');
- try {
- // Read first 8KB only
- const fd = openSync(filePath, 'r');
- const buffer = Buffer.allocUnsafe(8192);
- const bytesRead = readSync(fd, buffer, 0, 8192, 0);
- closeSync(fd);
- const chunk = buffer.subarray(0, bytesRead);
+ private async isBinaryFile(filePath: string): Promise<boolean> {
+ const { openSync, readSync, closeSync } = await import('node:fs');
+ let fd: number | undefined;
+ try {
+ // Read first 8KB only
+ fd = openSync(filePath, 'r');
+ const buffer = Buffer.allocUnsafe(8192);
+ const bytesRead = readSync(fd, buffer, 0, 8192, 0);
+ const chunk = buffer.subarray(0, bytesRead);
@@
- } catch {
- // If we can't read the file, assume it's binary to be safe
- return true;
- }
+ } catch {
+ // If we can't read the file, assume it's binary to be safe
+ return true;
+ } finally {
+ if (fd !== undefined) {
+ try { closeSync(fd); } catch {}
+ }
+ }Also applies to: 228-231
🤖 Prompt for AI Agents
In src/core/mcp.ts around lines 196 to 205 the file descriptor opened with
openSync may not be closed if readSync throws; change the function to ensure
closeSync is always called by moving closeSync into a finally block (track the
fd variable outside try, check fd !== undefined before closing) so the
descriptor is closed on success and on error. Apply the same pattern to the
related code at lines 228-231 to guard against fd leaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (6)
.gitignore (1)
19-19: Ignore SQLite sidecar files tooAdd WAL/SHM/journal patterns so devs don’t accidentally commit SQLite sidecars.
*.sqlite +*.sqlite-wal +*.sqlite-shm +*.sqlite-journalsrc/lib/store.ts (1)
176-179: Make progress logs 1-based (friendlier) and less ambiguousCurrent logs show 0-based chunk numbers. Consider 1-based display; also “Finished ingestion with X chunks” can be read as inserted count when it’s the processed index. Keep as-is if you prefer, but this improves clarity.
- log.warn(`[${mcpId}/${toolId}] Embedding chunk ${idx} of ${chunks.length}`); + log.warn(`[${mcpId}/${toolId}] Embedding chunk ${idx + 1} of ${chunks.length}`); - log.warn(`[${mcpId}/${toolId}] Inserting chunk ${idx} of ${chunks.length}`); + log.warn(`[${mcpId}/${toolId}] Inserting chunk ${idx + 1} of ${chunks.length}`);Optional: also log “processed vs inserted” totals at the end to reflect skips due to hash matches.
test/src/core/mcp.test.ts (4)
116-137: Drop readFileSync “binary detection” branch to match implementationBinary sniffing is now via openSync/readSync; keeping a readFileSync({ encoding: null }) path adds confusion. Simplify to text-read paths only here.
153-185: Track fd→path instead of peeking last open callUsing the last open call to infer path is brittle if multiple fds were used. Map file descriptors to paths for robustness.
Example:
const fdToPath = new Map<number, string>(); let nextFd = 100; const openSyncMock = vi.fn((p: string) => { const fd = nextFd++; fdToPath.set(fd, p); return fd; }); const readSyncMock = vi.fn((fd: number, buffer: Buffer, offset: number, length: number) => { const mockPath = fdToPath.get(fd) || ''; // ... choose content from mockPath ... });
228-231: Assert binaries weren’t text-readStrengthen the expectation to ensure binary files were skipped for text reads.
expect(readFileMock).toHaveBeenCalled(); expect(readFileMock.mock.calls.some(([p, opts]) => /img\.png$/.test(p) && opts?.encoding !== null)).toBe(false); expect(readFileMock.mock.calls.some(([p]) => /a\.md$/.test(p))).toBe(true); expect(readFileMock.mock.calls.some(([p]) => /b\.txt$/.test(p))).toBe(true);
239-242: Reset module registry after unmock to avoid cross-test bleedFollow unmock with resetModules so subsequent tests don’t retain cached mocked dependencies.
vi.doUnmock('node:fs'); vi.doUnmock('node:path'); vi.resetModules();
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.gitignore(1 hunks)src/core/mcp.ts(4 hunks)src/lib/store.ts(1 hunks)src/mcps/angular/tools/docs/index.ts(0 hunks)test/src/core/mcp.test.ts(2 hunks)
💤 Files with no reviewable changes (1)
- src/mcps/angular/tools/docs/index.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/core/mcp.ts
🧰 Additional context used
🧬 Code graph analysis (1)
src/lib/store.ts (1)
src/lib/log.ts (1)
log(76-128)
| vi.doMock('node:fs', () => ({ | ||
| readdirSync: readdirMock, | ||
| statSync: statMock, | ||
| readFileSync: readFileMock, | ||
| openSync: openSyncMock, | ||
| readSync: readSyncMock, | ||
| closeSync: closeSyncMock, | ||
| })); | ||
|
|
||
| vi.doMock('node:path', () => pathMock); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doMock after importing McpTool won’t affect it — reset modules and re-import
These tests mock node:fs/node:path with vi.doMock, but McpTool was imported earlier at file top. Since doMock isn’t hoisted, the mocks won’t be applied to the already-loaded module, causing flakiness and leaking real FS. Reset and re-import the SUT, then define classes against that import.
Minimal fix inside each test after doMock:
vi.resetModules();
const { McpTool: LocalMcpTool } = await import('../../../src/core/mcp');
// then: class DirTool extends LocalMcpTool { ... }Also applies to: 250-255, 289-294, 328-333
🤖 Prompt for AI Agents
In test/src/core/mcp.test.ts around lines 186 to 196, the test calls vi.doMock
after McpTool was already imported so the mocks won’t be applied to the loaded
module; after calling vi.doMock add vi.resetModules() and re-import McpTool from
'../../../src/core/mcp' (e.g. const { McpTool: LocalMcpTool } = await
import(...)) and then define any test-local subclasses against LocalMcpTool;
apply the same change at the other affected ranges (lines 250-255, 289-294,
328-333).
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores