add binary file detection #3

stewones · 2025-09-08T18:41:59Z

Summary by CodeRabbit

New Features
- Directory ingestion now automatically skips binary/non-text files to avoid ingesting non-text content.
Bug Fixes
- Improved robustness of file-reading during ingestion and enhanced logging to include execution context and totals.
Tests
- Updated tests to validate binary-detection behavior and ensure proper test cleanup.
Chores
- Bumped package version to 0.3.1.
- Removed obsolete commented coverage-report steps from CI.
- Broadened .gitignore to ignore all .sqlite files and removed an unused spec field.

codecov · 2025-09-08T18:42:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (e74cb55) to head (4963770).

Additional details and impacted files

@@            Coverage Diff            @@
##              main        #3   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         2180      2185    +5     
  Branches       434       442    +8     
=========================================
+ Hits          2180      2185    +5

Flag	Coverage Δ
unittests	`100.00% <100.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/core/mcp.ts	`100.00% <100.00%> (ø)`
src/lib/store.ts	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e74cb55...4963770. Read the comment docs.

coderabbitai · 2025-09-08T18:55:21Z

Walkthrough

Replaces extension-based file filtering with a heuristic binary detector and asynchronous recursive traversal in McpTool; updates tests to mock binary reads and path behavior; bumps package version to 0.3.1; removes two commented coverage-report steps from CI workflow; expands .gitignore sqlite pattern and enhances ingestion log messages; removes sourceId from an Angular docs tool spec.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI workflow cleanup `.github/workflows/ci.yml` | Removed two commented coverage-report steps (markdown preparation and PR comment); retained active Codecov upload. |
| Version bump `package.json` | Package version updated from `0.3.0` to `0.3.1`. |
| Core: binary-safe directory ingestion `src/core/mcp.ts` | Added `isBinaryFile` heuristic (reads first 8KB; null-byte and non-printable ratio checks). Rewrote `fetchFromDirectory` to an async recursive walk that excludes detected binaries and reads non-binary files as UTF-8. Minor `ServerResult` typing formatting changes. |
| Tests updated for binary detection `test/src/core/mcp.test.ts` | Reworked fs/path mocks to simulate binary vs text reads (supports Buffer returns when encoding is `null`), added `node:path` mock, added open/read/close mocks for readSync usage, adjusted assertions and cleanup unmocking. |
| Ignore patterns `.gitignore` | Changed `.sqlite` ignore pattern to `*.sqlite` to ignore all SQLite files by extension. |
| Logging enhancements `src/lib/store.ts` | Augmented two log statements in `SqliteEmbedStore.ingest` to include `[mcpId/toolId]` context and chunk totals in messages. |
| Spec field removal `src/mcps/angular/tools/docs/index.ts` | Removed `sourceId: 'angular-llm-context'` property from the `spec` object. |
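
For readers who want to see the shape of the heuristic summarized for `src/core/mcp.ts`, here is a minimal sketch of a null-byte plus non-printable-ratio check. It is not the PR's actual code: the function name `looksBinary`, the 30% threshold, and the exact definition of "printable" are assumptions made for illustration, and the function works on an already-read chunk rather than opening the file itself.

```typescript
// Hypothetical helper: the name, the 0.3 threshold, and the printable-byte
// rule are illustrative assumptions, not taken from the PR.
export function looksBinary(chunk: Uint8Array, threshold = 0.3): boolean {
  // An empty file has no evidence of binary content; treat it as text.
  if (chunk.length === 0) return false;

  let nonPrintable = 0;
  for (let i = 0; i < chunk.length; i++) {
    const byte = chunk[i];
    // A single null byte is treated as a strong binary signal.
    if (byte === 0) return true;
    const isWhitespace = byte === 9 || byte === 10 || byte === 13; // tab, LF, CR
    const isPrintableAscii = byte >= 32 && byte <= 126;
    if (!isWhitespace && !isPrintableAscii) nonPrintable++;
  }
  // Flag the chunk when too large a share of its bytes is non-printable.
  return nonPrintable / chunk.length > threshold;
}
```

A caller would pass at most the first 8KB of a file. Because this sketch counts every byte above 126 as non-printable, text that is mostly non-ASCII UTF-8 can exceed the threshold and be flagged as binary.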

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant McpTool
  participant FS as FileSystem

  Caller->>McpTool: fetchFromDirectory(contextDir)
  activate McpTool
  loop recursive directory walk
    McpTool->>FS: read first 8KB (no encoding)
    FS-->>McpTool: Buffer
    alt binary detected (null byte or high non-printable ratio)
      McpTool-->>McpTool: skip file
    else non-binary
      McpTool->>FS: readFile(file, "utf-8")
      FS-->>McpTool: text content
      McpTool-->>McpTool: append content block
    end
  end
  McpTool-->>Caller: return ServerResult with collected docs
  deactivate McpTool

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hopped through folders, sniffing bytes and light,
Found sneaky nulls hiding from my sight 🐇
Async paws skip the binary treat,
Only UTF-8 crumbs make it to my seat.
CI trimmed, version bumped—quiet delight.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

test/src/core/mcp.test.ts (1)
66-71: Fix meta assertion: code now includes dir in ingest meta

init() passes meta.dir (undefined when no contextDir). The exact-match assertion will fail. Allow extra keys or include dir: undefined.
-		expect(ingestSpy).toHaveBeenCalledWith(
-			{ id: 'source-1', meta: { name: 'Bar-MCP', url: 'http://example.com', file: undefined } },
+		expect(ingestSpy).toHaveBeenCalledWith(
+			{ id: 'source-1', meta: expect.objectContaining({ name: 'Bar-MCP', url: 'http://example.com', file: undefined }) },
 			['c1', 'c2'],
 			{ mcpId: 'foo', toolId: 'bar' }
 		);
If you prefer strict equality, add dir: undefined to the expected meta instead.

🧹 Nitpick comments (4)

src/core/mcp.ts (3)
257-261: Call new isBinaryFile signature

If you implement the refactor above, adjust the call to drop the readFileSync argument.
-							const isBinary = await this.isBinaryFile(full, readFileSync);
+							const isBinary = await this.isBinaryFile(full);
243-268: Skip common heavy folders during walk

To avoid needless traversal and I/O (node_modules, .git, dist, build, coverage, etc.), add a small ignore set.
-			const walk = async (dir: string) => {
+			const walk = async (dir: string) => {
+				const IGNORE = new Set(['node_modules', '.git', 'dist', 'build', 'coverage']);
 				let entries: string[] = [];
 				try {
 					entries = readdirSync(dir);
 				} catch {
 					return;
 				}
 				for (const entry of entries) {
+					if (IGNORE.has(entry)) continue;
 					const full = pathMod.join(dir, entry);
274-275: Consider debug logging on read failures

Silently swallowing per-file read errors can hide actionable issues (permissions, encoding). Emit a debug-level log with the path once per failure.
test/src/core/mcp.test.ts (1)

203-205: Move unmocking to afterEach for isolation

To avoid accidental leakage if future assertions throw before cleanup, consider unmocking in an afterEach hook.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e74cb55 and 55e7d94.

📒 Files selected for processing (4)

.github/workflows/ci.yml (0 hunks)
package.json (1 hunks)
src/core/mcp.ts (4 hunks)
test/src/core/mcp.test.ts (2 hunks)

💤 Files with no reviewable changes (1)

.github/workflows/ci.yml

🔇 Additional comments (4)

package.json (1)

3-3: Version bump looks good

No other changes detected here.

src/core/mcp.ts (1)

338-345: LGTM: explicit ServerResult typing

The explicit type annotation is clear and safe; appending prompt later is fine.

test/src/core/mcp.test.ts (2)

116-136: Good binary-detection mocking

Clear separation between binary-probe reads (encoding: null) and text reads.

138-152: Path mock is adequate for the test scope

Join/relative/dirname implementations are minimal and sufficient here.

src/core/mcp.ts

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/core/mcp.ts (1)
259-261: Fix signature mismatch: remove extra argument to isBinaryFile
Call site still passes readFileSync, but isBinaryFile now accepts only (filePath). This is a TS error.
-							const isBinary = await this.isBinaryFile(full, readFileSync);
+							const isBinary = await this.isBinaryFile(full);
Run to ensure no other call sites pass extra args:
#!/bin/bash
rg -nP -C1 'isBinaryFile\s*\([^,\n)]+,\s*' --type=ts

🧹 Nitpick comments (3)

src/core/mcp.ts (3)
211-227: Heuristic may misclassify valid UTF‑8 as binary
Counting bytes >126 as “non‑printable” penalizes multi‑byte UTF‑8 text (non‑ASCII languages). Consider a quick UTF‑8 sanity check (e.g., attempt decode and detect U+FFFD rate) before concluding “binary.” I can draft a lightweight check if you want.
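
One way the suggested UTF-8 sanity check could look, as a sketch only: decode the probe chunk non-fatally and measure how often the decoder had to substitute U+FFFD. The helper name `isLikelyUtf8Text` and the 1% tolerance are assumptions for illustration. `TextDecoder` is the standard WHATWG decoder available globally in Node.

```typescript
// Hypothetical helper; the name and the 1% tolerance are assumptions.
export function isLikelyUtf8Text(
  chunk: Uint8Array,
  maxReplacementRatio = 0.01,
): boolean {
  if (chunk.length === 0) return true;

  // With fatal: false, invalid byte sequences decode to U+FFFD
  // instead of throwing.
  const decoded = new TextDecoder('utf-8', { fatal: false }).decode(chunk);

  let replacements = 0;
  for (let i = 0; i < decoded.length; i++) {
    // U+FFFD is in the Basic Multilingual Plane, so one UTF-16 unit.
    if (decoded.charCodeAt(i) === 0xfffd) replacements++;
  }
  // A small tolerance absorbs a multi-byte character cut off at the
  // end of the probe chunk.
  return replacements / decoded.length <= maxReplacementRatio;
}
```

Running this before the non-printable ratio check would let valid multi-byte text pass without being penalized for bytes above 126. A file that legitimately contains many U+FFFD characters would still be rejected, which seems an acceptable trade-off for a cheap probe.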

239-242: Build paths with path.join for cross‑platform correctness
Avoid string concatenation for paths.
-			const dirToRead = `${baseDir}/${this.spec.contextDir}`;
-			const { readdirSync, statSync, readFileSync } = await import('node:fs');
-			const pathMod = await import('node:path');
+			const pathMod = await import('node:path');
+			const dirToRead = pathMod.join(baseDir, this.spec.contextDir!);
+			const { readdirSync, statSync, readFileSync } = await import('node:fs');
254-257: Skip symlinks to avoid cycles and surprises
Following symlinks can create loops or escape the intended dir. Prefer lstatSync and skip symbolic links.
-						const st = statSync(full);
+						const st = lstatSync(full);
+						if (st.isSymbolicLink?.()) {
+							continue;
+						}
Also update the import:
-			const { readdirSync, statSync, readFileSync } = await import('node:fs');
+			const { readdirSync, lstatSync, readFileSync } = await import('node:fs');
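
Putting the traversal suggestions from these reviews together (an ignore set, `path.join`, and `lstatSync` with a symlink skip), a walk might look like the sketch below. It is a simplified stand-in, not the PR's `fetchFromDirectory`: the `FsLike` interface and `listFiles` name are hypothetical, and the file-system functions are injected so the walk can be exercised with an in-memory fake.

```typescript
import { join } from 'node:path';

// Hypothetical minimal surface of the fs functions the walk needs.
interface FsLike {
  readdirSync(dir: string): string[];
  lstatSync(path: string): {
    isSymbolicLink(): boolean;
    isDirectory(): boolean;
    isFile(): boolean;
  };
}

// Folder names taken from the reviewer's suggested ignore set.
const IGNORE = new Set(['node_modules', '.git', 'dist', 'build', 'coverage']);

export function listFiles(dir: string, fs: FsLike, out: string[] = []): string[] {
  let entries: string[] = [];
  try {
    entries = fs.readdirSync(dir);
  } catch {
    return out; // unreadable directory: skip it
  }
  for (const entry of entries) {
    if (IGNORE.has(entry)) continue;
    const full = join(dir, entry);
    try {
      const st = fs.lstatSync(full);
      // lstat does not follow links, so cycles and escapes are avoided.
      if (st.isSymbolicLink()) continue;
      if (st.isDirectory()) listFiles(full, fs, out);
      else if (st.isFile()) out.push(full);
    } catch {
      continue; // entry vanished or is unreadable: skip it
    }
  }
  return out;
}
```

In real use the `fs` argument would be an object exposing Node's `readdirSync` and `lstatSync`. Injecting it keeps the walk testable without `vi.doMock`, at the cost of one extra parameter.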

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55e7d94 and f41ff63.

📒 Files selected for processing (1)

src/core/mcp.ts (4 hunks)

🔇 Additional comments (1)

src/core/mcp.ts (1)

339-346: LGTM on the ServerResult construction
Type annotation and payload wiring look correct.

coderabbitai · 2025-09-08T19:28:01Z

src/core/mcp.ts

+	private async isBinaryFile(filePath: string): Promise<boolean> {
+		const { openSync, readSync, closeSync } = await import('node:fs');
+		try {
+			// Read first 8KB only
+			const fd = openSync(filePath, 'r');
+			const buffer = Buffer.allocUnsafe(8192);
+			const bytesRead = readSync(fd, buffer, 0, 8192, 0);
+			closeSync(fd);
+			const chunk = buffer.subarray(0, bytesRead);
+


⚠️ Potential issue

Close the file descriptor on all paths (use finally)
If readSync throws, closeSync isn’t reached, leaking the fd. Wrap closeSync in a finally and guard it.

-	private async isBinaryFile(filePath: string): Promise<boolean> {
-		const { openSync, readSync, closeSync } = await import('node:fs');
-		try {
-			// Read first 8KB only
-			const fd = openSync(filePath, 'r');
-			const buffer = Buffer.allocUnsafe(8192);
-			const bytesRead = readSync(fd, buffer, 0, 8192, 0);
-			closeSync(fd);
-			const chunk = buffer.subarray(0, bytesRead);
+	private async isBinaryFile(filePath: string): Promise<boolean> {
+		const { openSync, readSync, closeSync } = await import('node:fs');
+		let fd: number | undefined;
+		try {
+			// Read first 8KB only
+			fd = openSync(filePath, 'r');
+			const buffer = Buffer.allocUnsafe(8192);
+			const bytesRead = readSync(fd, buffer, 0, 8192, 0);
+			const chunk = buffer.subarray(0, bytesRead);
@@
-		} catch {
-			// If we can't read the file, assume it's binary to be safe
-			return true;
-		}
+		} catch {
+			// If we can't read the file, assume it's binary to be safe
+			return true;
+		} finally {
+			if (fd !== undefined) {
+				try { closeSync(fd); } catch {}
+			}
+		}

Also applies to: 228-231

🤖 Prompt for AI Agents

In src/core/mcp.ts around lines 196 to 205 the file descriptor opened with openSync may not be closed if readSync throws; change the function to ensure closeSync is always called by moving closeSync into a finally block (track the fd variable outside try, check fd !== undefined before closing) so the descriptor is closed on success and on error. Apply the same pattern to the related code at lines 228-231 to guard against fd leaks.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (6)

.gitignore (1)
19-19: Ignore SQLite sidecar files too

Add WAL/SHM/journal patterns so devs don’t accidentally commit SQLite sidecars.
 *.sqlite
+*.sqlite-wal
+*.sqlite-shm
+*.sqlite-journal
src/lib/store.ts (1)
176-179: Make progress logs 1-based (friendlier) and less ambiguous

Current logs show 0-based chunk numbers. Consider 1-based display; also “Finished ingestion with X chunks” can be read as inserted count when it’s the processed index. Keep as-is if you prefer, but this improves clarity.
- log.warn(`[${mcpId}/${toolId}] Embedding chunk ${idx} of ${chunks.length}`);
+ log.warn(`[${mcpId}/${toolId}] Embedding chunk ${idx + 1} of ${chunks.length}`);

- log.warn(`[${mcpId}/${toolId}] Inserting chunk ${idx} of ${chunks.length}`);
+ log.warn(`[${mcpId}/${toolId}] Inserting chunk ${idx + 1} of ${chunks.length}`);
Optional: also log “processed vs inserted” totals at the end to reflect skips due to hash matches.
test/src/core/mcp.test.ts (4)
116-137: Drop readFileSync “binary detection” branch to match implementation

Binary sniffing is now via openSync/readSync; keeping a readFileSync({ encoding: null }) path adds confusion. Simplify to text-read paths only here.

153-185: Track fd→path instead of peeking last open call

Using the last open call to infer path is brittle if multiple fds were used. Map file descriptors to paths for robustness.

Example:
const fdToPath = new Map<number, string>();
let nextFd = 100;

const openSyncMock = vi.fn((p: string) => {
  const fd = nextFd++;
  fdToPath.set(fd, p);
  return fd;
});

const readSyncMock = vi.fn((fd: number, buffer: Buffer, offset: number, length: number) => {
  const mockPath = fdToPath.get(fd) || '';
  // ... choose content from mockPath ...
});
228-231: Assert binaries weren’t text-read

Strengthen the expectation to ensure binary files were skipped for text reads.
expect(readFileMock).toHaveBeenCalled();
expect(readFileMock.mock.calls.some(([p, opts]) => /img\.png$/.test(p) && opts?.encoding !== null)).toBe(false);
expect(readFileMock.mock.calls.some(([p]) => /a\.md$/.test(p))).toBe(true);
expect(readFileMock.mock.calls.some(([p]) => /b\.txt$/.test(p))).toBe(true);
239-242: Reset module registry after unmock to avoid cross-test bleed

Follow unmock with resetModules so subsequent tests don’t retain cached mocked dependencies.
vi.doUnmock('node:fs');
vi.doUnmock('node:path');
vi.resetModules();

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f41ff63 and 4963770.

📒 Files selected for processing (5)

.gitignore (1 hunks)
src/core/mcp.ts (4 hunks)
src/lib/store.ts (1 hunks)
src/mcps/angular/tools/docs/index.ts (0 hunks)
test/src/core/mcp.test.ts (2 hunks)

💤 Files with no reviewable changes (1)

src/mcps/angular/tools/docs/index.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/core/mcp.ts

🧰 Additional context used

🧬 Code graph analysis (1)

src/lib/store.ts (1)

src/lib/log.ts (1)

log (76-128)

coderabbitai · 2025-09-08T20:06:51Z

test/src/core/mcp.test.ts

 		vi.doMock('node:fs', () => ({
 			readdirSync: readdirMock,
 			statSync: statMock,
 			readFileSync: readFileMock,
+			openSync: openSyncMock,
+			readSync: readSyncMock,
+			closeSync: closeSyncMock,
 		}));

+		vi.doMock('node:path', () => pathMock);
+


⚠️ Potential issue

doMock after importing McpTool won’t affect it — reset modules and re-import

These tests mock node:fs/node:path with vi.doMock, but McpTool was imported earlier at file top. Since doMock isn’t hoisted, the mocks won’t be applied to the already-loaded module, causing flakiness and leaking real FS. Reset and re-import the SUT, then define classes against that import.

Minimal fix inside each test after doMock:

vi.resetModules();
const { McpTool: LocalMcpTool } = await import('../../../src/core/mcp');
// then: class DirTool extends LocalMcpTool { ... }

Also applies to: 250-255, 289-294, 328-333

🤖 Prompt for AI Agents

In test/src/core/mcp.test.ts around lines 186 to 196, the test calls vi.doMock after McpTool was already imported so the mocks won’t be applied to the loaded module; after calling vi.doMock add vi.resetModules() and re-import McpTool from '../../../src/core/mcp' (e.g. const { McpTool: LocalMcpTool } = await import(...)) and then define any test-local subclasses against LocalMcpTool; apply the same change at the other affected ranges (lines 250-255, 289-294, 328-333).

add binary file detection

9ea24d2

fix coverage

55e7d94

coderabbitai bot reviewed Sep 8, 2025

apply coderabbit suggestion

f41ff63

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot reviewed Sep 8, 2025

handle binary detection more gracefully

4963770

coderabbitai bot reviewed Sep 8, 2025
