fix(data-loader): keep entry with highest output_tokens during deduplication #771

wg-whm · 2025-12-28T05:33:46Z

Summary

Fix streaming artifact bug where duplicate transcript entries keep incorrect output_tokens count
Claude Code creates multiple entries per response during streaming - first entry has real token count, subsequent entries have low delta counts (1-3 tokens)
Changed deduplication logic to keep entry with highest output_tokens instead of first encountered
Updated test to reflect new expected behavior

Problem

When Claude Code streams responses, it creates multiple JSONL entries with the same messageId:requestId key. The first entry contains the accurate output token count, but later entries contain only the streaming delta (often 1-3 tokens). The previous deduplication logic kept whichever entry was encountered first, so the retained count depended on processing order rather than on which entry was accurate.
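To make the collision concrete, here is a minimal sketch of two streamed entries that resolve to the same key. The entry shape is limited to the fields that appear in this PR's test fixture, the token values are illustrative, and `dedupeKey` is a hypothetical helper name, not necessarily what data-loader.ts uses.

```typescript
// Hypothetical shape: only the fields relevant to deduplication.
type TranscriptEntry = {
  timestamp: string;
  requestId?: string;
  message: {
    id?: string;
    usage: { input_tokens: number; output_tokens: number };
  };
};

// Hypothetical helper: builds the "messageId:requestId" key described above.
// Returns null when either ID is missing, so such entries are never merged.
function dedupeKey(entry: TranscriptEntry): string | null {
  const messageId = entry.message.id;
  const requestId = entry.requestId;
  if (messageId == null || requestId == null) {
    return null;
  }
  return `${messageId}:${requestId}`;
}

// Two JSONL lines written for the same streamed response (illustrative values):
// one carries the full count, the other only a small delta.
const jsonlLines: string[] = [
  '{"timestamp":"2025-01-15T10:00:00Z","requestId":"req_456","message":{"id":"msg_123","usage":{"input_tokens":200,"output_tokens":100}}}',
  '{"timestamp":"2025-01-15T10:00:01Z","requestId":"req_456","message":{"id":"msg_123","usage":{"input_tokens":200,"output_tokens":2}}}',
];

const parsedEntries: TranscriptEntry[] = jsonlLines.map(
  (line) => JSON.parse(line) as TranscriptEntry,
);

// Both entries map to the same key, so a deduplicator must pick one of them.
const entryKeys: Array<string | null> = parsedEntries.map(dedupeKey);
```

Because the key carries no token information, any rule that picks a winner by position alone can keep the delta entry instead of the full count.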

Solution

Modified data-loader.ts to:

Track output_tokens for each unique hash during deduplication
Keep only the entry with the highest output_tokens value
Replace previously stored entries if a new entry has higher token count
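The three steps above can be sketched as a single pass over the entries. This is a simplified illustration, not the code in data-loader.ts: the `UsageEntry` shape and the `keepHighestOutputTokens` name are assumptions made for the example.

```typescript
// Hypothetical, flattened entry shape used only for this sketch.
type UsageEntry = {
  messageId: string;
  requestId: string;
  outputTokens: number;
};

// Keeps, for each messageId:requestId key, the entry with the highest
// outputTokens. On a tie, the entry seen first is retained.
function keepHighestOutputTokens(entries: readonly UsageEntry[]): UsageEntry[] {
  const best = new Map<string, UsageEntry>();
  for (const entry of entries) {
    const key = `${entry.messageId}:${entry.requestId}`;
    const current = best.get(key);
    // Store the entry if the key is new, or replace a lower-token entry.
    if (current == null || entry.outputTokens > current.outputTokens) {
      best.set(key, entry);
    }
  }
  return Array.from(best.values());
}
```

With this rule the retained token count for a key no longer depends on the order in which files or lines are read, which is what the renamed test in the test plan checks.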

Test plan

All 325 tests pass
Format and typecheck pass
Updated "should process files in chronological order" test to "should keep entry with highest output_tokens regardless of file order"

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Improved duplicate-entry handling during data loading: when duplicates occur, the system now keeps the entry with the highest output-token count and removes lower-priority duplicates consistently across daily, session, and block aggregations.
Tests
- Added/updated tests to validate deduplication and replacement behavior, including cases where later higher-token entries supersede earlier ones.

Claude Code creates multiple transcript entries per response during streaming. The first entry often has a low output_tokens count (1-3) while a subsequent entry has the correct cumulative count. The previous deduplication logic kept the first entry encountered, resulting in inaccurate token counts.

This fix modifies the deduplication to:
- Track both the entry index and output_tokens for each message+request hash
- When a duplicate is found with higher output_tokens, replace the old entry
- Filter out replaced entries after processing

Fixes streaming artifact causing incorrect output token reporting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Update test assertion to expect the entry with highest output_tokens to be kept, rather than the chronologically first entry. This aligns with the streaming artifact fix that correctly preserves accurate token counts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

coderabbitai · 2025-12-28T05:33:56Z

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Introduces Map-based per-entry deduplication in the ccusage data loader: entries are hashed, tracked with index and output_tokens, lower-token duplicates are replaced by later higher-token entries, and superseded entries are removed before daily, session, and block aggregations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Deduplication core `apps/ccusage/src/data-loader.ts` | Adds `DedupeEntry` type and `Map<string, DedupeEntry>` tracking; introduces `shouldSkipEntry`, `markAsProcessed`, and `indicesToRemove` logic to prefer entries with higher `output_tokens` and replace older entries. |
| Data loading paths `apps/ccusage/src/data-loader.ts` | Applies deduplication to `loadDailyUsageData`, `loadSessionData`, and `loadSessionBlockData`; switches grouping and aggregation to operate on deduped entries. |
| Tests `apps/ccusage/test/...` | Updates/extends tests to assert "highest output_tokens priority", replacement of lower-token entries by later higher-token ones, and deduplication correctness across daily/session/block contexts. |
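The index-tracking variant summarized in the table might look roughly like the sketch below. The names `DedupeEntry` and `indicesToRemove` come from the summary, but their fields and the surrounding function are assumptions: this page does not show the actual definitions, and `LoadedEntry` and `dedupeByIndex` are hypothetical names.

```typescript
// Assumed fields: the summary names DedupeEntry but does not show its shape.
type DedupeEntry = { index: number; outputTokens: number };

// Hypothetical loaded entry: a precomputed hash (null when IDs are missing)
// plus the output token count.
type LoadedEntry = { hash: string | null; outputTokens: number };

function dedupeByIndex(entries: readonly LoadedEntry[]): LoadedEntry[] {
  const processed = new Map<string, DedupeEntry>();
  const indicesToRemove = new Set<number>();

  entries.forEach((entry, index) => {
    if (entry.hash == null) {
      return; // No key available: keep the entry untouched.
    }
    const existing = processed.get(entry.hash);
    if (existing == null) {
      processed.set(entry.hash, { index, outputTokens: entry.outputTokens });
      return;
    }
    if (entry.outputTokens > existing.outputTokens) {
      // The later entry wins: mark the stored one for removal and replace it.
      indicesToRemove.add(existing.index);
      processed.set(entry.hash, { index, outputTokens: entry.outputTokens });
    } else {
      // The stored entry wins (or ties): drop the current duplicate.
      indicesToRemove.add(index);
    }
  });

  // Superseded entries are filtered out once, after the scan completes.
  return entries.filter((_, index) => !indicesToRemove.has(index));
}
```

Tracking indices instead of entry objects lets the loader collect entries in a single array first and remove superseded ones in one filter pass before aggregation, at the cost of holding the extra index set in memory.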

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

fix(ccusage): use streaming to handle large JSONL files #706 — Touches the same loading functions (loadDailyUsageData, loadSessionData, loadSessionBlockData); overlaps with dedup logic changes.
fix: implement chronological deduplication for branched conversations #58 — Implements a different dedupe strategy in the same file (chronological/message-id based), potentially conflicting with the token-priority approach.

Suggested reviewers

ryoppippi
skylinesales

Poem

🐰 I hop through rows and hashes bright,
Sniffing tokens left and right,
Older duplicates I gently toss,
Higher counts get topmost gloss,
Crunching data — nibble, delight! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: implementing deduplication logic that retains entries with the highest output_tokens count. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab7bb44 and c4abe62.

📒 Files selected for processing (1)

apps/ccusage/src/data-loader.ts

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

apps/ccusage/src/data-loader.ts (1)

4460-4501: Incorrect fixture structure will cause test to fail.

The test fixture places JSONL files directly under projects/ but globUsageFiles expects the pattern projects/**/session/*.jsonl (or similar nested structure). Comparing with other tests (e.g., lines 4416-4443), the correct structure should be:

 await using fixture = await createFixture({
   projects: {
-    'newer.jsonl': JSON.stringify({
+    project1: {
+      session1: {
+        'newer.jsonl': JSON.stringify({
           timestamp: '2025-01-15T10:00:00Z',
           message: {
             id: 'msg_123',
             usage: {
               input_tokens: 200,
               output_tokens: 100,
             },
           },
           requestId: 'req_456',
           costUSD: 0.002,
         }),
-    'older.jsonl': JSON.stringify({
+      },
+      session2: {
+        'older.jsonl': JSON.stringify({
           timestamp: '2025-01-10T10:00:00Z',
           message: {
             id: 'msg_123',
             usage: {
               input_tokens: 100,
               output_tokens: 50,
             },
           },
           requestId: 'req_456',
           costUSD: 0.001,
         }),
+      },
+    },
   },
 });

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6335626 and 7073a37.

📒 Files selected for processing (1)

apps/ccusage/src/data-loader.ts

🧰 Additional context used

📓 Path-based instructions (7)

apps/ccusage/src/**/*.ts

📄 CodeRabbit inference engine (apps/ccusage/CLAUDE.md)

apps/ccusage/src/**/*.ts: Write tests in-source using if (import.meta.vitest != null) blocks instead of separate test files
Use Vitest globals (describe, it, expect) without imports in test blocks
In tests, use current Claude 4 models (sonnet-4, opus-4)
Use fs-fixture with createFixture() to simulate Claude data in tests
Only export symbols that are actually used by other modules
Do not use console.log; use the logger utilities from src/logger.ts instead

Files:

apps/ccusage/src/data-loader.ts

apps/ccusage/**/*.ts

📄 CodeRabbit inference engine (apps/ccusage/CLAUDE.md)

apps/ccusage/**/*.ts: NEVER use await import() dynamic imports anywhere (especially in tests)
Prefer @praha/byethrow Result type for error handling instead of try-catch
Use .ts extensions for local imports (e.g., import { foo } from './utils.ts')

Files:

apps/ccusage/src/data-loader.ts

**/*.{ts,tsx,js,jsx}