-
-
Notifications
You must be signed in to change notification settings - Fork 312
fix(data-loader): keep entry with highest output_tokens during deduplication #771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Claude Code creates multiple transcript entries per response during streaming. The first entry often has a low output_tokens count (1-3) while a subsequent entry has the correct cumulative count. The previous deduplication logic kept the first entry encountered, resulting in inaccurate token counts. This fix modifies the deduplication to: - Track both the entry index and output_tokens for each message+request hash - When a duplicate is found with higher output_tokens, replace the old entry - Filter out replaced entries after processing Fixes streaming artifact causing incorrect output token reporting. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Update test assertion to expect the entry with highest output_tokens to be kept, rather than the chronologically first entry. This aligns with the streaming artifact fix that correctly preserves accurate token counts. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
Caution Review failedThe pull request is closed. 📝 WalkthroughWalkthroughIntroduces Map-based per-entry deduplication in the ccusage data loader: entries are hashed, tracked with index and output_tokens, lower-token duplicates are replaced by later higher-token entries, and superseded entries are removed before daily, session, and block aggregations. (≤50 words) Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/ccusage/src/data-loader.ts (1)
4460-4501: Incorrect fixture structure will cause test to fail.The test fixture places JSONL files directly under
projects/butglobUsageFilesexpects the patternprojects/**/session/*.jsonl(or similar nested structure). Comparing with other tests (e.g., lines 4416-4443), the correct structure should be:await using fixture = await createFixture({ projects: { - 'newer.jsonl': JSON.stringify({ + project1: { + session1: { + 'newer.jsonl': JSON.stringify({ timestamp: '2025-01-15T10:00:00Z', message: { id: 'msg_123', usage: { input_tokens: 200, output_tokens: 100, }, }, requestId: 'req_456', costUSD: 0.002, }), - 'older.jsonl': JSON.stringify({ + }, + session2: { + 'older.jsonl': JSON.stringify({ timestamp: '2025-01-10T10:00:00Z', message: { id: 'msg_123', usage: { input_tokens: 100, output_tokens: 50, }, }, requestId: 'req_456', costUSD: 0.001, }), + }, + }, }, });
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
apps/ccusage/src/data-loader.ts
🧰 Additional context used
📓 Path-based instructions (7)
apps/ccusage/src/**/*.ts
📄 CodeRabbit inference engine (apps/ccusage/CLAUDE.md)
apps/ccusage/src/**/*.ts: Write tests in-source usingif (import.meta.vitest != null)blocks instead of separate test files
Use Vitest globals (describe,it,expect) without imports in test blocks
In tests, use current Claude 4 models (sonnet-4, opus-4)
Usefs-fixturewithcreateFixture()to simulate Claude data in tests
Only export symbols that are actually used by other modules
Do not use console.log; use the logger utilities fromsrc/logger.tsinstead
Files:
apps/ccusage/src/data-loader.ts
apps/ccusage/**/*.ts
📄 CodeRabbit inference engine (apps/ccusage/CLAUDE.md)
apps/ccusage/**/*.ts: NEVER useawait import()dynamic imports anywhere (especially in tests)
Prefer@praha/byethrowResult type for error handling instead of try-catch
Use.tsextensions for local imports (e.g.,import { foo } from './utils.ts')
Files:
apps/ccusage/src/data-loader.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx,js,jsx}: Use ESLint for linting and formatting with tab indentation and double quotes
No console.log allowed except where explicitly disabled with eslint-disable; use logger.ts instead
Use file paths with Node.js path utilities for cross-platform compatibility
Use variables starting with lowercase (camelCase) for variable names
Can use UPPER_SNAKE_CASE for constants
Files:
apps/ccusage/src/data-loader.ts
**/*.ts{,x}
📄 CodeRabbit inference engine (CLAUDE.md)
Use TypeScript with strict mode and bundler module resolution
Files:
apps/ccusage/src/data-loader.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx}: Use.tsextensions for local file imports (e.g.,import { foo } from './utils.ts')
Prefer @praha/byethrow Result type over traditional try-catch for functional error handling
UseResult.try()for wrapping operations that may throw (JSON parsing, etc.)
UseResult.isFailure()for checking errors (more readable than!Result.isSuccess())
Use early return pattern (if (Result.isFailure(result)) continue;) instead of ternary operators when checking Results
Keep traditional try-catch only for file I/O with complex error handling or legacy code that's hard to refactor
Always useResult.isFailure()andResult.isSuccess()type guards for better code clarity
Use uppercase (PascalCase) for type names
Only export constants, functions, and types that are actually used by other modules - internal constants used only within the same file should NOT be exported
In-source testing pattern: write tests directly in source files usingif (import.meta.vitest != null)blocks
CRITICAL: DO NOT useawait import()dynamic imports anywhere in the codebase - this causes tree-shaking issues
CRITICAL: Never use dynamic imports withawait import()in vitest test blocks - this is particularly problematic for test execution
Vitest globals (describe,it,expect) are enabled and available without imports since globals are configured
Create mock data usingfs-fixturewithcreateFixture()for Claude data directory simulation in tests
All test files must use current Claude 4 models (claude-sonnet-4-20250514, claude-opus-4-20250514), not outdated Claude 3 models
Model names in tests must exactly match LiteLLM's pricing database entries
Files:
apps/ccusage/src/data-loader.ts
**/data-loader.ts
📄 CodeRabbit inference engine (CLAUDE.md)
Silently skip malformed JSONL lines during parsing in data loading operations
Files:
apps/ccusage/src/data-loader.ts
**/*.{ts,tsx,json}
📄 CodeRabbit inference engine (CLAUDE.md)
Claude model naming convention:
claude-{model-type}-{generation}-{date}(e.g.,claude-sonnet-4-20250514, NOTclaude-4-sonnet-20250514)
Files:
apps/ccusage/src/data-loader.ts
🧠 Learnings (6)
📚 Learning: 2025-09-18T16:07:16.293Z
Learnt from: CR
Repo: ryoppippi/ccusage PR: 0
File: apps/codex/CLAUDE.md:0-0
Timestamp: 2025-09-18T16:07:16.293Z
Learning: --json outputs structured results with aggregated tokens and USD cost included in totals
Applied to files:
apps/ccusage/src/data-loader.ts
📚 Learning: 2025-09-18T16:07:16.293Z
Learnt from: CR
Repo: ryoppippi/ccusage PR: 0
File: apps/codex/CLAUDE.md:0-0
Timestamp: 2025-09-18T16:07:16.293Z
Learning: Each line in session JSONL is an event_msg with payload.type === "token_count"
Applied to files:
apps/ccusage/src/data-loader.ts
📚 Learning: 2025-11-25T14:42:34.734Z
Learnt from: CR
Repo: ryoppippi/ccusage PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-25T14:42:34.734Z
Learning: Applies to **/data-loader.ts : Silently skip malformed JSONL lines during parsing in data loading operations
Applied to files:
apps/ccusage/src/data-loader.ts
📚 Learning: 2025-09-18T16:07:16.293Z
Learnt from: CR
Repo: ryoppippi/ccusage PR: 0
File: apps/codex/CLAUDE.md:0-0
Timestamp: 2025-09-18T16:07:16.293Z
Learning: Tests should use fs-fixture with using to ensure cleanup
Applied to files:
apps/ccusage/src/data-loader.ts
📚 Learning: 2025-09-18T16:06:37.474Z
Learnt from: CR
Repo: ryoppippi/ccusage PR: 0
File: apps/ccusage/CLAUDE.md:0-0
Timestamp: 2025-09-18T16:06:37.474Z
Learning: Applies to apps/ccusage/src/**/*.ts : Use `fs-fixture` with `createFixture()` to simulate Claude data in tests
Applied to files:
apps/ccusage/src/data-loader.ts
📚 Learning: 2025-11-25T14:42:34.734Z
Learnt from: CR
Repo: ryoppippi/ccusage PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-25T14:42:34.734Z
Learning: Applies to **/*.{ts,tsx} : Create mock data using `fs-fixture` with `createFixture()` for Claude data directory simulation in tests
Applied to files:
apps/ccusage/src/data-loader.ts
🧬 Code graph analysis (1)
apps/ccusage/src/data-loader.ts (1)
apps/ccusage/src/_session-blocks.ts (1)
identifySessionBlocks(90-154)
🔇 Additional comments (6)
apps/ccusage/src/data-loader.ts (6)
485-491: LGTM!The
DedupeEntrytype is well-defined and minimal for tracking both position and token count during deduplication.
493-513: LGTM!The logic correctly determines whether to skip an entry: returns
trueonly when a duplicate exists with higher or equaloutput_tokens, ensuring entries with strictly higher token counts are processed.
515-535: LGTM!The replacement logic correctly handles the case where a later entry has higher
output_tokens: it returns the old entry's index for removal and updates the map with the new entry. The interaction withshouldSkipEntryis well-coordinated.
803-865: LGTM!The deduplication integration is correctly implemented:
currentIndexis captured before pushing, correctly identifying the entry's future position- The filtering step cleanly removes replaced entries
- Grouping and aggregation now operate on deduplicated data
964-1042: LGTM!The deduplication pattern is consistently applied to session data loading, mirroring the implementation in
loadDailyUsageData.
1404-1468: LGTM!The deduplication pattern is correctly applied to session block loading, with deduplicated entries properly passed to
identifySessionBlocks.
Address CodeRabbit review feedback - fixture should use projects/project/session/file.jsonl structure to match the glob pattern used by loadDailyUsageData. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
This reverts commit ab7bb44.
Summary
Problem
When Claude Code streams responses, it creates multiple JSONL entries with the same
messageId:requestId. The first entry contains the accurate output token count, but later entries contain only the streaming delta (often 1-3 tokens). The previous deduplication logic kept whichever entry was encountered first, which was essentially random.Solution
Modified
data-loader.tsto:output_tokensfor each unique hash during deduplicationoutput_tokensvalueTest plan
🤖 Generated with Claude Code
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.