
@zoroglucihat commented Aug 22, 2025

Add StreamingParser for memory-efficient CSV parsing of large files

What?

This PR introduces a new StreamingParser class for the k6/experimental/csv module that enables memory-efficient parsing of large CSV files without loading them entirely into memory.

Key additions:

  • StreamingParser: New parser class that takes file paths instead of File objects
  • StreamingReader: Underlying reader with 64KB buffer for streaming file processing
  • Direct OS filesystem access: Bypasses k6's internal file caching that loads entire files
  • Identical API: Same interface as existing Parser for seamless migration
  • Comprehensive tests: Full test suite covering all streaming functionality

Why?

Problem: The current CSV parser causes Out-of-Memory (OOM) errors when processing large files because:

  1. fs.open() loads entire files into memory using io.ReadAll()
  2. k6's file system cache stores complete file content as []byte
  3. A 12GB CSV file immediately consumes 12GB+ RAM during initialization

Root Cause: The file system module (internal/js/modules/k6/experimental/fs/cache.go:102) uses io.ReadAll() which is unsuitable for large file processing.

Impact: Users cannot process large CSV files (>RAM size) for load testing scenarios with extensive test data.

Before (OOM error):

import fs from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

const csvFile = await fs.open("12GB-file.csv");  // Loads entire file!
const parser = new csv.Parser(csvFile, { skipFirstLine: true });

After (memory-efficient):

import csv from 'k6/experimental/csv';

const parser = new csv.StreamingParser("12GB-file.csv", { skipFirstLine: true });

Performance improvement:

  • Memory usage: 12GB+ → ~64KB constant
  • Initialization: instant, instead of loading the entire file upfront
  • Scalability: Handles any file size without OOM

Checklist

  • I have performed a self-review of my code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added tests for my changes.
  • I have run linter and tests locally (make check) and all pass.

Checklist: Documentation (only for k6 maintainers and if relevant)

Please do not merge this PR until the following items are filled out.

  • I have added the correct milestone and labels to the PR.
  • I have updated the release notes: link
  • I have updated or added an issue to the k6-documentation: grafana/k6-docs#NUMBER if applicable
  • I have updated or added an issue to the TypeScript definitions: grafana/k6-DefinitelyTyped#NUMBER if applicable

Related PR(s)/Issue(s)

Closes #5080

Testing

All tests pass, including new streaming-specific tests:

- Add StreamingReader with 64KB buffer for large file processing
- Add StreamingParser class with same API as regular Parser
- Bypass k6 file caching to avoid loading entire files into memory
- Support all existing CSV parser options (skipFirstLine, asObjects, etc.)
- Add comprehensive test suite for streaming functionality
- Add usage example for large CSV files

Fixes grafana#5080
@zoroglucihat zoroglucihat requested a review from a team as a code owner August 22, 2025 23:12
@zoroglucihat zoroglucihat requested review from oleiade and codebien and removed request for a team August 22, 2025 23:12
@CLAassistant commented Aug 22, 2025

CLA assistant check
All committers have signed the CLA.

@codebien (Contributor) commented Sep 1, 2025

Hey @zoroglucihat, is a new API truly necessary?

The proposed solution doesn't appear to be ideal. Have you verified that the reported issue isn't primarily a bug? It is always recommended to discuss the introduction of a new API in the dedicated issue before submitting a new pull request.

In addition, this solution seems to be largely LLM-generated. If a clear explanation of the solution isn't provided, we will close this PR, as we have done previously.

For context, I am quoting the comment from #5066:

As a general note for future contributions (from anyone in the community), it is vital that all submitted code is thoroughly reviewed and tested by its author. When using AI tools to assist in development, strong human supervision is essential to ensure the final result is robust, idiomatic, and truly solves the problem at hand. This ensures the review process is productive and respectful of everyone's time.

Linked issue: CSV Parser causes OOM when loading big files (#5080)