Don't push empty rows #271

sjoubert · 2025-06-24T09:35:20Z

The current NEWLINE behavior of reading as much CR/LF as possible basically means empty rows are/should be ignored.

This fixes the issue of pushing an empty row that can occur when the data chunk ends in the middle of such a CR/LF stream. This can happen more or less artificially with with a block of empty lines or because the chunk ended right in the middle of a Windows CRLF endline (yes I was very unlucky, the data in my 10Mo+ file aligned just the wrong way).

Side note: I'm not very fond of the "ignore all empty rows" behavior, but that's another topic. Here I assumed the behavior is intended by design, or at least should be consistent in every cases.

The current NEWLINE behavior of reading as much CR/LF as possible basically means empty rows are/should be ignored. This fixes the issue of pushing an empty row that can occur when the data chunk ends in the middle of such a CR/LF stream (either because of a block of empty lines or if you are unlucky because the chunk ended right in the middle of a Windows CRLF endline)

vincentlaucsb · 2025-07-11T03:31:47Z

I would prefer if this behavior was added as an optional flag via CSVFormat and the VariableColumnPolicy enum, e.g. adding a DROP_EMPTY option.

I do not think adding another condition inside the main parse loop, which would slow performance for all users, is worth it to handle a very specific edge case.

sjoubert · 2025-07-11T08:11:05Z

Adding an explicit option DROP_EMPTY seems a good idea. I'll see if I can do something, but does that mean we need to stop reading as much CR, LF characters as possible? Otherwise not all empty rows would be kept for the users. But we probably go back to the initial issue of properly identifying endlines (which to me is the whole root issue of these discussion and behavior).

And to me, DROP_EMPTY doesn't fix the original bug, it only avoids making it happen if users actually select DROP_EMPTY. If not selected and the chunk border still falls between CR and LF then an "empty row" will be pushed and wrongly reported, if THROW. But there are no actual empty lines in the file there.

vincentlaucsb · 2025-07-13T19:51:28Z

Adding an explicit option DROP_EMPTY seems a good idea. I'll see if I can do something, but does that mean we need to stop reading as much CR, LF characters as possible? Otherwise not all empty rows would be kept for the users. But we probably go back to the initial issue of properly identifying endlines (which to me is the whole root issue of these discussion and behavior).

And to me, DROP_EMPTY doesn't fix the original bug, it only avoids making it happen if users actually select DROP_EMPTY. If not selected and the chunk border still falls between CR and LF then an "empty row" will be pushed and wrongly reported, if THROW. But there are no actual empty lines in the file there.

That's a valid concern. I'd like to see if there's a way to handle both Linux \n and Windows \r\n newlines without overly complicating the parsing loop.

sjoubert force-pushed the chunk-limit-crlf branch from e873d3e to 78b048b Compare June 24, 2025 11:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Don't push empty rows #271

Don't push empty rows #271

Uh oh!

sjoubert commented Jun 24, 2025 •

edited

Loading

Uh oh!

vincentlaucsb commented Jul 11, 2025 •

edited

Loading

Uh oh!

sjoubert commented Jul 11, 2025

Uh oh!

vincentlaucsb commented Jul 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Don't push empty rows #271

Are you sure you want to change the base?

Don't push empty rows #271

Uh oh!

Conversation

sjoubert commented Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vincentlaucsb commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sjoubert commented Jul 11, 2025

Uh oh!

vincentlaucsb commented Jul 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sjoubert commented Jun 24, 2025 •

edited

Loading

vincentlaucsb commented Jul 11, 2025 •

edited

Loading