Skip to content

Conversation

@sjoubert
Copy link

@sjoubert sjoubert commented Jun 24, 2025

The current NEWLINE behavior of reading as much CR/LF as possible basically means empty rows are/should be ignored.

This fixes the issue of pushing an empty row that can occur when the data chunk ends in the middle of such a CR/LF stream. This can happen more or less artificially with with a block of empty lines or because the chunk ended right in the middle of a Windows CRLF endline (yes I was very unlucky, the data in my 10Mo+ file aligned just the wrong way).

Side note: I'm not very fond of the "ignore all empty rows" behavior, but that's another topic. Here I assumed the behavior is intended by design, or at least should be consistent in every cases.

The current NEWLINE behavior of reading as much CR/LF as possible
basically means empty rows are/should be ignored.
This fixes the issue of pushing an empty row that can occur when the
data chunk ends in the middle of such a CR/LF stream (either because of
a block of empty lines or if you are unlucky because the chunk ended
right in the middle of a Windows CRLF endline)
@vincentlaucsb
Copy link
Owner

vincentlaucsb commented Jul 11, 2025

I would prefer if this behavior was added as an optional flag via CSVFormat and the VariableColumnPolicy enum, e.g. adding a DROP_EMPTY option.

image

I do not think adding another condition inside the main parse loop, which would slow performance for all users, is worth it to handle a very specific edge case.

@sjoubert
Copy link
Author

Adding an explicit option DROP_EMPTY seems a good idea. I'll see if I can do something, but does that mean we need to stop reading as much CR, LF characters as possible? Otherwise not all empty rows would be kept for the users. But we probably go back to the initial issue of properly identifying endlines (which to me is the whole root issue of these discussion and behavior).

And to me, DROP_EMPTY doesn't fix the original bug, it only avoids making it happen if users actually select DROP_EMPTY. If not selected and the chunk border still falls between CR and LF then an "empty row" will be pushed and wrongly reported, if THROW. But there are no actual empty lines in the file there.

@vincentlaucsb
Copy link
Owner

Adding an explicit option DROP_EMPTY seems a good idea. I'll see if I can do something, but does that mean we need to stop reading as much CR, LF characters as possible? Otherwise not all empty rows would be kept for the users. But we probably go back to the initial issue of properly identifying endlines (which to me is the whole root issue of these discussion and behavior).

And to me, DROP_EMPTY doesn't fix the original bug, it only avoids making it happen if users actually select DROP_EMPTY. If not selected and the chunk border still falls between CR and LF then an "empty row" will be pushed and wrongly reported, if THROW. But there are no actual empty lines in the file there.

That's a valid concern. I'd like to see if there's a way to handle both Linux \n and Windows \r\n newlines without overly complicating the parsing loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants