-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Description
Describe the enhancement requested
Background
I am the developer of Carpet, a library that builds upon the Parquet Java project.
Carpet reuses the primitives provided by Parquet Java library: to instantiate ParquetWriter and ParquetReader classes, I use and extend existing builders. To support all Parquet Java features transparently to users, Carpet exposes the same configuration options as the underlying library, and transitively reuse all logic implemented by ParquetWriter, InternalParquetRecordWriter, ParquetReader and InternalParquetRecordReader.
My goal is to implement data partitioning and parallel read/write capabilities in Carpet. I want to continue reusing existing builders and logic, but with additional abstractions for partitioning and parallelism. To support this, I need to build multiple instances of ParquetWriter/ParquetReader based on a single builder configuration, only changing the target file to write/read before calling the build method.
With the existing builders, it's not possible to implement this feature because an Input/OutputFile is a required argument in the static builder-creation methods. This forces the file to be defined at the very beginning of the configuration process, preventing the builder instance from being reused for different files.
Suggested change
I propose to add empty constructor builders for ParquetWriter and ParquetReader classes.
The builder pattern usually allows configuring all options in different stages and order, but guarantees that all required parameters are set and consistent for the instances to be created. Why force setting the file at the beginning instead of allowing it to be set later?
ParquetWriter<Group> writer = ExampleParquetWriter.builder(outputFile)
.withAllocator(allocator)
.withCompressionCodec(UNCOMPRESSED)
.withRowGroupSize(1024)
.withPageSize(1024)
.withDictionaryPageSize(512)
.enableDictionaryEncoding()
.withValidation(false)
.withWriterVersion(version)
.withConf(conf)
.build();Could also be valid as:
ExampleParquetWriter.Builder<Group> builder = ExampleParquetWriter.builder()
.withAllocator(allocator)
.withCompressionCodec(UNCOMPRESSED)
.withRowGroupSize(1024)
.withPageSize(1024)
.withDictionaryPageSize(512)
.enableDictionaryEncoding()
.withValidation(false)
.withWriterVersion(version)
.withConf(conf);
ParquetWriter<Group> writer1 = builder.withFile(outputFile1).build();
...
ParquetWriter<Group> writer2 = builder.withFile(outputFile2).build();My proposal is to:
- Introduce no-argument builder() static methods for the ParquetWriter and ParquetReader builders
- Allow setting the file later in the builder configuration process (only for InputFile and OutputFile types, leaving apart the deprecated Hadoop Path)
- Ensure validation is performed within the
buildandwithFilemethods to throw an exception if the file has not been set, preserving the builder's safety guarantees.
These changes would be fully backward-compatible, as the existing builder(file) methods would remain untouched.
This would allow for more flexible usage patterns and better support for advanced use cases like partitioning and parallel processing in Carpet or even as part of Parquet Java.
Component(s)
Core