feat: primitive parquet reader with page pruning #3199
Conversation
back to this PR - sorry for the delay

I created #3213 to continue this PR and integrate this in the /rows service :)

I'm hitting #3229 as well.
```diff
 def from_parquet_metadata_items(
-    parquet_file_metadata_items: list[ParquetFileMetadataItem],
+    parquet_files: list[ParquetFileMetadataItem],
     features: Optional[Features],
```
The parquet_file_metadata_items and parquet_files_metadata variable names were confusing due to the extensive use and separation of data and metadata files, so I renamed these variables to parquet_files.
```diff
-        parquet_file_metadata_items, key=lambda parquet_file_metadata: parquet_file_metadata["filename"]
-    )
-    parquet_files_urls = [parquet_file_metadata["url"] for parquet_file_metadata in parquet_files_metadata]
+    parquet_files_urls = [f["url"] for f in parquet_files]
```
The parquet files used to be sorted here, but the page pruning reader requires them to be sorted as well, so I moved the sorting to the new RowsIndex._init_dataset_info() method below.
```diff
     parquet_metadata_directory: StrPath,
     max_arrow_data_in_memory: int,
     unsupported_features: list[FeatureType] = [],
+    data_store="hf://"
```
It is supposed to correspond to HTTPS, but we cannot pass the Python filesystem object down to Rust, so we need to use a URI instead.
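Putting the hunks around this thread together, the construction presumably looks like the sketch below; apart from data_store and metadata_store, the argument names are hypothetical stand-ins, since the full Dataset constructor signature isn't visible in this diff.

```python
from libviewer import Dataset

# Minimal sketch, assuming a `files` argument; only data_store and
# metadata_store are taken from the diff. Both stores are plain URIs
# because a Python fsspec filesystem object cannot cross the Rust boundary.
parquet_metadata_directory = "/storage/parquet-metadata"  # illustrative path
dataset = Dataset(
    files=[{"filename": "0000.parquet", "url": "hf://datasets/user/repo/0000.parquet"}],  # hypothetical
    data_store="hf://",                                     # remote parquet data
    metadata_store=f"file://{parquet_metadata_directory}",  # local page-index metadata
)
```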
```diff
+        metadata_store=f"file://{parquet_metadata_directory}"
     )

+    def _init_dataset_info(self, parquet_metadata_directory: StrPath):
```
Pulled some logic out of _init_parquet_index(); the new method is now responsible for querying the mongo cache and parsing out the revision, the parquet_files (in sorted order), and the features. When the features are absent, we read the first file's parquet metadata to get the corresponding Arrow schema, so that even an empty result set has one.
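A rough sketch of that flow; the helper and field names are hypothetical, only the sort key and the schema fallback are grounded in this PR:

```python
import os

import pyarrow.parquet as pq
from datasets import Features


def _init_dataset_info(self, parquet_metadata_directory: StrPath) -> None:
    # Hypothetical helper: fetch the cached parquet metadata response from mongo.
    response = query_mongo_cache(self.dataset, self.config, self.split)
    self.revision = response["dataset_git_revision"]
    # The page pruning reader needs a deterministic file order, so the
    # sorting that used to live in from_parquet_metadata_items() moved here.
    self.parquet_files = sorted(response["parquet_files"], key=lambda f: f["filename"])
    features = response.get("features")
    if features is None:
        # Fall back to the first file's parquet metadata so that even an
        # empty result set carries the right Arrow schema.
        first = self.parquet_files[0]  # "parquet_metadata_subpath" below is an assumed field name
        schema = pq.read_schema(os.path.join(parquet_metadata_directory, first["parquet_metadata_subpath"]))
        features = Features.from_arrow_schema(schema)
    self.features = features
```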
| f"Create libviewer.Dataset for dataset={self.dataset}, config={self.config}, split={self.split}" | ||
| ) | ||
| try: | ||
| from libviewer import Dataset |
This import should be mandatory, but I still need to update the build environments to include the Rust toolchain.
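Presumably the guard currently looks something like this; the fallback branch is an assumption:

```python
# Optional import for now; once the build images ship the Rust toolchain,
# the try/except can be dropped and the import made mandatory.
try:
    from libviewer import Dataset
except ImportError:
    Dataset = None  # assumption: callers fall back to the pre-existing reader
```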
```diff
     dataset: str,
     config: str,
     split: str,
+    data_store="hf://"
```
Required for testing purposes.
I also noticed that Indexer only serves as a RowsIndex factory; maybe we should instantiate the RowsIndex objects directly so we wouldn't need to wire the parameters through.
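Schematically, the suggestion amounts to dropping the factory hop; the names below mirror the diff, but the call shapes are illustrative:

```python
# Today (illustrative): parameters are wired through the Indexer factory.
indexer = Indexer(parquet_metadata_directory=parquet_metadata_directory, data_store="hf://")
rows_index = indexer.get_rows_index(dataset="user/repo", config="default", split="train")

# Suggested alternative: construct the RowsIndex directly, so data_store and
# friends don't have to be threaded through the factory.
rows_index = RowsIndex(dataset="user/repo", config="default", split="train", data_store="hf://")
```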
…try install is called at the build phase
Reopened from an upstream branch so that the e2e tests can properly run, closing in favor of #3244
Prototype implementation for an arrow-rs based page pruning parquet reader for low latency limit/offset queries.
It is a standalone library for now; it hasn't been integrated into the viewer yet.
Install

```
cd libs/libviewer
pip install maturin
maturin develop -r
```

Index Dataset
This uses `huggingface_hub` to download and cache the dataset files. Then it creates a metadata file for each parquet file in the dataset, with the offset index included.
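The download-and-cache step presumably reduces to something like the following `huggingface_hub` call (the repo and file names are made up):

```python
from huggingface_hub import hf_hub_download

# Fetch one parquet shard into the local HF cache; the indexer then reads
# its footer to write a per-file metadata file that includes the offset index.
local_path = hf_hub_download(
    repo_id="user/repo",
    filename="default/train/0000.parquet",
    repo_type="dataset",
)
```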
Remove `--use-cache` to directly download the files from the hub.

Execute a limit/offset query
This will query the dataset using the local metadata index files.
The scanner only reads the necessary parquet pages to minimize the
network traffic.
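Assuming a query entry point roughly like the following (the method name and signature are guesses, not the actual libviewer API):

```python
# Hypothetical: fetch rows 100..119. With page pruning, only the parquet
# pages overlapping that row range are fetched over the network.
table = dataset.query(offset=100, limit=20)
```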
Remove `--use-cache` to directly query data from the hub.

Integration and testing
Before covering it with tests, it would be nice to see the necessary API for integration.