
@kszucs (Member) commented Jun 10, 2025

Prototype implementation of an arrow-rs-based, page-pruning parquet reader for low-latency limit/offset queries.

It is a standalone library for now; it hasn't been integrated into the viewer yet.

Install

cd libs/libviewer
pip install maturin
maturin develop -r  

Index Dataset

dv --use-cache nvidia/OpenCodeReasoning index

This uses huggingface_hub to download and cache the dataset files.
It then creates a metadata file for each parquet file in the dataset,
with the offset index included.

Remove --use-cache to directly download the files from the hub.

Execute a limit/offset query

dv --use-cache nvidia/OpenCodeReasoning query --limit 10 --offset 0

This will query the dataset using the local metadata index files.
The scanner only reads the necessary parquet pages to minimize the
network traffic.

Remove --use-cache to directly query data from the hub.

Integration and testing

Before covering it with tests, it would be nice to settle on the API needed for integration.

@lhoestq (Member) commented Jul 7, 2025

Back to this PR; sorry for the delay.

@lhoestq (Member) commented Jul 7, 2025

I created #3213 to continue this PR and integrate this in the /rows service :)

@kszucs kszucs marked this pull request as ready for review September 15, 2025 10:17
@kszucs (Member, Author) commented Sep 15, 2025

I'm hitting #3229 as well.

 def from_parquet_metadata_items(
-    parquet_file_metadata_items: list[ParquetFileMetadataItem],
     features: Optional[Features],
+    parquet_files: list[ParquetFileMetadataItem],

The parquet_file_metadata_items and parquet_files_metadata variable names were confusing given the extensive use and separation of data and metadata files, so I renamed these variables to parquet_files.

-    parquet_file_metadata_items, key=lambda parquet_file_metadata: parquet_file_metadata["filename"]
-)
-parquet_files_urls = [parquet_file_metadata["url"] for parquet_file_metadata in parquet_files_metadata]
+parquet_files_urls = [f["url"] for f in parquet_files]

The parquet files used to be sorted here, but since the page pruning reader requires them to be sorted as well, I moved the sorting to the new RowsIndex._init_dataset_info() method below.

parquet_metadata_directory: StrPath,
max_arrow_data_in_memory: int,
unsupported_features: list[FeatureType] = [],
data_store="hf://"

It is supposed to correspond to HTTPS, but we cannot pass the Python filesystem object down to Rust, so we need to use a URI instead.

metadata_store=f"file://{parquet_metadata_directory}"
)

def _init_dataset_info(self, parquet_metadata_directory: StrPath):

Pulled some logic out of _init_parquet_index(); this method is now responsible for querying the mongo cache and parsing out the revision, the parquet_files (in sorted order), and the features. When the features are absent, we read the first file's parquet metadata to obtain the corresponding arrow schema, so that an empty result set still has the right columns.

f"Create libviewer.Dataset for dataset={self.dataset}, config={self.config}, split={self.split}"
)
try:
from libviewer import Dataset

This import should be mandatory, but I still need to update the build environments to include the rust toolchain.

dataset: str,
config: str,
split: str,
data_store="hf://"

Required for testing purposes.

I also noticed that Indexer only serves as a RowsIndex factory; maybe we should instantiate RowsIndex objects directly so we wouldn't need to wire the parameters through.

@kszucs (Member, Author) commented Oct 14, 2025

Reopened from an upstream branch so that the e2e tests can run properly; closing in favor of #3244.

@kszucs kszucs closed this Oct 14, 2025