
@kszucs (Member) commented Jun 10, 2025

Prototype implementation of an arrow-rs-based, page-pruning parquet reader for low-latency limit/offset queries.

It is a standalone library for now; it hasn't been integrated into the viewer yet.

Install

cd libs/libviewer
pip install maturin
maturin develop -r  

Index Dataset

dv --use-cache nvidia/OpenCodeReasoning index

This uses huggingface_hub to download and cache the dataset files.
It then creates a metadata file for each parquet file in the dataset,
with the offset index included.

Remove --use-cache to directly download the files from the hub.

Execute a limit/offset query

dv --use-cache nvidia/OpenCodeReasoning query --limit 10 --offset 0

This will query the dataset using the local metadata index files.
The scanner only reads the necessary parquet pages to minimize the
network traffic.

Remove --use-cache to directly query data from the hub.

Integration and testing

Before covering it with tests, it would be nice to settle on the API needed for integration.

@lhoestq (Member) commented Jul 7, 2025

Back to this PR; sorry for the delay.

@lhoestq (Member) commented Jul 7, 2025

I created #3213 to continue this PR and integrate this in the /rows service :)

@kszucs kszucs marked this pull request as ready for review September 15, 2025 10:17
@kszucs (Member, Author) commented Sep 15, 2025

I'm hitting #3229 as well.

 def from_parquet_metadata_items(
-    parquet_file_metadata_items: list[ParquetFileMetadataItem],
     features: Optional[Features],
+    parquet_files: list[ParquetFileMetadataItem],

The parquet_file_metadata_items and parquet_files_metadata variable names were confusing given the extensive use and separation of data and metadata files, so I renamed these variables to parquet_files.

-    parquet_file_metadata_items, key=lambda parquet_file_metadata: parquet_file_metadata["filename"]
-)
-parquet_files_urls = [parquet_file_metadata["url"] for parquet_file_metadata in parquet_files_metadata]
+parquet_files_urls = [f["url"] for f in parquet_files]

The parquet files used to be sorted here, but since the page pruning reader requires them to be sorted as well, I moved the sorting to the new RowsIndex._init_dataset_info() method below.

parquet_metadata_directory: StrPath,
max_arrow_data_in_memory: int,
unsupported_features: list[FeatureType] = [],
data_store="hf://"

It is supposed to correspond to HTTPS, but we cannot pass the Python filesystem object down to Rust, so we need to use a URI instead.

metadata_store=f"file://{parquet_metadata_directory}"
)

def _init_dataset_info(self, parquet_metadata_directory: StrPath):

Pulled some logic out of _init_parquet_index(); this method is now responsible for querying the mongo cache and parsing out the revision, the parquet_files (in sorted order), and the features. When the features are absent, we read the first file's parquet metadata to obtain the corresponding arrow schema, so that an empty result set still has the right columns.

f"Create libviewer.Dataset for dataset={self.dataset}, config={self.config}, split={self.split}"
)
try:
from libviewer import Dataset

This import should be mandatory, but I still need to update the build environments to include the rust toolchain.

dataset: str,
config: str,
split: str,
data_store="hf://"

Required for testing purposes.

I also noticed that Indexer only serves as a RowsIndex factory; maybe we should instantiate RowsIndex objects directly so we wouldn't need to wire the parameters through.

@kszucs (Member, Author) commented Oct 14, 2025

Reopened from an upstream branch so that the e2e tests can run properly; closing in favor of #3244.

@kszucs kszucs closed this Oct 14, 2025