Conversation

@kszucs (Member) commented Oct 14, 2025

Prototype implementation of an arrow-rs based page-pruning parquet reader for low-latency limit/offset queries.

It is a standalone library for now; it hasn't been integrated into the viewer yet.

Install

cd libs/libviewer
pip install maturin
maturin develop -r  

Index Dataset

dv --use-cache nvidia/OpenCodeReasoning index

This uses huggingface_hub to download and cache the dataset files.
It then creates a metadata file for each parquet file in the dataset
with the offset index included.

Remove --use-cache to directly download the files from the hub.
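
For context, here is a minimal sketch of what such a metadata file provides, assuming the arrow-rs parquet crate and a placeholder local file name (this is not the dv implementation itself): parsing the footer with the page indexes enabled exposes, for every data page, its byte range and first row index, which is what the pruning reader needs.

use parquet::file::metadata::ParquetMetaDataReader;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; in the viewer this would be a cached dataset file.
    let file = File::open("part-00000.parquet")?;

    // Parse the footer and also decode the offset/column indexes, which
    // record the byte offset, compressed size and first row index of
    // every data page in the file.
    let metadata = ParquetMetaDataReader::new()
        .with_page_indexes(true)
        .parse_and_finish(&file)?;

    println!("row groups: {}", metadata.num_row_groups());
    println!("offset index loaded: {}", metadata.offset_index().is_some());
    Ok(())
}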

Execute a limit/offset query

dv --use-cache nvidia/OpenCodeReasoning query --limit 10 --offset 0

This will query the dataset using the local metadata index files.
The scanner only reads the necessary parquet pages to minimize
network traffic.

Remove --use-cache to directly query data from the hub.
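
To illustrate the read path, here is a minimal local-file sketch using a hypothetical read_slice helper, assuming the arrow-rs parquet and arrow crates (the actual library reads from the hub asynchronously, so treat this as an approximation of the idea rather than the libviewer code): with the page index loaded, a RowSelection that skips offset rows and then selects limit rows lets the reader skip decoding pages that don't overlap the requested range.

use std::fs::File;

use parquet::arrow::arrow_reader::{
    ArrowReaderOptions, ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};

// Read rows [offset, offset + limit) from a local parquet file,
// decoding only the pages that overlap that range.
fn read_slice(
    path: &str,
    offset: usize,
    limit: usize,
) -> Result<Vec<arrow::record_batch::RecordBatch>, Box<dyn std::error::Error>> {
    let file = File::open(path)?;

    // Load the offset index so page-level skipping is possible.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    // Skip `offset` rows, then select `limit` rows; pages entirely
    // outside this range are never decoded.
    let selection = RowSelection::from(vec![
        RowSelector::skip(offset),
        RowSelector::select(limit),
    ]);

    let reader = builder.with_row_selection(selection).build()?;
    let batches = reader.collect::<Result<Vec<_>, _>>()?;
    Ok(batches)
}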

Integration and testing

Before covering it with tests, it would be nice to see what API the integration needs.

Supersedes #3199

kszucs and others added 26 commits October 14, 2025 23:45
cache: "poetry"
cache-dependency-path: |
${{ inputs.working-directory }}/poetry.lock
# cache: "poetry"
kszucs (Member Author):

I temporarily turned it off because it wasn't installing the optional libviewer dependency properly.

# Build with: docker build --target <service_name> -t <tag> .

ARG PYTHON_VERSION=3.12.11
FROM python:${PYTHON_VERSION}-slim AS viewer
kszucs (Member Author):

Building the Rust-based libviewer as a wheel so that the compiler toolchains are not included in the final Docker images.

optional = true

[tool.poetry.group.libviewer.dependencies]
libviewer = { path = "../libviewer", develop = true }
kszucs (Member Author):

Originally I added it as a mandatory dependency, but I wasn't able to convince poetry to skip installing it from source in the Docker image and use a prebuilt wheel instead, see https://github.com/huggingface/dataset-viewer/pull/3244/files#r2432505025.

Apparently path dependencies don't work well with compiled extension modules. Ideally we would build wheels for all the internal libs (libviewer, libcommon, libapi), but the dependency versions pinned in the pyproject files are looser than what we have in the poetry lockfiles, and some of the builds/tests are sensitive to those dependencies.

So I chose to define libviewer as an optional dependency that we install only for the relevant services: prebuilt wheels in the containers, and --with libviewer during local development.

@lhoestq (Member) left a comment:

Looks pretty good! The main point to address is raising TooBigRows when possible, to avoid OOMing the /rows worker.

// 4. collect the record batches into a single vector

let plan = self.plan(limit, offset).await?;
let files_to_index: Vec<IndexedFile> = plan
lhoestq (Member):

maybe add a comment to say that it's not used? (or remove it)

Comment on lines +103 to +106
# pa_table, truncated_columns = rows_index.query_truncated_binary(
# offset=offset, length=length
# )
pa_table = rows_index.query_with_page_pruning(offset=offset, length=length)
lhoestq (Member):

We can look into truncating binary data later (sometimes users have a column with very long binary values, and we truncate them when reading to avoid an OOM).

What we will need right away, though, is to raise TooBigRows if the resulting record batches are likely to cause an OOM (if they could use >300MB of RAM). We can use a simple heuristic based on the average row size in the row group to decide whether it's safe to run the query.
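
For illustration, a minimal sketch of that heuristic against the parquet crate's metadata types; the function name check_row_size, the 300MB budget and the plain String error are placeholders standing in for the real TooBigRows handling:

use parquet::file::metadata::ParquetMetaData;

// Hypothetical budget, matching the ~300MB figure mentioned above.
const MAX_RESULT_BYTES: i64 = 300 * 1024 * 1024;

// Rough pre-flight check: estimate the size of `limit` rows from the
// average row size across row groups, and refuse to run the query if
// the estimate exceeds the budget.
fn check_row_size(metadata: &ParquetMetaData, limit: usize) -> Result<(), String> {
    let (rows, bytes) = metadata
        .row_groups()
        .iter()
        .fold((0i64, 0i64), |(rows, bytes), rg| {
            (rows + rg.num_rows(), bytes + rg.total_byte_size())
        });
    if rows == 0 {
        return Ok(());
    }
    let avg_row_bytes = bytes / rows;
    let estimated = avg_row_bytes.saturating_mul(limit as i64);
    if estimated > MAX_RESULT_BYTES {
        Err(format!(
            "TooBigRows: ~{estimated} bytes estimated for {limit} rows (avg row size: {avg_row_bytes} bytes)"
        ))
    } else {
        Ok(())
    }
}

Using the uncompressed total_byte_size keeps the estimate closer to the in-memory footprint of the decoded batches than the compressed size would, so it errs on the conservative side.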
