Conversation

@kszucs (Member) commented Oct 14, 2025

Prototype implementation of an arrow-rs based page-pruning parquet reader for low-latency limit/offset queries.

It is a standalone library for now; it hasn't been integrated into the viewer yet.

Install

cd libs/libviewer
pip install maturin
maturin develop -r  

Index Dataset

dv --use-cache nvidia/OpenCodeReasoning index

This uses huggingface_hub to download and cache the dataset files.
It then creates a metadata file for each parquet file in the dataset
with the offset index included.

Remove --use-cache to directly download the files from the hub.
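
For context, here is a minimal sketch of what such a metadata file provides, assuming the arrow-rs parquet crate and a placeholder local file name (this is not the dv implementation itself): parsing the footer with the page indexes enabled exposes, for every data page, its byte range and first row index, which is what the pruning reader needs.

use parquet::file::metadata::ParquetMetaDataReader;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; in the viewer this would be a cached dataset file.
    let file = File::open("part-00000.parquet")?;

    // Parse the footer and also decode the offset/column indexes, which
    // record the byte offset, compressed size and first row index of
    // every data page in the file.
    let metadata = ParquetMetaDataReader::new()
        .with_page_indexes(true)
        .parse_and_finish(&file)?;

    println!("row groups: {}", metadata.num_row_groups());
    println!("offset index loaded: {}", metadata.offset_index().is_some());
    Ok(())
}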

Execute a limit/offset query

dv --use-cache nvidia/OpenCodeReasoning query --limit 10 --offset 0

This will query the dataset using the local metadata index files.
The scanner only reads the necessary parquet pages to minimize
network traffic.

Remove --use-cache to directly query data from the hub.
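
To illustrate the read path, here is a minimal local-file sketch using a hypothetical read_slice helper, assuming the arrow-rs parquet and arrow crates (the actual library reads from the hub asynchronously, so treat this as an approximation of the idea rather than the libviewer code): with the page index loaded, a RowSelection that skips offset rows and then selects limit rows lets the reader skip decoding pages that don't overlap the requested range.

use std::fs::File;

use parquet::arrow::arrow_reader::{
    ArrowReaderOptions, ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};

// Read rows [offset, offset + limit) from a local parquet file,
// decoding only the pages that overlap that range.
fn read_slice(
    path: &str,
    offset: usize,
    limit: usize,
) -> Result<Vec<arrow::record_batch::RecordBatch>, Box<dyn std::error::Error>> {
    let file = File::open(path)?;

    // Load the offset index so page-level skipping is possible.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    // Skip `offset` rows, then select `limit` rows; pages entirely
    // outside this range are never decoded.
    let selection = RowSelection::from(vec![
        RowSelector::skip(offset),
        RowSelector::select(limit),
    ]);

    let reader = builder.with_row_selection(selection).build()?;
    let batches = reader.collect::<Result<Vec<_>, _>>()?;
    Ok(batches)
}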

Integration and testing

Before covering it with tests, it would be nice to see what API the integration needs.

Supersedes #3199

kszucs and others added 26 commits October 14, 2025 23:45
cache: "poetry"
cache-dependency-path: |
${{ inputs.working-directory }}/poetry.lock
# cache: "poetry"
kszucs (Member Author):

I temporarily turned it off because it wasn't installing the optional libviewer dependency properly.

# Build with: docker build --target <service_name> -t <tag> .

ARG PYTHON_VERSION=3.12.11
FROM python:${PYTHON_VERSION}-slim AS viewer
kszucs (Member Author):

Building the Rust-based libviewer as a wheel so that the compiler toolchains are not included in the final Docker images.

optional = true

[tool.poetry.group.libviewer.dependencies]
libviewer = { path = "../libviewer", develop = true }
kszucs (Member Author):

Originally I added it as a mandatory dependency, but I wasn't able to convince poetry to skip installing it from source in the Docker image and use a prebuilt wheel instead, see https://github.com/huggingface/dataset-viewer/pull/3244/files#r2432505025.

Apparently path dependencies don't work well with compiled extension modules. Ideally we would build wheels for all the internal libs (libviewer, libcommon, libapi), but the dependency versions pinned in the pyproject files are looser than what we have in the poetry lockfiles, and some of the builds/tests are sensitive to those dependencies.

So I chose to define libviewer as an optional dependency that we install only for the relevant services: prebuilt wheels in the containers, and --with libviewer during local development.

@lhoestq (Member) left a comment:

Looks pretty good! The main point to address is raising TooBigRows when possible, to avoid OOMing the /rows worker.

// 4. collect the record batches into a single vector

let plan = self.plan(limit, offset).await?;
let files_to_index: Vec<IndexedFile> = plan
lhoestq (Member):

maybe add a comment to say that it's not used? (or remove it)

Comment on lines +103 to +106
# pa_table, truncated_columns = rows_index.query_truncated_binary(
# offset=offset, length=length
# )
pa_table = rows_index.query_with_page_pruning(offset=offset, length=length)
lhoestq (Member):

We can look into truncating binary data later (sometimes users have a column with very long binary values, and we truncate them when reading to avoid an OOM).

What we will need right away, though, is to raise TooBigRows if the resulting record batches are likely to cause an OOM (if they could use >300MB of RAM). We can use a simple heuristic based on the average row size in the row group to decide whether it's safe to run the query.
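
For illustration, a minimal sketch of that heuristic against the parquet crate's metadata types; the function name check_row_size, the 300MB budget and the plain String error are placeholders standing in for the real TooBigRows handling:

use parquet::file::metadata::ParquetMetaData;

// Hypothetical budget, matching the ~300MB figure mentioned above.
const MAX_RESULT_BYTES: i64 = 300 * 1024 * 1024;

// Rough pre-flight check: estimate the size of `limit` rows from the
// average row size across row groups, and refuse to run the query if
// the estimate exceeds the budget.
fn check_row_size(metadata: &ParquetMetaData, limit: usize) -> Result<(), String> {
    let (rows, bytes) = metadata
        .row_groups()
        .iter()
        .fold((0i64, 0i64), |(rows, bytes), rg| {
            (rows + rg.num_rows(), bytes + rg.total_byte_size())
        });
    if rows == 0 {
        return Ok(());
    }
    let avg_row_bytes = bytes / rows;
    let estimated = avg_row_bytes.saturating_mul(limit as i64);
    if estimated > MAX_RESULT_BYTES {
        Err(format!(
            "TooBigRows: ~{estimated} bytes estimated for {limit} rows (avg row size: {avg_row_bytes} bytes)"
        ))
    } else {
        Ok(())
    }
}

Using the uncompressed total_byte_size keeps the estimate closer to the in-memory footprint of the decoded batches than the compressed size would, so it errs on the conservative side.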
