feat: primitive parquet reader with page pruning #3244
base: main
Conversation
…try install is called at the build phase
```diff
-    cache: "poetry"
-    cache-dependency-path: |
-      ${{ inputs.working-directory }}/poetry.lock
+    # cache: "poetry"
```
I temporarily turned it off because it wasn't installing the optional libviewer dependency properly.
```dockerfile
# Build with: docker build --target <service_name> -t <tag> .

ARG PYTHON_VERSION=3.12.11
FROM python:${PYTHON_VERSION}-slim AS viewer
```
Building the Rust-based libviewer as a wheel, so that the compiler toolchains are not included in the final Docker images.
```toml
optional = true

[tool.poetry.group.libviewer.dependencies]
libviewer = { path = "../libviewer", develop = true }
```
Originally I added it as a mandatory dependency, but I wasn't able to convince poetry to skip installing it from source in the Docker image and use a prebuilt wheel instead, see https://github.com/huggingface/dataset-viewer/pull/3244/files#r2432505025.
Apparently path dependencies don't work well with compiled extension modules. Ideally we should build wheels for all the internal libs (libviewer, libcommon, libapi), but the dependency versions pinned in the pyproject files are looser than what we have in the poetry lockfiles, and some of the builds/tests are sensitive to those dependencies.
So I chose to define libviewer as an optional dependency: we install it only in the relevant services, using prebuilt wheels in the containers and `--with libviewer` during local development.
Looks pretty good! The main point to address is raising TooBigRows when possible, to avoid OOMing the /rows worker.
```rust
// 4. collect the record batches into a single vector

let plan = self.plan(limit, offset).await?;
let files_to_index: Vec<IndexedFile> = plan
```
Maybe add a comment to say that it's not used? (Or remove it.)
```python
# pa_table, truncated_columns = rows_index.query_truncated_binary(
#     offset=offset, length=length
# )
pa_table = rows_index.query_with_page_pruning(offset=offset, length=length)
```
We can look later at truncating binary data (sometimes users have a column with very long binary values, and we truncate them when reading to avoid OOM).
What we will need right away, though, is to raise TooBigRows if the resulting record batches are likely to cause an OOM (if they could use more than ~300MB of RAM). We can use a simple heuristic based on the average row size in the row group to decide whether it's safe to run the query.
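A minimal sketch of such a heuristic, assuming pyarrow footer metadata is available for the indexed files. The ~300MB budget and the TooBigRows name come from this review; everything else (function names, the placeholder error class, the file name) is illustrative:

```python
import pyarrow.parquet as pq

MAX_QUERY_BYTES = 300 * 1024 * 1024  # ~300MB budget suggested above


class TooBigRows(Exception):
    """Stand-in for the worker's error type (not shown here)."""


def estimate_query_bytes(meta: pq.FileMetaData, offset: int, length: int) -> float:
    """Estimate the uncompressed bytes a limit/offset query would materialize,
    from the average row size of each overlapping row group."""
    total, start = 0.0, 0
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        end = start + rg.num_rows
        # number of requested rows that fall inside this row group
        overlap = max(0, min(end, offset + length) - max(start, offset))
        if overlap and rg.num_rows:
            total += overlap * (rg.total_byte_size / rg.num_rows)
        start = end
    return total


meta = pq.read_metadata("part-0.parquet")  # placeholder file name
if estimate_query_bytes(meta, offset=0, length=100) > MAX_QUERY_BYTES:
    raise TooBigRows("query would likely exceed the memory budget")
```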
Prototype implementation of an arrow-rs based page-pruning parquet reader for low-latency limit/offset queries.
It is a standalone library for now; it hasn't been integrated into the viewer yet.
Install
```bash
cd libs/libviewer
pip install maturin
maturin develop -r
```

Index Dataset
This uses `huggingface_hub` to download and cache the dataset files, then creates a metadata file, with the offset index included, for each parquet file in the dataset.
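The exact indexing command did not survive the page capture; conceptually, the step looks something like this hedged Python sketch (repo and file names are placeholders, and the actual indexer is the Rust library):

```python
from huggingface_hub import hf_hub_download
import pyarrow.parquet as pq

# download (and locally cache) one parquet file of the dataset
path = hf_hub_download(
    repo_id="user/dataset",     # placeholder
    filename="part-0.parquet",  # placeholder
    repo_type="dataset",
)

# read the footer metadata; the indexer persists this (together with the
# offset index, which records where every page starts) so later queries can
# compute exact byte ranges without re-fetching the footer from the Hub
meta = pq.read_metadata(path)
print(meta.num_rows, meta.num_row_groups)
```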
Remove `--use-cache` to download the files directly from the Hub.

Execute a limit/offset query
This queries the dataset using the local metadata index files. The scanner reads only the necessary parquet pages, to minimize network traffic.
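For intuition, here is a minimal Python sketch of the pruning idea at the row-group level; the real reader goes further and uses the parquet offset index to skip individual pages within a row group, and none of these names belong to the library's API:

```python
import pyarrow.parquet as pq


def row_groups_for_range(meta: pq.FileMetaData, offset: int, length: int) -> list[int]:
    """Indices of the row groups that overlap rows [offset, offset + length)."""
    selected, start = [], 0
    for i in range(meta.num_row_groups):
        end = start + meta.row_group(i).num_rows
        if end > offset and start < offset + length:
            selected.append(i)
        start = end
    return selected


pf = pq.ParquetFile("part-0.parquet")  # placeholder file name
groups = row_groups_for_range(pf.metadata, offset=1_000, length=100)
table = pf.read_row_groups(groups)  # the exact row slice is then cut from `table`
```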
Remove `--use-cache` to query the data directly from the Hub.

Integration and testing
Before covering it with tests, it would be nice to see what API the integration actually needs.
Supersedes #3199