
Commit d7d847a

[Parquet] Add tests for IO/CPU access in parquet reader (#7971)
# Which issue does this PR close?

- Part of #8000
- Related to #7850

# Rationale for this change

There is quite a bit of code in the current Parquet sync and async readers related to IO patterns that I do not think is covered by existing tests. As I refactor the guts of the readers into the PushDecoder, I would like to ensure we don't introduce regressions in existing functionality.

I would like to add tests that cover the IO patterns of the Parquet Reader so I don't break it.

# What changes are included in this PR?

Add tests which:

1. Create a temporary parquet file with a known row group structure
2. Read data from that file using the Arrow Parquet Reader, recording the IO operations
3. Assert the expected IO patterns based on the read operations in a human-understandable form

This is done for both the sync and async readers.

I am sorry this is such a massive PR, but it is entirely tests and I think it is quite important. I could break the sync or async tests into their own PR, but this seems unnecessary.

# Are these changes tested?

Yes, indeed the entire PR is only tests

# Are there any user-facing changes?
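
As an illustration of steps 2 and 3 in the description above, here is a minimal, std-only sketch of recording IO operations and asserting on the resulting log. It is not the code added by this commit: `IoOp`, `RecordingReader`, and `demo_log` are hypothetical names, an in-memory `Cursor` stands in for the temporary parquet file, and a plain vector comparison stands in for whatever assertion style the real tests use.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// One logged IO operation (hypothetical type, for illustration only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum IoOp {
    Seek { to: u64 },
    Read { offset: u64, len: usize },
}

/// Wraps any `Read + Seek` source and records every operation performed on it.
pub struct RecordingReader<R> {
    inner: R,
    pos: u64,
    pub log: Vec<IoOp>,
}

impl<R: Read + Seek> RecordingReader<R> {
    pub fn new(mut inner: R) -> io::Result<Self> {
        // Start tracking from wherever the wrapped source currently is.
        let pos = inner.stream_position()?;
        Ok(Self { inner, pos, log: Vec::new() })
    }
}

impl<R: Read + Seek> Read for RecordingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        // Log the offset the read started at and how many bytes it returned.
        self.log.push(IoOp::Read { offset: self.pos, len: n });
        self.pos += n as u64;
        Ok(n)
    }
}

impl<R: Read + Seek> Seek for RecordingReader<R> {
    fn seek(&mut self, from: SeekFrom) -> io::Result<u64> {
        let to = self.inner.seek(from)?;
        self.pos = to;
        self.log.push(IoOp::Seek { to });
        Ok(to)
    }
}

/// Reads a 4-byte "footer" from the end of a 16-byte buffer, then a 3-byte
/// "page" at offset 2, and returns the recorded IO log.
pub fn demo_log() -> io::Result<Vec<IoOp>> {
    let data: Vec<u8> = (0u8..16).collect();
    let mut reader = RecordingReader::new(Cursor::new(data))?;

    let mut footer = [0u8; 4];
    reader.seek(SeekFrom::End(-4))?;
    reader.read_exact(&mut footer)?;

    let mut page = [0u8; 3];
    reader.seek(SeekFrom::Start(2))?;
    reader.read_exact(&mut page)?;

    Ok(reader.log)
}
```

A test can then compare the log returned by `demo_log()` against the expected sequence of seeks and reads, so any change in access pattern shows up as a readable difference.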
1 parent f87f60e commit d7d847a

6 files changed: +1586 -7 lines changed

parquet/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@ base64 = { version = "0.22", default-features = false, features = ["std"] }
 criterion = { version = "0.5", default-features = false, features = ["async_futures"] }
 snap = { version = "1.0", default-features = false }
 tempfile = { version = "3.0", default-features = false }
+insta = "1.43.1"
 brotli = { version = "8.0", default-features = false, features = ["std"] }
 flate2 = { version = "1.0", default-features = false, features = ["rust_backend"] }
 lz4_flex = { version = "0.11", default-features = false, features = ["std", "frame"] }

parquet/src/file/reader.rs

Lines changed: 5 additions & 4 deletions
@@ -48,11 +48,12 @@ pub trait Length {
 /// Generates [`Read`]ers to read chunks of a Parquet data source.
 ///
 /// The Parquet reader uses [`ChunkReader`] to access Parquet data, allowing
-/// multiple decoders to read concurrently from different locations in the same file.
+/// multiple decoders to read concurrently from different locations in the same
+/// file.
 ///
-/// The trait provides:
-/// * random access (via [`Self::get_bytes`])
-/// * sequential (via [`Self::get_read`])
+/// The trait functions both as a reader and a factory for readers.
+/// * random access via [`Self::get_bytes`]
+/// * sequential access via the reader returned via factory method [`Self::get_read`]
 ///
 /// # Provided Implementations
 /// * [`File`] for reading from local file system
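
The "reader and factory for readers" wording in the doc comment above can be illustrated with a small std-only sketch. `ChunkSource` and `InMemory` below are hypothetical stand-ins, not the crate's `ChunkReader` trait: only the method names `get_bytes` and `get_read` come from the doc comment, while the signatures, the `Vec<u8>` return type, and the `io::Result` error type are assumptions made to keep the example self-contained.

```rust
use std::io::{self, Cursor, Read};

/// Simplified stand-in for a chunk-reading trait (hypothetical; the real
/// trait's signatures may differ).
pub trait ChunkSource {
    /// The sequential reader type handed out by `get_read`.
    type T: Read;

    /// Factory method: returns a new, independent reader starting at `start`.
    fn get_read(&self, start: u64) -> io::Result<Self::T>;

    /// Random access: returns exactly `length` bytes starting at `start`.
    fn get_bytes(&self, start: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// In-memory data source used only for this illustration.
pub struct InMemory(pub Vec<u8>);

impl InMemory {
    /// Converts `start` into an index, rejecting offsets past the end.
    fn offset(&self, start: u64) -> io::Result<usize> {
        if start > self.0.len() as u64 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "start is past the end of the data",
            ));
        }
        Ok(start as usize)
    }
}

impl ChunkSource for InMemory {
    type T = Cursor<Vec<u8>>;

    fn get_read(&self, start: u64) -> io::Result<Self::T> {
        let start = self.offset(start)?;
        // Each call copies the tail so readers never share a cursor position.
        Ok(Cursor::new(self.0[start..].to_vec()))
    }

    fn get_bytes(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        let start = self.offset(start)?;
        let end = start
            .checked_add(length)
            .filter(|end| *end <= self.0.len())
            .ok_or_else(|| {
                io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "range extends past the end of the data",
                )
            })?;
        Ok(self.0[start..end].to_vec())
    }
}

/// Uses one source for a random-access read and two independent sequential
/// readers positioned at different offsets.
pub fn demo_chunks() -> io::Result<(Vec<u8>, Vec<u8>, Vec<u8>)> {
    let source = InMemory((0u8..10).collect());

    let random = source.get_bytes(2, 3)?;

    let mut a = source.get_read(0)?;
    let mut b = source.get_read(7)?;
    let mut head = [0u8; 2];
    a.read_exact(&mut head)?;
    let mut tail = Vec::new();
    b.read_to_end(&mut tail)?;

    Ok((random, head.to_vec(), tail))
}
```

Because `get_read` takes `&self` and returns a fresh reader each time, several consumers can read from different offsets of the same source without coordinating a shared cursor.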
