-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-47103: [Statistics][C++] Implement Statistics specification attribute ARROW:null_count:approximate #47969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
@kou |
|
Yes. I'll do it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for approximate null counts in Arrow array statistics by extending the null_count field to support both exact (int64_t) and approximate (double) values using std::variant. This aligns with the existing pattern used for distinct_count.
Key changes:
- Changed
null_countfromstd::optional<int64_t>tostd::optional<CountType>(variant of int64_t and double) - Added
is_null_count_exactproperty to distinguish between exact and approximate null counts - Updated all related tests and comparison logic to handle the variant type
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| cpp/src/arrow/array/statistics.h | Changed null_count type from int64_t to CountType variant |
| cpp/src/arrow/record_batch.cc | Added logic to handle both exact and approximate null counts when creating statistics arrays |
| cpp/src/arrow/compare.cc | Updated equality comparison to use ArrayStatisticsOptionalValueEquals for null_count |
| cpp/src/arrow/record_batch_test.cc | Renamed test and added new test for approximate null count |
| cpp/src/arrow/array/statistics_test.cc | Added test for approximate null count and updated existing tests |
| cpp/src/arrow/array/array_test.cc | Updated variable type and test assertions to handle variant null_count |
| cpp/src/parquet/arrow/arrow_statistics_test.cc | Updated assertions to extract int64_t from variant null_count |
| python/pyarrow/includes/libarrow.pxd | Changed null_count type from int64_t to CArrayStatisticsCountType |
| python/pyarrow/array.pxi | Added is_null_count_exact property and updated null_count to handle variant |
| python/pyarrow/tests/parquet/test_parquet_file.py | Added assertion to verify is_null_count_exact is True |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@kou, can you check this out? |
3ea2e5b to
20d376f
Compare
kou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
remover logger header correct is_null_count_exact comment
|
Oh, I forgot to add Ruby related changes. I'll add them. |
20d376f to
3983dcb
Compare
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit c10847c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 14 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
Enable
ARROW:null_count:approximatesupport forarrow::ArrayStatistics, along with the corresponding GLib, Ruby and Python bindings.What changes are included in this PR?
Enable
ARROW:null_count:approximatein C++ and bind it toArrayStatisticsin GLib, Ruby and Python.Are these changes tested?
Yes, I ran the relevant unit tests.
Are there any user-facing changes?
Yes.
The type of
arrow::ArrayStatistics::null_counthas been changed fromstd::optional<int64_t>tostd::optional<CountType>New
garrow_array_statistics_is_null_count_exact()/garrow_array_statistics_get_null_count_{exact,approximate}()functions in GLib.Add support for approximate value in
Arrow::ArrayStatistics#null_countin Ruby.A new field
is_null_count_exacthas been added toArrayStatisticsin Python.GitHub Issue: [Statistics][C++] Implement Statistics specification attribute ARROW:null_count:approximate #47103