Conversation

@andishgar
Contributor

@andishgar andishgar commented Oct 27, 2025

Rationale for this change

Enable ARROW:null_count:approximate support for arrow::ArrayStatistics, along with the corresponding GLib, Ruby and Python bindings.

What changes are included in this PR?

Enable ARROW:null_count:approximate in C++ and bind it to ArrayStatistics in GLib, Ruby and Python.

Are these changes tested?

Yes, I ran the relevant unit tests.

Are there any user-facing changes?

Yes.

  • The type of arrow::ArrayStatistics::null_count has changed from std::optional<int64_t> to std::optional<CountType>.

  • GLib gains the new functions garrow_array_statistics_is_null_count_exact() and garrow_array_statistics_get_null_count_{exact,approximate}().

  • Arrow::ArrayStatistics#null_count in Ruby now supports approximate values.

  • ArrayStatistics in Python gains a new is_null_count_exact field.

  • GitHub Issue: [Statistics][C++] Implement Statistics specification attribute ARROW:null_count:approximate #47103
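
For C++ callers, the type change in the first bullet means null_count can no longer be read directly as an int64_t. The following is a minimal sketch of handling both alternatives, assuming CountType is a std::variant<int64_t, double> as the review summary in this thread describes. StatisticsSketch, IsNullCountExact, DescribeNullCount and ExampleUsage are hypothetical names for illustration, not Arrow APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <variant>

// Hypothetical stand-in for arrow::ArrayStatistics; only the null_count
// member discussed in this PR is modeled.
struct StatisticsSketch {
  using CountType = std::variant<int64_t, double>;
  std::optional<CountType> null_count;
};

// True only when a null count is present and holds the exact alternative.
inline bool IsNullCountExact(const StatisticsSketch& stats) {
  return stats.null_count.has_value() &&
         std::holds_alternative<int64_t>(*stats.null_count);
}

// Renders the three possible states: absent, exact, approximate.
inline std::string DescribeNullCount(const StatisticsSketch& stats) {
  if (!stats.null_count.has_value()) return "unknown";
  if (const auto* exact = std::get_if<int64_t>(&*stats.null_count)) {
    return "exact:" + std::to_string(*exact);
  }
  return "approximate:" + std::to_string(std::get<double>(*stats.null_count));
}

// Usage example covering an absent count and an exact count.
inline void ExampleUsage() {
  StatisticsSketch stats;
  assert(DescribeNullCount(stats) == "unknown");
  stats.null_count = StatisticsSketch::CountType{int64_t{3}};
  assert(IsNullCountExact(stats));
}
```

Code that previously dereferenced null_count as an integer would instead go through std::get_if or std::get, as in the sketch.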

@github-actions

⚠️ GitHub issue #47103 has been automatically assigned in GitHub to PR creator.

@andishgar andishgar marked this pull request as draft October 28, 2025 00:08
@andishgar andishgar marked this pull request as ready for review October 28, 2025 08:45
@andishgar
Contributor Author

@kou
Regarding the Ruby binding, would it be possible to ask someone to work on it?

@kou
Member

kou commented Oct 29, 2025

Yes. I'll do it.

@kou kou requested a review from Copilot October 29, 2025 05:50

Copilot AI left a comment

Pull Request Overview

This PR adds support for approximate null counts in Arrow array statistics by extending the null_count field to support both exact (int64_t) and approximate (double) values using std::variant. This aligns with the existing pattern used for distinct_count.

Key changes:

  • Changed null_count from std::optional<int64_t> to std::optional<CountType> (variant of int64_t and double)
  • Added is_null_count_exact property to distinguish between exact and approximate null counts
  • Updated all related tests and comparison logic to handle the variant type
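
One consequence of the variant type for comparison logic is that an exact count and an approximate count with the same numeric value are different alternatives. The sketch below shows this using only the standard library's operator== for std::optional and std::variant. It is not Arrow's ArrayStatisticsOptionalValueEquals, whose treatment of floating-point values is not described in this thread. NullCount, SameNullCount and ExampleComparison are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <variant>

using CountType = std::variant<int64_t, double>;
using NullCount = std::optional<CountType>;  // hypothetical alias

// Equal when both are absent, or both hold the same alternative and value.
inline bool SameNullCount(const NullCount& left, const NullCount& right) {
  return left == right;
}

// Usage example: exact 3 and approximate 3.0 do not compare equal.
inline void ExampleComparison() {
  assert(SameNullCount(CountType{int64_t{3}}, CountType{int64_t{3}}));
  assert(!SameNullCount(CountType{int64_t{3}}, CountType{3.0}));
}
```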

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| cpp/src/arrow/array/statistics.h | Changed null_count type from int64_t to CountType variant |
| cpp/src/arrow/record_batch.cc | Added logic to handle both exact and approximate null counts when creating statistics arrays |
| cpp/src/arrow/compare.cc | Updated equality comparison to use ArrayStatisticsOptionalValueEquals for null_count |
| cpp/src/arrow/record_batch_test.cc | Renamed test and added new test for approximate null count |
| cpp/src/arrow/array/statistics_test.cc | Added test for approximate null count and updated existing tests |
| cpp/src/arrow/array/array_test.cc | Updated variable type and test assertions to handle variant null_count |
| cpp/src/parquet/arrow/arrow_statistics_test.cc | Updated assertions to extract int64_t from variant null_count |
| python/pyarrow/includes/libarrow.pxd | Changed null_count type from int64_t to CArrayStatisticsCountType |
| python/pyarrow/array.pxi | Added is_null_count_exact property and updated null_count to handle variant |
| python/pyarrow/tests/parquet/test_parquet_file.py | Added assertion to verify is_null_count_exact is True |
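
The record_batch.cc change is described as handling both alternatives when building statistics arrays. The following is a minimal sketch of the key selection this implies, assuming the exact counterpart of ARROW:null_count:approximate is the key ARROW:null_count:exact from the Arrow statistics schema. NullCountKey and ExampleKeys are hypothetical helpers, not the code merged in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

using CountType = std::variant<int64_t, double>;

// Hypothetical helper: pick the statistics key that matches the alternative
// currently held by the null count.
inline std::string NullCountKey(const CountType& count) {
  return std::holds_alternative<int64_t>(count)
             ? "ARROW:null_count:exact"
             : "ARROW:null_count:approximate";
}

// Usage example: an integer count maps to the exact key, a double to the
// approximate key.
inline void ExampleKeys() {
  assert(NullCountKey(CountType{int64_t{29}}) == "ARROW:null_count:exact");
  assert(NullCountKey(CountType{29.0}) == "ARROW:null_count:approximate");
}
```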

@kou kou self-requested a review as a code owner October 29, 2025 05:55
@andishgar
Contributor Author

@kou, can you check this out?

@github-actions github-actions bot added awaiting changes and removed awaiting review labels Nov 4, 2025
@andishgar andishgar force-pushed the null_count_approximate branch from 3ea2e5b to 20d376f Compare November 9, 2025 13:23
@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Nov 9, 2025
@andishgar andishgar requested a review from kou November 9, 2025 16:59
@github-actions github-actions bot added awaiting changes and removed awaiting change review labels Nov 10, 2025
Member

@kou kou left a comment

+1

@github-actions github-actions bot added awaiting merge and removed awaiting changes labels Nov 10, 2025
@kou
Member

kou commented Nov 10, 2025

Oh, I forgot to add Ruby related changes. I'll add them.

@kou kou force-pushed the null_count_approximate branch from 20d376f to 3983dcb Compare November 10, 2025 07:32
@github-actions github-actions bot added Component: Ruby and awaiting changes and removed awaiting merge labels Nov 10, 2025
@kou kou merged commit c10847c into apache:main Nov 10, 2025
42 of 46 checks passed
@kou kou removed the awaiting changes label Nov 10, 2025
@andishgar andishgar deleted the null_count_approximate branch November 10, 2025 13:10
@andishgar andishgar restored the null_count_approximate branch November 10, 2025 13:10
@andishgar andishgar deleted the null_count_approximate branch November 10, 2025 15:09
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit c10847c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 14 possible false positives for unstable benchmarks that are known to sometimes produce them.
