@GrigoryEvko
Add Mask::count() method

Motivation

The Mask API currently provides boolean queries (any(), all()) and index queries (first_set()), but lacks a method to count the number of true elements. This forces users to either convert to arrays and iterate, or manually use to_bitmask().count_ones(), which exposes implementation details.

Current workarounds:

// Option 1: Verbose, requires knowing bitmask representation
let count = mask.to_bitmask().count_ones() as usize;

// Option 2: Inefficient, materializes a [bool; N] array and iterates over it
let count = mask.to_array().iter().filter(|&&x| x).count();

Proposed:

let count = mask.count();
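
The two workarounds and the proposed method should all return the same number. The sketch below checks that on stable Rust by modeling a mask as a plain `[bool; N]`. This is an assumption made so the snippet compiles without the nightly `portable_simd` feature; `to_bitmask_model`, `count_via_bitmask`, and `count_via_iteration` are hypothetical names, not part of `std::simd`.

```rust
/// Stand-in for `Mask::to_bitmask()`: packs lane `i` into bit `i` of a `u64`.
/// Assumes N <= 64, as is the case for real SIMD masks.
fn to_bitmask_model<const N: usize>(lanes: [bool; N]) -> u64 {
    lanes
        .iter()
        .enumerate()
        .fold(0u64, |bits, (i, &set)| bits | ((set as u64) << i))
}

/// Workaround 1: population count of the packed bitmask.
fn count_via_bitmask<const N: usize>(lanes: [bool; N]) -> usize {
    to_bitmask_model(lanes).count_ones() as usize
}

/// Workaround 2: walk the lanes and count the `true` ones.
fn count_via_iteration<const N: usize>(lanes: [bool; N]) -> usize {
    lanes.iter().filter(|&&set| set).count()
}
```

Both helpers agree for any input; the bitmask route replaces the per-lane loop with a single population count once the mask is packed.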

This pattern appears frequently in SIMD code when pre-sizing allocations to avoid reallocation overhead:

// Two-pass filtering: count matches, allocate once, then collect
let mask = values.simd_gt(threshold);
let mut results = Vec::with_capacity(mask.count());
// Walk the lanes of `values` itself so the index always stays below N
for (i, &val) in values.as_array().iter().enumerate() {
    if mask.test(i) {
        results.push(val);
    }
}

Other common use cases include histogram generation, SQL-style COUNT aggregation, and sparse data analysis.
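
As a rough illustration of the COUNT aggregation use case, the sketch below processes a slice in fixed-size chunks and adds one population count per chunk. It is a scalar model on stable Rust: `gt_bitmask` stands in for `chunk.simd_gt(threshold).to_bitmask()`, and `LANES`, `gt_bitmask`, and `count_where_gt` are hypothetical names chosen for this example.

```rust
/// Lanes per chunk; 8 mirrors a `mask32x8` and is chosen only for illustration.
const LANES: usize = 8;

/// Scalar model of a greater-than comparison followed by bitmask extraction:
/// bit `i` is set when element `i` of the chunk exceeds `threshold`.
fn gt_bitmask(chunk: &[i32], threshold: i32) -> u64 {
    chunk
        .iter()
        .enumerate()
        .fold(0u64, |bits, (i, &v)| bits | (((v > threshold) as u64) << i))
}

/// SQL-style `COUNT(*) WHERE value > threshold`, one popcount per chunk.
/// A short final chunk simply contributes fewer bits.
fn count_where_gt(data: &[i32], threshold: i32) -> usize {
    data.chunks(LANES)
        .map(|chunk| gt_bitmask(chunk, threshold).count_ones() as usize)
        .sum()
}
```

With a real `Mask`, the body of the `map` closure would presumably reduce to a single `mask.count()` call.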

API Design

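impl<T, const N: usize> Mask<T, N>
where
    T: MaskElement,
    LaneCount<N>: SupportedLaneCount,
{
    /// Returns the number of `true` elements in the mask.
    #[inline]
    #[must_use]
    pub fn count(self) -> usize {
        // `count_ones()` returns `u32`; widen it to `usize` for indexing and sizing
        self.to_bitmask().count_ones() as usize
    }
}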

Design decisions:

  1. Returns usize - Consistent with Iterator::count() and suitable for array indexing
  2. Named count() not len() - len() implies container size; count() matches the semantic operation (counting true values)
  3. Simple #[must_use] attribute - Follows Vec::len() and slice::len() precedent (no message)
  4. Not const - to_bitmask() uses intrinsics that cannot be const-evaluated
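
To make decisions 1 and 4 concrete, here is a small sketch on stable Rust. It assumes, as stated above, that `to_bitmask()` is the step that blocks const evaluation; the population count itself (`u64::count_ones`) is already a `const fn`. `count_from_bitmask`, `SET_LANES`, and `BUF` are hypothetical names used only for illustration.

```rust
/// `u64::count_ones` is a `const fn`, so the popcount half of the proposed
/// body can run at compile time once a bitmask is available.
const SET_LANES: u32 = 0b1011_0001u64.count_ones();

/// Hypothetical const-friendly helper that takes an already-extracted bitmask.
const fn count_from_bitmask(bits: u64) -> usize {
    bits.count_ones() as usize
}

/// The `usize` result can size an array directly, with no further cast.
const BUF: [u8; count_from_bitmask(0b1111)] = [0; count_from_bitmask(0b1111)];
```

If `to_bitmask()` became const-evaluable later, `count()` could likely be made `const` without changing its signature.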

Implementation

The implementation delegates to to_bitmask().count_ones(), which already uses LLVM's llvm.ctpop intrinsic. This compiles to efficient platform-specific instructions:

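  • x86/x86_64: POPCNT (its own CPUID feature flag, introduced alongside SSE4.2)
  • ARM/AArch64: CNT (NEON), which counts bits per byte and is followed by a horizontal add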
  • RISC-V: CPOP (Zbb extension)
  • WebAssembly: i64.popcnt

No platform-specific code is required; LLVM handles optimization for each target.
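
When the build target has no population-count instruction enabled, the same operation is lowered to a short bit-manipulation sequence instead. The sketch below shows one classic branch-free formulation (often called SWAR popcount) to illustrate what is being computed. It is not claimed to be the exact sequence LLVM emits, and `popcount_swar` is a hypothetical name.

```rust
/// Branch-free population count of a 64-bit value using parallel partial sums.
fn popcount_swar(mut x: u64) -> u32 {
    // Sum adjacent bits into 2-bit fields.
    x = x - ((x >> 1) & 0x5555_5555_5555_5555);
    // Sum adjacent 2-bit fields into 4-bit fields.
    x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    // Sum adjacent 4-bit fields into bytes.
    x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    // Add all eight bytes together; the total lands in the top byte.
    (x.wrapping_mul(0x0101_0101_0101_0101) >> 56) as u32
}
```

Like the hardware instruction, this version performs the same work for every input, so its cost does not depend on how many bits are set.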

Performance

Benchmarked on x86_64 (Intel Core i7-14700HX, -C target-cpu=native):

| Mask size | count() | Manual iteration | Speedup |
|-----------|---------|------------------|---------|
| mask32x4  | 0.36 ns | 0.52 ns          | 44%     |
| mask32x8  | 0.45 ns | 0.76 ns          | 69%     |
| mask32x16 | 1.04 ns | 1.15 ns          | 11%     |
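
The Speedup column is not defined explicitly. The figures are consistent with reading it as manual-iteration time divided by `count()` time, minus one, rounded to a whole percent. The sketch below reproduces the column under that assumption; `speedup_percent` is a hypothetical helper.

```rust
/// Relative speedup in percent, assuming the column is defined as
/// (manual_time / count_time - 1) * 100, rounded to the nearest integer.
fn speedup_percent(count_ns: f64, manual_ns: f64) -> i64 {
    ((manual_ns / count_ns - 1.0) * 100.0).round() as i64
}
```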

Assembly verification shows the expected codegen (x86_64):

vmovmskps  eax, ymm0    ; Extract mask to integer
popcnt     eax, eax     ; Population count
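
For readers less familiar with these two instructions, here is a scalar model of what the sequence computes for an 8-lane, 32-bit mask. It assumes the usual vector-mask encoding in which a true lane is all ones (so its sign bit is set) and a false lane is zero; `movmsk_model` and `count_true_lanes` are hypothetical names.

```rust
/// Scalar model of `vmovmskps`: gathers the sign bit of each 32-bit lane
/// into the low bits of an integer (lane `i` becomes bit `i`).
fn movmsk_model(lanes: [i32; 8]) -> u32 {
    lanes
        .iter()
        .enumerate()
        .fold(0u32, |bits, (i, &lane)| bits | (((lane < 0) as u32) << i))
}

/// Models the two-instruction sequence: extract the mask, then popcount it.
fn count_true_lanes(lanes: [i32; 8]) -> u32 {
    movmsk_model(lanes).count_ones()
}
```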

The operation is branch-free and density-independent: mask16 measured 1.03-1.05 ns at every density tested (0%, 25%, 50%, 75%, 100%), which is consistent with constant-time behavior regardless of how many elements are true.

GrigoryEvko and others added 2 commits November 15, 2025 23:35
Implements a simple, efficient method to count the number of `true`
elements in a SIMD mask. This is a common operation needed for:
- Pre-sizing allocations before filtering
- SQL-style COUNT(WHERE ...) operations
- Histogram generation
- Sparse data statistics

Implementation delegates to `to_bitmask().count_ones()`, which compiles
to a single POPCNT instruction on x86_64 and equivalent efficient
instructions on other platforms (CNT on ARM, CPOP on RISC-V, i64.popcnt
on WASM).

Performance: ~0.7ns per operation, O(1) regardless of bit density.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

The feature `stdarch_x86_avx512` has been stable since Rust 1.89.0
and no longer requires a feature gate.