Skip to content

Conversation

@jhorstmann
Copy link
Contributor

Another experiement for speeding up ArrayData. Most array types only contain a single buffer (except for the validity buffer) and also at most one child data object. I think the only types this does not apply to are struct and union arrays. Using a SmallVec trades one comparison against an allocation and the resulting indirection in those cases.

The ArrayData::new method is left in place unchanged for compatibility.

TODO:

  • Use the optimized new_smallvec method in kernels whenever applicable
  • Run all benchmarks on a non-throttling machine

@jorgecarleitao @Dandandan this is also related to the discussion on #9271. I tried this initially some month ago and the benchmarks were not conclusive, but might be worth trying again or in combination with removing the Arc indirection.

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

/// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
/// at different positions.
buffers: Vec<Buffer>,
buffers: SmallVec<[Buffer; 1]>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if using something like a specific enum might also work (and be more memory efficient) as well:

pub enum  ArrayBuffer {
  One(ArrayDataRef),
  Two(ArrayDataRef, ArrayDataRef),
  Many(Vec<ArrayDataRef>)
}

I honestly don't know what the maximum number of buffers any specific ArrayData can have,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi, max is 3 including the null buffer, so max is 2 in this context. See MutableDataArray, where we use two buffers on the stack to avoid the overhead. When unused, a buffer takes something like a usize + a pointer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I haven't checked the memory layout of both, but in the common case of 1 buffer this could inline better (in the absence of LTO). Since we already have the validity bitmap in a separate Option, the variants are Zero, One and Many.

@alamb
Copy link
Contributor

alamb commented Mar 3, 2021

@jhorstmann I am closing this PR for the time being to clean up the Rust/Arrow PR backlog. Please let me know if this is a mistake

@alamb alamb closed this Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants