Improve memory usage for arrow-row -> String/BinaryView when utf8 validation disabled (#7917)
# Which issue does this PR close?
- Related to #6057.
# Rationale for this change
As described in the above issue, when constructing a `StringViewArray` from
rows, we currently store inline strings twice: once in the view built by
`make_view`, and again in the values buffer, so that we can validate UTF-8
for the whole buffer in one go. This wastes memory, so ideally we should
avoid placing inline strings into the values buffer when UTF-8
validation is disabled.
# What changes are included in this PR?
When UTF-8 validation is disabled, this PR modifies the string/bytes
view array construction from rows as follows:
1. The capacity of the values buffer is set to accommodate only long
strings plus 12 bytes for a single inline string placeholder.
2. All decoded strings are initially appended to the values buffer.
3. If a string turns out to be an inline string, it is included via
`make_view`, and then the corresponding inline portion is truncated from
the values buffer, ensuring the inline string does not appear twice in
the resulting array.
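
The three steps above can be sketched as follows. This is a simplified illustration, not the code from this PR: it takes already-decoded byte slices instead of decoding rows in place, and `make_view_sketch` and `build_views_unvalidated` are hypothetical helpers written for this example (the former is modelled on the Arrow view layout, in which strings of at most 12 bytes are stored inline in the 16-byte view).

```rust
/// Maximum number of bytes a view can hold inline (Arrow view layout).
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical stand-in for `make_view`: packs one string into a 16-byte view.
/// Short strings are stored inline; long strings store a 4-byte prefix,
/// a buffer index, and an offset into the values buffer.
fn make_view_sketch(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= MAX_INLINE_LEN {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Builds views and a values buffer without UTF-8 validation.
/// Assumption: inputs are already decoded; the real code decodes rows
/// directly into the values buffer, which is why it appends first.
fn build_views_unvalidated(strings: &[&[u8]]) -> (Vec<u128>, Vec<u8>) {
    // Step 1: reserve room for long strings plus one inline placeholder.
    let long_bytes: usize = strings
        .iter()
        .map(|s| s.len())
        .filter(|&n| n > MAX_INLINE_LEN)
        .sum();
    let mut values: Vec<u8> = Vec::with_capacity(long_bytes + MAX_INLINE_LEN);
    let mut views = Vec::with_capacity(strings.len());

    for s in strings {
        let start = values.len();
        // Step 2: every string is first appended to the values buffer.
        values.extend_from_slice(s);
        // Step 3: build the view, then drop the bytes again if they are inline.
        let view = make_view_sketch(&values[start..], 0, start as u32);
        if s.len() <= MAX_INLINE_LEN {
            values.truncate(start);
        }
        views.push(view);
    }
    (views, values)
}
```

Calling `build_views_unvalidated` with a mix of short and long inputs should leave only the long strings in the returned values buffer, while each short string lives solely in its view. Because at most one inline string is present in the buffer at any time, the extra 12 bytes of capacity are enough to avoid reallocation.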
# Are these changes tested?
1. Copied and modified the existing `fuzz_test` so that it also runs with
UTF-8 validation disabled.
2. Ran the benchmarks and added a bench case for arrays that contain both
inline strings and long strings.
# Are there any user-facing changes?
No.
# Considered alternatives
One idea was to support separate buffers for inline strings even when
UTF-8 validation is enabled. However, since we need to call
`decoded_len()` first to determine the target buffer, this approach can
be tricky or inefficient:
- For example, precomputing a boolean flag per string to determine which
buffer to use would increase temporary memory usage.
- Alternatively, appending to the values buffer first and then moving
inline strings to a separate buffer would lead to frequent memcpy
overhead.
Given that DataFusion disables UTF-8 validation when using `RowConverter`,
this PR focuses on improving memory efficiency specifically when
validation is turned off.
---------
Co-authored-by: Andrew Lamb <[email protected]>