-
Notifications
You must be signed in to change notification settings - Fork 189
feat(ops): SplitBySeparators — regex-only splitter (Rust+Py) #1010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ops): SplitBySeparators — regex-only splitter (Rust+Py) #1010
Conversation
keep_separator: KeepSep, | ||
#[serde(default)] | ||
include_empty: bool, | ||
#[serde(default = "default_trim")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we don't have to explicitly set these default values here.
Python SDK already has the default values. It's better to keep default values in a single place (no need to worry about out of sync).
use crate::{fields_value, ops::sdk::*}; | ||
|
||
#[derive(Serialize, Deserialize, Clone, Copy, PartialEq, Eq)] | ||
#[serde(rename_all = "lowercase")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer UPPERCASE
here (or SCREAMING_SNAKE_CASE
, but no difference for this case). It's also the naming convention for enum variants in Python.
/// Mirrors SplitRecursively’s output positions. | ||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
struct OutputPosition { | ||
char_offset: usize, | ||
line: u32, | ||
column: u32, | ||
} | ||
impl OutputPosition { | ||
fn into_output(self) -> value::Value { | ||
value::Value::Struct(fields_value!( | ||
self.char_offset as i64, | ||
self.line as i64, | ||
self.column as i64 | ||
)) | ||
} | ||
} | ||
struct Position { | ||
byte_offset: usize, | ||
output: Option<OutputPosition>, | ||
} | ||
impl Position { | ||
fn new(byte_offset: usize) -> Self { | ||
Self { | ||
byte_offset, | ||
output: None, | ||
} | ||
} | ||
} | ||
|
||
/// Fill OutputPosition for the requested byte offsets. | ||
fn set_output_positions<'a>(text: &str, positions: impl Iterator<Item = &'a mut Position>) { | ||
let mut positions = positions.collect::<Vec<_>>(); | ||
positions.sort_by_key(|o| o.byte_offset); | ||
|
||
let mut positions_iter = positions.iter_mut(); | ||
let Some(mut next_position) = positions_iter.next() else { | ||
return; | ||
}; | ||
|
||
let mut char_offset = 0; | ||
let mut line = 1; | ||
let mut column = 1; | ||
for (byte_offset, ch) in text.char_indices() { | ||
while next_position.byte_offset == byte_offset { | ||
next_position.output = Some(OutputPosition { | ||
char_offset, | ||
line, | ||
column, | ||
}); | ||
if let Some(p) = positions_iter.next() { | ||
next_position = p; | ||
} else { | ||
return; | ||
} | ||
} | ||
char_offset += 1; | ||
if ch == '\n' { | ||
line += 1; | ||
column = 1; | ||
} else { | ||
column += 1; | ||
} | ||
} | ||
|
||
// Offsets after the last char. | ||
loop { | ||
next_position.output = Some(OutputPosition { | ||
char_offset, | ||
line, | ||
column, | ||
}); | ||
if let Some(p) = positions_iter.next() { | ||
next_position = p; | ||
} else { | ||
return; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great if we can extract these common part into a shared library. We can put it in src/ops/functions/shared/split.rs
.
|
||
// If no separators were provided, emit whole text once. | ||
if self.regex.is_none() { | ||
return emit_single_chunk_kv(full_text, self.spec.trim, self.spec.include_empty); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function has ~40 lines of code, but I feel we don't need it.
We can reuse the logic below. Only the way to compute ranges
are different.
if self.spec.trim { | ||
while s < e && bytes[s].is_ascii_whitespace() { | ||
s += 1; | ||
} | ||
while e > s && bytes[e - 1].is_ascii_whitespace() { | ||
e -= 1; | ||
} | ||
} | ||
if self.spec.include_empty || e > s { | ||
filtered.push((s, e)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(optional) we may consider putting this filter logic into a closure, e.g. add_range
. Then in the loop above we can directly add filtered range, e.g. add_range(prev, m.start())
. We can avoid one intermediate Vec
.
btw great PR!! Thanks a lot @MrAnayDongre |
… shared split helpers; enum UPPERCASE; defaults via Python; simplify range logic
ad048d1
to
8dd43f1
Compare
Applied feedback: enum UPPERCASE, removed Rust-side defaults, extracted shared split helpers, simplified range logic. |
enum KeepSep { | ||
NONE, | ||
LEFT, | ||
RIGHT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should still use PascalCase for enum variants in Rust. That's the convention
(and that's why rename_all
is needed to change the case for serialization)
.required()?, | ||
}; | ||
|
||
// start/end structs exactly like SplitRecursively |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hints that we should factor out the common schema construction logic between SplitRecursively
and this one into shared/split.rs
too.
struct Spec { | ||
// Python SDK provides defaults/values. | ||
separators_regex: Vec<String>, | ||
keep_separator: KeepSep, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more idiomatic if we leave only 2 variants in KeepSep
(left and right) and make this one a Option<KeepSep>
. And on Python side the type can be Literal["LEFT", "RIGHT"] | None
. In this way we just reuse the None
value in both languages as the default one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for extracting the common code here!
Can we also update split_recursively.rs
accordingly to use this library and avoid duplication?
Thanks for contributing! I'll merge it for now and take over the remaining comments. |
Adds
SplitBySeparators
(regex-only, no chunk-size planning).Rust executor + registration; Python spec; schema matches
SplitRecursively
.143 passed / 1 skipped (DB). Closes #1003. cc @georgeh0