Skip to content

Conversation

MrAnayDongre
Copy link
Contributor

Adds SplitBySeparators (regex-only, no chunk-size planning).
Rust executor + registration; Python spec; schema matches SplitRecursively.
143 passed / 1 skipped (DB). Closes #1003. cc @georgeh0

keep_separator: KeepSep,
#[serde(default)]
include_empty: bool,
#[serde(default = "default_trim")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we don't have to explicitly set these default values here.

Python SDK already has the default values. It's better to keep default values in a single place (no need to worry about out of sync).

use crate::{fields_value, ops::sdk::*};

#[derive(Serialize, Deserialize, Clone, Copy, PartialEq, Eq)]
#[serde(rename_all = "lowercase")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer UPPERCASE here (or SCREAMING_SNAKE_CASE, but no difference for this case). It's also the naming convention for enum variants in Python.

Comment on lines 40 to 117
/// Mirrors SplitRecursively’s output positions.
#[derive(Debug, Clone, PartialEq, Eq)]
struct OutputPosition {
char_offset: usize,
line: u32,
column: u32,
}
impl OutputPosition {
fn into_output(self) -> value::Value {
value::Value::Struct(fields_value!(
self.char_offset as i64,
self.line as i64,
self.column as i64
))
}
}
struct Position {
byte_offset: usize,
output: Option<OutputPosition>,
}
impl Position {
fn new(byte_offset: usize) -> Self {
Self {
byte_offset,
output: None,
}
}
}

/// Fill OutputPosition for the requested byte offsets.
fn set_output_positions<'a>(text: &str, positions: impl Iterator<Item = &'a mut Position>) {
let mut positions = positions.collect::<Vec<_>>();
positions.sort_by_key(|o| o.byte_offset);

let mut positions_iter = positions.iter_mut();
let Some(mut next_position) = positions_iter.next() else {
return;
};

let mut char_offset = 0;
let mut line = 1;
let mut column = 1;
for (byte_offset, ch) in text.char_indices() {
while next_position.byte_offset == byte_offset {
next_position.output = Some(OutputPosition {
char_offset,
line,
column,
});
if let Some(p) = positions_iter.next() {
next_position = p;
} else {
return;
}
}
char_offset += 1;
if ch == '\n' {
line += 1;
column = 1;
} else {
column += 1;
}
}

// Offsets after the last char.
loop {
next_position.output = Some(OutputPosition {
char_offset,
line,
column,
});
if let Some(p) = positions_iter.next() {
next_position = p;
} else {
return;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great if we can extract these common part into a shared library. We can put it in src/ops/functions/shared/split.rs.


// If no separators were provided, emit whole text once.
if self.regex.is_none() {
return emit_single_chunk_kv(full_text, self.spec.trim, self.spec.include_empty);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has ~40 lines of code, but I feel we don't need it.

We can reuse the logic below. Only the way to compute ranges are different.

Comment on lines 182 to 86
if self.spec.trim {
while s < e && bytes[s].is_ascii_whitespace() {
s += 1;
}
while e > s && bytes[e - 1].is_ascii_whitespace() {
e -= 1;
}
}
if self.spec.include_empty || e > s {
filtered.push((s, e));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(optional) we may consider putting this filter logic into a closure, e.g. add_range. Then in the loop above we can directly add filtered range, e.g. add_range(prev, m.start()). We can avoid one intermediate Vec.

@badmonster0
Copy link
Member

btw great PR!! Thanks a lot @MrAnayDongre

… shared split helpers; enum UPPERCASE; defaults via Python; simplify range logic
@MrAnayDongre MrAnayDongre force-pushed the feat/split-by-separators branch from ad048d1 to 8dd43f1 Compare September 17, 2025 17:32
@MrAnayDongre
Copy link
Contributor Author

Applied feedback: enum UPPERCASE, removed Rust-side defaults, extracted shared split helpers, simplified range logic.
Re-tested: 143/1 + cargo tests green

enum KeepSep {
NONE,
LEFT,
RIGHT,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still use PascalCase for enum variants in Rust. That's the convention

(and that's why rename_all is needed to change the case for serialization)

.required()?,
};

// start/end structs exactly like SplitRecursively
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hints that we should factor out the common schema construction logic between SplitRecursively and this one into shared/split.rs too.

struct Spec {
// Python SDK provides defaults/values.
separators_regex: Vec<String>,
keep_separator: KeepSep,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more idiomatic if we leave only 2 variants in KeepSep (left and right) and make this one a Option<KeepSep>. And on Python side the type can be Literal["LEFT", "RIGHT"] | None. In this way we just reuse the None value in both languages as the default one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for extracting the common code here!

Can we also update split_recursively.rs accordingly to use this library and avoid duplication?

@georgeh0
Copy link
Member

Thanks for contributing! I'll merge it for now and take over the remaining comments.

@georgeh0 georgeh0 merged commit 0353098 into cocoindex-io:main Sep 27, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Create a simple SplitBySeparators function (don't account chunk size)
3 participants