feat(ops): SplitBySeparators — regex-only splitter (Rust+Py) #1010

MrAnayDongre · 2025-09-17T03:52:14Z

Adds SplitBySeparators (regex-only, no chunk-size planning).
Rust executor + registration; Python spec; schema matches SplitRecursively.
143 passed / 1 skipped (DB). Closes #1003. cc @georgeh0

georgeh0 · 2025-09-17T04:08:49Z

src/ops/functions/split_by_separators.rs

+    keep_separator: KeepSep,
+    #[serde(default)]
+    include_empty: bool,
+    #[serde(default = "default_trim")]


I guess we don't have to explicitly set these default values here.

Python SDK already has the default values. It's better to keep default values in a single place (no need to worry about out of sync).

georgeh0 · 2025-09-17T04:10:33Z

src/ops/functions/split_by_separators.rs

+use crate::{fields_value, ops::sdk::*};
+
+#[derive(Serialize, Deserialize, Clone, Copy, PartialEq, Eq)]
+#[serde(rename_all = "lowercase")]


Prefer UPPERCASE here (or SCREAMING_SNAKE_CASE, but no difference for this case). It's also the naming convention for enum variants in Python.

georgeh0 · 2025-09-17T04:14:55Z

src/ops/functions/split_by_separators.rs

+/// Mirrors SplitRecursively’s output positions.
+#[derive(Debug, Clone, PartialEq, Eq)]
+struct OutputPosition {
+    char_offset: usize,
+    line: u32,
+    column: u32,
+}
+impl OutputPosition {
+    fn into_output(self) -> value::Value {
+        value::Value::Struct(fields_value!(
+            self.char_offset as i64,
+            self.line as i64,
+            self.column as i64
+        ))
+    }
+}
+struct Position {
+    byte_offset: usize,
+    output: Option<OutputPosition>,
+}
+impl Position {
+    fn new(byte_offset: usize) -> Self {
+        Self {
+            byte_offset,
+            output: None,
+        }
+    }
+}
+
+/// Fill OutputPosition for the requested byte offsets.
+fn set_output_positions<'a>(text: &str, positions: impl Iterator<Item = &'a mut Position>) {
+    let mut positions = positions.collect::<Vec<_>>();
+    positions.sort_by_key(|o| o.byte_offset);
+
+    let mut positions_iter = positions.iter_mut();
+    let Some(mut next_position) = positions_iter.next() else {
+        return;
+    };
+
+    let mut char_offset = 0;
+    let mut line = 1;
+    let mut column = 1;
+    for (byte_offset, ch) in text.char_indices() {
+        while next_position.byte_offset == byte_offset {
+            next_position.output = Some(OutputPosition {
+                char_offset,
+                line,
+                column,
+            });
+            if let Some(p) = positions_iter.next() {
+                next_position = p;
+            } else {
+                return;
+            }
+        }
+        char_offset += 1;
+        if ch == '\n' {
+            line += 1;
+            column = 1;
+        } else {
+            column += 1;
+        }
+    }
+
+    // Offsets after the last char.
+    loop {
+        next_position.output = Some(OutputPosition {
+            char_offset,
+            line,
+            column,
+        });
+        if let Some(p) = positions_iter.next() {
+            next_position = p;
+        } else {
+            return;
+        }
+    }
+}


It's great if we can extract these common part into a shared library. We can put it in src/ops/functions/shared/split.rs.

georgeh0 · 2025-09-17T05:09:46Z

src/ops/functions/split_by_separators.rs

+
+        // If no separators were provided, emit whole text once.
+        if self.regex.is_none() {
+            return emit_single_chunk_kv(full_text, self.spec.trim, self.spec.include_empty);


This function has ~40 lines of code, but I feel we don't need it.

We can reuse the logic below. Only the way to compute ranges are different.

georgeh0 · 2025-09-17T05:12:41Z

src/ops/functions/split_by_separators.rs

+            if self.spec.trim {
+                while s < e && bytes[s].is_ascii_whitespace() {
+                    s += 1;
+                }
+                while e > s && bytes[e - 1].is_ascii_whitespace() {
+                    e -= 1;
+                }
+            }
+            if self.spec.include_empty || e > s {
+                filtered.push((s, e));
+            }


(optional) we may consider putting this filter logic into a closure, e.g. add_range. Then in the loop above we can directly add filtered range, e.g. add_range(prev, m.start()). We can avoid one intermediate Vec.

badmonster0 · 2025-09-17T05:44:00Z

btw great PR!! Thanks a lot @MrAnayDongre

… shared split helpers; enum UPPERCASE; defaults via Python; simplify range logic

MrAnayDongre · 2025-09-17T17:36:38Z

Applied feedback: enum UPPERCASE, removed Rust-side defaults, extracted shared split helpers, simplified range logic.
Re-tested: 143/1 + cargo tests green

georgeh0 · 2025-09-17T17:41:53Z

src/ops/functions/split_by_separators.rs

+enum KeepSep {
+    NONE,
+    LEFT,
+    RIGHT,


We should still use PascalCase for enum variants in Rust. That's the convention

(and that's why rename_all is needed to change the case for serialization)

georgeh0 · 2025-09-17T17:50:10Z

src/ops/functions/split_by_separators.rs

+                .required()?,
+        };
+
+        // start/end structs exactly like SplitRecursively


This hints that we should factor out the common schema construction logic between SplitRecursively and this one into shared/split.rs too.

georgeh0 · 2025-09-17T17:55:07Z

src/ops/functions/split_by_separators.rs

+struct Spec {
+    // Python SDK provides defaults/values.
+    separators_regex: Vec<String>,
+    keep_separator: KeepSep,


I think it's more idiomatic if we leave only 2 variants in KeepSep (left and right) and make this one a Option<KeepSep>. And on Python side the type can be Literal["LEFT", "RIGHT"] | None. In this way we just reuse the None value in both languages as the default one.

georgeh0 · 2025-09-17T17:57:01Z

src/ops/shared/split.rs

Thanks for extracting the common code here!

Can we also update split_recursively.rs accordingly to use this library and avoid duplication?

georgeh0 · 2025-09-27T06:18:41Z

Thanks for contributing! I'll merge it for now and take over the remaining comments.

georgeh0 reviewed Sep 17, 2025

View reviewed changes

feat(ops): SplitBySeparators — regex-only splitter (Rust+Py); extract…

8dd43f1

… shared split helpers; enum UPPERCASE; defaults via Python; simplify range logic

MrAnayDongre force-pushed the feat/split-by-separators branch from ad048d1 to 8dd43f1 Compare September 17, 2025 17:32

georgeh0 reviewed Sep 17, 2025

View reviewed changes

georgeh0 merged commit 0353098 into cocoindex-io:main Sep 27, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(ops): SplitBySeparators — regex-only splitter (Rust+Py) #1010

feat(ops): SplitBySeparators — regex-only splitter (Rust+Py) #1010

Uh oh!

MrAnayDongre commented Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

badmonster0 commented Sep 17, 2025

Uh oh!

MrAnayDongre commented Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

georgeh0 Sep 17, 2025

Uh oh!

georgeh0 commented Sep 27, 2025

Uh oh!

Uh oh!

Uh oh!

feat(ops): SplitBySeparators — regex-only splitter (Rust+Py) #1010

feat(ops): SplitBySeparators — regex-only splitter (Rust+Py) #1010

Uh oh!

Conversation

MrAnayDongre commented Sep 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

badmonster0 commented Sep 17, 2025

Uh oh!

MrAnayDongre commented Sep 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

georgeh0 commented Sep 27, 2025

Uh oh!

Uh oh!

Uh oh!