- Document chunking for LLM processing and RAG applications
- **API Endpoints**:
  - Synchronous single document conversion
  - Synchronous batch document conversion
  - Asynchronous single document conversion with job tracking
  - Asynchronous batch conversion with job tracking
  - Document chunking for completed conversion jobs
- **Processing Modes**:
  - CPU-only processing for standard deployments
```bash
curl -X POST "http://localhost:8080/batch-conversion-jobs" \
  -F "documents=@/path/to/document1.pdf" \
  -F "documents=@/path/to/document2.pdf"
```
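If you are scripting against the API, the same call can be made from Python. This is a minimal sketch using `requests`; the endpoint URL and the repeated `documents` form field mirror the curl example above, while the assumption that the response is JSON containing a job id follows from the job-tracking endpoints listed earlier:

```python
# Minimal sketch: submit two PDFs to the batch conversion endpoint with requests.
import requests

with open("/path/to/document1.pdf", "rb") as f1, open("/path/to/document2.pdf", "rb") as f2:
    response = requests.post(
        "http://localhost:8080/batch-conversion-jobs",
        # Repeat the "documents" field once per file, as in the curl example.
        files=[("documents", f1), ("documents", f2)],
    )
response.raise_for_status()
print(response.json())  # assumption: JSON payload includes a job id for tracking
```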
### Document Chunking

After converting documents, you can generate text chunks optimized for LLM processing:

1. Chunk a single converted document:

```bash
curl -X GET "http://localhost:8080/conversion-jobs/{job_id}/chunks?max_tokens=512&merge_peers=true&include_page_numbers=true" \
  -H "accept: application/json"
```

2. Chunk all documents from a batch conversion:

```bash
curl -X GET "http://localhost:8080/batch-conversion-jobs/{job_id}/chunks?max_tokens=512&merge_peers=true&include_page_numbers=true" \
  -H "accept: application/json"
```

3. Chunk text directly (without requiring a conversion job):

```bash
curl -X POST "http://localhost:8080/text/chunk" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is the text content that needs to be chunked. It can be as long as needed.",
    "filename": "example.txt",
    "max_tokens": 512,
    "merge_peers": true,
    "include_page_numbers": false
  }'
```

Chunking parameters:

- `max_tokens`: Maximum number of tokens per chunk (range: 64-2048, default: 512)
- `merge_peers`: Whether to merge undersized peer chunks (default: true)
- `include_page_numbers`: Whether to include page number references in chunk metadata (default: false)
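
For programmatic use, these map directly onto query parameters of the chunk endpoints. A minimal sketch with `requests`; the URL shape mirrors the curl examples above, and the job id is a hypothetical placeholder:

```python
# Minimal sketch: fetch chunks for a completed conversion job.
import requests

JOB_ID = "your-job-id"  # hypothetical placeholder
resp = requests.get(
    f"http://localhost:8080/conversion-jobs/{JOB_ID}/chunks",
    params={
        "max_tokens": 512,              # 64-2048, default 512
        "merge_peers": "true",          # merge undersized peer chunks
        "include_page_numbers": "true", # add page numbers to chunk metadata
    },
    headers={"accept": "application/json"},
)
resp.raise_for_status()
for chunk in resp.json()["chunks"]:  # response shape shown below
    print(chunk["metadata"]["token_count"], chunk["text"][:60])
```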

#### Chunking Implementation

The API uses the Semantic Double-Pass Merging (SDPM) algorithm from the Chonkie library to produce high-quality chunks with improved context preservation. This chunker:

1. Groups content by semantic similarity
2. Merges similar groups within a skip window
3. Connects related content that may not be consecutive in the text
4. Preserves contextual relationships between different parts of the document

The chunker is particularly effective for documents with recurring themes or concepts spread throughout the text.
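
For experimentation outside the API, the same double-pass behaviour can be reproduced with Chonkie directly. A minimal sketch using Chonkie's `SDPMChunker`; the embedding model and threshold shown here are illustrative assumptions, not necessarily the settings this API uses:

```python
# Minimal sketch of semantic double-pass merging with Chonkie.
# chunk_size plays the role of the API's max_tokens; skip_window drives
# the second pass that merges similar but non-adjacent groups.
from chonkie import SDPMChunker

chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # assumed example model
    chunk_size=512,
    threshold=0.5,    # pass 1: group sentences whose similarity exceeds this
    skip_window=1,    # pass 2: merge similar groups up to one group apart
)

for chunk in chunker.chunk("Long document text with recurring themes ..."):
    print(chunk.token_count, chunk.text[:60])
```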

The response includes (`error` carries a message if chunking failed, otherwise `null`):

```json
{
  "job_id": "the-job-id",
  "filename": "document-name",
  "chunks": [
    {
      "text": "Plain text content of the chunk without additional context",
      "metadata": {
        "token_count": 123,
        "start_index": 0,
        "end_index": 512,
        "sentence_count": 5,
        "page_number": 1
      }
    }
  ],
  "error": null
}
```
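
When feeding these chunks into a RAG pipeline, the response maps naturally onto (text, metadata) records for an embedding index. A minimal sketch based on the shape shown above; the added `source` key is an illustrative choice, not part of the API response:

```python
# Minimal sketch: flatten a chunking response into (text, metadata) records
# ready for embedding and indexing.
def to_records(response: dict) -> list[tuple[str, dict]]:
    if response.get("error"):
        raise RuntimeError(f"Chunking failed: {response['error']}")
    return [
        # Carry chunk metadata through and tag each record with its source file.
        (chunk["text"], {**chunk["metadata"], "source": response["filename"]})
        for chunk in response["chunks"]
    ]
```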
## Configuration Options

- `image_resolution_scale`: Control the resolution of extracted images (1-4)