
Commit a51531c

Merge pull request #2 from spa5k/chunking
Chunking
2 parents f8fc3fd + 55ace1c commit a51531c


7 files changed: +1083 -152 lines changed


Dockerfile

Lines changed: 28 additions & 15 deletions
@@ -8,16 +8,10 @@ RUN apt-get update && \
     apt-get install -y --no-install-recommends libgl1 libglib2.0-0 && \
     rm -rf /var/lib/apt/lists/*
 
-# Enable bytecode compilation and set proper link mode for cache mounting
-ENV UV_COMPILE_BYTECODE=1 \
-    UV_LINK_MODE=copy \
-    HF_HOME=/app/.cache/huggingface \
-    TORCH_HOME=/app/.cache/torch \
-    PYTHONPATH=/app \
-    OMP_NUM_THREADS=4
-
-# Copy dependency files and README
-COPY pyproject.toml uv.lock README.md ./
+# Copy only dependency files and create a dummy README
+COPY pyproject.toml uv.lock ./
+# Create a dummy README.md file to satisfy package requirements
+RUN echo "# Placeholder README" > README.md
 
 # Install dependencies but not the project itself
 RUN --mount=type=cache,target=/root/.cache/uv \
@@ -46,9 +40,12 @@ RUN ARCH=$(uname -m) && \
         uv pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121; \
     fi
 
-# Install the project in non-editable mode
-RUN --mount=type=cache,target=/root/.cache/uv \
-    uv sync --frozen --no-editable
+# Download models
+RUN . /app/.venv/bin/activate && \
+    mkdir -p /app/.cache && \
+    python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; artifacts_path = StandardPdfPipeline.download_models_hf(force=True);' && \
+    python -c 'import easyocr; reader = easyocr.Reader(["fr", "de", "es", "en", "it", "pt"], gpu=True); print("EasyOCR models downloaded successfully")' && \
+    python -c 'from chonkie import SDPMChunker; chunker = SDPMChunker(embedding_model="minishlab/potion-base-8M"); print("Chonkie models downloaded successfully")'
 
 # Download models for the pipeline
 RUN uv run python -c "from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; artifacts_path = StandardPdfPipeline.download_models_hf(force=True)"
@@ -62,6 +59,8 @@ RUN ARCH=$(uname -m) && \
         echo "Downloading EasyOCR models with GPU support" && \
         uv run python -c "import easyocr; reader = easyocr.Reader(['fr', 'de', 'es', 'en', 'it', 'pt'], gpu=True); print('EasyOCR GPU models downloaded successfully')"; \
     fi
+
+RUN uv run python -c 'from chonkie import SDPMChunker; chunker = SDPMChunker(embedding_model="minishlab/potion-base-8M"); print("Chonkie models downloaded successfully")'
 
 # Production stage
 FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim
@@ -72,8 +71,22 @@ RUN apt-get update && \
     apt-get install -y --no-install-recommends redis-server libgl1 libglib2.0-0 curl && \
     rm -rf /var/lib/apt/lists/*
 
-# Set environment variables
-ENV HF_HOME=/app/.cache/huggingface \
+# Copy model cache from builder - this rarely changes
+COPY --from=builder --chown=app:app /app/.cache /app/.cache/
+COPY --from=builder --chown=app:app /app/.venv /app/.venv/
+
+# Create dummy README and copy dependency files
+RUN echo "# Placeholder README" > README.md
+COPY --chown=app:app pyproject.toml uv.lock ./
+
+# Copy project files from disk
+COPY --chown=app:app document_converter/ ./document_converter/
+COPY --chown=app:app worker/ ./worker/
+COPY --chown=app:app main.py ./
+
+# Set up Python environment
+ENV PYTHONPATH=/app \
+    HF_HOME=/app/.cache/huggingface \
     TORCH_HOME=/app/.cache/torch \
     PYTHONPATH=/app \
     OMP_NUM_THREADS=4 \
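
Taken together, the builder-stage change swaps the `uv sync --no-editable` step for a build-time model warm-up, and the production stage then copies `/app/.cache` and `/app/.venv` from the builder. For readability, the three chained `python -c` calls amount to running a small script like the one below inside the builder's virtualenv (a sketch only; the commit keeps them as inline RUN commands, and the script name here is made up):

```python
# warm_model_caches.py - hypothetical name; mirrors the builder-stage RUN commands.
# Populates /app/.cache (HF_HOME / TORCH_HOME) so the production image never
# has to download models at runtime.

import easyocr
from chonkie import SDPMChunker
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# Docling layout/table models go into the Hugging Face cache.
artifacts_path = StandardPdfPipeline.download_models_hf(force=True)
print(f"Docling models downloaded to {artifacts_path}")

# EasyOCR fetches detector/recognizer weights for the supported languages.
reader = easyocr.Reader(["fr", "de", "es", "en", "it", "pt"], gpu=True)
print("EasyOCR models downloaded successfully")

# Instantiating the SDPM chunker pulls its embedding model.
chunker = SDPMChunker(embedding_model="minishlab/potion-base-8M")
print("Chonkie models downloaded successfully")
```

Because this layer depends only on the dependency files copied earlier, changes to the application code no longer appear to invalidate the model-download layers, which seems to be the point of the reorganization.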

README.md

Lines changed: 72 additions & 0 deletions
@@ -34,12 +34,14 @@
 - Image extraction and processing
 - Multi-language OCR support (French, German, Spanish, English, Italian, Portuguese etc)
 - Configurable image resolution scaling
+- Document chunking for LLM processing and RAG applications
 
 - **API Endpoints**:
   - Synchronous single document conversion
   - Synchronous batch document conversion
   - Asynchronous single document conversion with job tracking
   - Asynchronous batch conversion with job tracking
+  - Document chunking for completed conversion jobs
 
 - **Processing Modes**:
   - CPU-only processing for standard deployments
@@ -236,6 +238,76 @@ curl -X POST "http://localhost:8080/batch-conversion-jobs" \
   -F "documents=@/path/to/document2.pdf"
 ```
 
+### Document Chunking
+
+After converting documents, you can generate text chunks optimized for LLM processing:
+
+1. Chunk a single converted document:
+
+```bash
+curl -X GET "http://localhost:8080/conversion-jobs/{job_id}/chunks?max_tokens=512&merge_peers=true&include_page_numbers=true" \
+  -H "accept: application/json"
+```
+
+2. Chunk all documents from a batch conversion:
+
+```bash
+curl -X GET "http://localhost:8080/batch-conversion-jobs/{job_id}/chunks?max_tokens=512&merge_peers=true&include_page_numbers=true" \
+  -H "accept: application/json"
+```
+
+3. Chunk text directly (without requiring a conversion job):
+
+```bash
+curl -X POST "http://localhost:8080/text/chunk" \
+  -H "accept: application/json" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "This is the text content that needs to be chunked. It can be as long as needed.",
+    "filename": "example.txt",
+    "max_tokens": 512,
+    "merge_peers": true,
+    "include_page_numbers": false
+  }'
+```
+
+Chunking parameters:
+- `max_tokens`: Maximum number of tokens per chunk (range: 64-2048, default: 512)
+- `merge_peers`: Whether to merge undersized peer chunks (default: true)
+- `include_page_numbers`: Whether to include page number references in chunk metadata (default: false)
+
+#### Chunking Implementation
+
+The API uses the Semantic Double-Pass Merging (SDPM) algorithm from the Chonkie library to produce high-quality chunks with improved context preservation. This chunker:
+
+1. Groups content by semantic similarity
+2. Merges similar groups within a skip window
+3. Connects related content that may not be consecutive in the text
+4. Preserves contextual relationships between different parts of the document
+
+The chunker is particularly effective for documents with recurring themes or concepts spread throughout the text.
+
+The response includes:
+```json
+{
+  "job_id": "the-job-id",
+  "filename": "document-name",
+  "chunks": [
+    {
+      "text": "Plain text content of the chunk without additional context",
+      "metadata": {
+        "token_count": 123,
+        "start_index": 0,
+        "end_index": 512,
+        "sentence_count": 5,
+        "page_number": 1
+      }
+    }
+  ],
+  "error": null // Error message if chunking failed
+}
+```
+
 ## Configuration Options
 
 - `image_resolution_scale`: Control the resolution of extracted images (1-4)
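
The "Chunking Implementation" subsection added above describes behaviour that can also be exercised directly against the Chonkie library, using the same embedding model the Dockerfile pre-downloads. Below is a minimal sketch; the `embedding_model` value comes from this commit, while the other keyword arguments and the chunk attributes are assumptions about Chonkie's public interface rather than something the commit defines:

```python
# Standalone SDPM chunking sketch, outside the HTTP API.
from chonkie import SDPMChunker

chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # same model the image pre-downloads
    chunk_size=512,                              # roughly the API's max_tokens default (assumed kwarg)
    skip_window=1,                               # merge similar groups across one gap (assumed kwarg)
)

text = (
    "This is the text content that needs to be chunked. "
    "It can be as long as needed."
)

# Each chunk is expected to expose text and token_count, mirroring the
# metadata fields shown in the API response above.
for chunk in chunker.chunk(text):
    print(chunk.token_count, repr(chunk.text[:60]))
```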
