A powerful Rust library for extracting structured content from PDF files with precise positioning data and intelligent chunking for RAG (Retrieval-Augmented Generation) applications.
- Advanced Text Extraction - Extract text with precise positioning and font information
- Intelligent Chunking - Token-aware splitting optimized for LLM consumption
- Header/Footer Detection - Automatic identification and filtering of repetitive content
- Multi-page Support - Track content spanning multiple pages with detailed fragments
- OCR Integration - Built-in OCR support for scanned documents
- Structured Schema - Modern schema with compressed metadata for efficient storage
- Search Highlighting - Precise bounding boxes and quads for visual highlighting
- Production Ready - Optimized for high-throughput document processing
Add this to your `Cargo.toml`:
```toml
[dependencies]
pdf-extract = "0.7.7"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
```
```rust
use pdf_extract::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Extract content from a PDF file
    let results = parse_pdf("document.pdf", 1, "file", None, None, None, Some(500))?;

    for result in results {
        println!("Content: {}", result.content_core.content);
        println!("Tokens: {}", result.content_core.token_count);

        // Extract PDF-specific location data
        let pdf_location = extract_pdf_location(&result.content_ext)?;
        println!("Pages: {} to {}",
            pdf_location.page_range.start,
            pdf_location.page_range.end);
    }

    Ok(())
}
```
```rust
use pdf_extract::*;

let results = parse_pdf(
    "scanned_document.pdf",
    1,          // source_id
    "file",     // source_type
    Some(true), // enable OCR
    None,       // detection model (uses default)
    None,       // recognition model (uses default)
    Some(1000), // max tokens per chunk
)?;
```
The library returns structured data in two main components:
The primary content structure:
```rust
pub struct ContentCore {
    pub chunk_id: String,              // blake3 hash of content
    pub source_id: i64,                // your source identifier
    pub source_type: String,           // "file" | "web" | "api"
    pub content: String,               // extracted text
    pub token_count: i32,              // estimated token count
    pub headings_json: Option<String>, // hierarchical headings
    pub status: String,                // extraction status
    pub schema_version: i32,           // for future compatibility
    pub created_at: i64,               // unix timestamp
}
```
Compressed metadata with positioning information:
```rust
pub struct ContentExt {
    pub chunk_id: String,
    pub ext_json: Vec<u8>, // zstd-compressed metadata
}
```
The compressed metadata includes:
- PDF Location Data - Page ranges, character positions, bounding boxes
- Fragment Information - Per-page positioning with quads for highlighting
- Text Flow - Reading order and layout type detection
Extract exact locations for search highlighting:
```rust
let pdf_location = extract_pdf_location(&result.content_ext)?;

for fragment in pdf_location.fragments {
    println!("Page {}: chars {}-{}",
        fragment.page,
        fragment.char_range.start,
        fragment.char_range.end);

    // Use quads for precise highlighting
    for quad in fragment.quads {
        println!("Highlight area: ({},{}) to ({},{})",
            quad.x1, quad.y1, quad.x3, quad.y3);
    }
}
```
Intelligent splitting respects token limits:
```rust
// Chunks are split so that each stays under 500 tokens
let results = parse_pdf("large_document.pdf", 1, "file", None, None, None, Some(500))?;

for result in results {
    assert!(result.content_core.token_count <= 500);
}
```
Access document structure:
```rust
if let Some(headings_json) = &result.content_core.headings_json {
    let headings: Vec<String> = serde_json::from_str(headings_json)?;
    println!("Section: {}", headings.join(" > "));
}
```
Decompress and analyze metadata:
```rust
let metadata = decompress_content_ext(&result.content_ext)?;

if let Some(extraction_meta) = metadata.get("extraction_metadata") {
    if let Some(bbox) = extraction_meta.get("bbox") {
        println!("Bounding box: {:?}", bbox);
    }
}
```
- Batch Processing - Process multiple files in parallel (see the sketch after this list)
- Token Limits - Use a token limit appropriate for your use case
- Selective OCR - Enable OCR only for scanned documents
- Compression - Metadata is automatically compressed with zstd
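A minimal sketch of batch processing with rayon (not a dependency of this crate; add `rayon = "1"` yourself). The file list and source IDs are placeholders, and the error is converted to a `String` so the per-file results satisfy rayon's `Send` requirement:

```rust
use pdf_extract::*;
use rayon::prelude::*;

// Placeholder input list; substitute your own paths and source identifiers.
let files = vec!["a.pdf", "b.pdf", "c.pdf"];

// Each file is parsed independently, so the work parallelizes cleanly.
let per_file: Vec<_> = files
    .par_iter()
    .enumerate()
    .map(|(i, path)| {
        parse_pdf(path, (i + 1) as i64, "file", None, None, None, Some(500))
            // Stringify the error so the item type is Send for rayon.
            .map_err(|e| e.to_string())
    })
    .collect();
```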
```rust
// Store in your database
struct DocumentChunk {
    id: String,
    source_id: i64,
    content: String,
    token_count: i32,
    metadata: Vec<u8>, // compressed ContentExt.ext_json
    created_at: i64,
}

let chunk = DocumentChunk {
    id: result.content_core.chunk_id,
    source_id: result.content_core.source_id,
    content: result.content_core.content,
    token_count: result.content_core.token_count,
    metadata: result.content_ext.ext_json,
    created_at: result.content_core.created_at,
};
```
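One way to persist that struct, sketched with rusqlite (an assumption; this crate does not depend on it and any SQL or key-value store works). The `chunks.db` path and `document_chunks` table name are illustrative:

```rust
use rusqlite::{params, Connection};

let conn = Connection::open("chunks.db")?;

// One row per chunk; chunk_id works as the primary key because it is a
// blake3 hash of the content.
conn.execute(
    "CREATE TABLE IF NOT EXISTS document_chunks (
        id          TEXT PRIMARY KEY,
        source_id   INTEGER NOT NULL,
        content     TEXT NOT NULL,
        token_count INTEGER NOT NULL,
        metadata    BLOB NOT NULL,
        created_at  INTEGER NOT NULL
    )",
    [],
)?;

conn.execute(
    "INSERT OR REPLACE INTO document_chunks
     (id, source_id, content, token_count, metadata, created_at)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6)",
    params![
        chunk.id,
        chunk.source_id,
        chunk.content,
        chunk.token_count,
        chunk.metadata,
        chunk.created_at
    ],
)?;
```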
```rust
// Prepare for embedding
let chunks: Vec<String> = results.iter()
    .map(|r| r.content_core.content.clone())
    .collect();

// Generate embeddings and store with chunk_id as reference
```
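A sketch of the hand-off to an embedding model. `embed` is a hypothetical stand-in for whatever client you use; the important part is keeping `chunk_id` alongside each vector so search hits can be traced back to the stored chunk and, through `ContentExt`, to exact page positions:

```rust
// Hypothetical embedding call; replace with your model or API client.
fn embed(text: &str) -> Vec<f32> {
    unimplemented!("call your embedding model here")
}

// Keep chunk_id next to each vector so results map back to stored chunks.
let embedded: Vec<(String, Vec<f32>)> = results
    .iter()
    .map(|r| {
        (
            r.content_core.chunk_id.clone(),
            embed(&r.content_core.content),
        )
    })
    .collect();
```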
Functions:

- `parse_pdf()` - Main extraction function
- `extract_pdf_location()` - Extract PDF positioning data
- `decompress_content_ext()` - Decompress metadata
- `create_content_core()` - Create ContentCore structure
- `create_content_ext()` - Create ContentExt structure

Types:

- `ExtractionResult` - Combined content and metadata
- `ContentCore` - Primary content structure
- `ContentExt` - Compressed metadata
- `PdfLocation` - PDF-specific positioning
- `PageFragment` - Per-page position data
- `FormatLocation` - Multi-format location enum
```rust
let results = parse_pdf(
    "document.pdf",
    1,
    "file",
    Some(true),                                   // enable OCR
    Some("path/to/detection.onnx".to_string()),   // custom detection model
    Some("path/to/recognition.onnx".to_string()), // custom recognition model
    Some(500),
)?;
```
The max-tokens-per-chunk parameter controls how content is split:

- `Some(500)` - Split at ~500 tokens
- `Some(1000)` - Split at ~1000 tokens
- `None` - Natural paragraph boundaries
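For illustration, the same document chunked two ways using the signature shown in the quick start:

```rust
// ~1000-token chunks
let coarse = parse_pdf("document.pdf", 1, "file", None, None, None, Some(1000))?;

// No token limit: split on natural paragraph boundaries
let natural = parse_pdf("document.pdf", 1, "file", None, None, None, None)?;
```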
We welcome contributions! Please see our Integration Guide for production deployment examples.
MIT License - see LICENSE file for details.
- PDFExtract - Alternative PDF extraction
- pdfminer - Python PDF mining tool
- marker - PDF to markdown converter
- layout-parser - Document layout analysis