A comprehensive evaluation framework comparing two document question-answering pipelines on 5,349 real-world questions from the DocVQA dataset:
- Textract OCR → GPT-5 Text LLM: AWS Textract extracts text, then GPT-5 answers questions
- GPT-5 Vision (VLM): Direct visual analysis without OCR
Winner: VLM (by accuracy) | Textract (by cost/speed)
- VLM: 90.41% EM, 92.30% F1 | $24, 13.10s/question
- Textract: 85.66% EM, 88.20% F1 | $16, 5.58s/question
VLM is about 4 points more accurate (EM/F1) but 1.5x more expensive and 2.3x slower. Both pipelines agree on 84% of questions. For production, a hybrid approach (Textract by default, VLM for low-confidence cases) offers the best balance.
✅ Complete - Full Evaluation Finished
Both pipelines have been fully evaluated on the complete validation split:
- Dataset: `lmms-lab/DocVQA` validation split (5,349 questions with proper ground truth)
- Textract Pipeline: ✅ Complete (5,349/5,349 questions)
- VLM Pipeline: ✅ Complete (5,349/5,349 questions)
- Evaluation: ✅ Complete with comprehensive metrics
| Metric | Textract (OCR → Text LLM) | VLM (Direct Vision) | Winner |
|---|---|---|---|
| Exact Match (EM) | 85.66% | 90.41% | VLM (+4.75%) |
| F1 Score | 88.20% | 92.30% | VLM (+4.09%) |
| Avg Tokens/Question | 772 | 1,199 | Textract (1.6x fewer) |
| Avg Latency | 5.58s | 13.10s | Textract (2.3x faster) |
| Total Cost (5,349 questions) | ~$16 | ~$24 | Textract (33% cheaper) |
Key Findings:
- 🏆 VLM outperforms Textract by 4.09 F1 points overall
- 💰 Cost-Accuracy Trade-off: VLM provides better accuracy but costs 1.5x more and takes 2.3x longer
- 🤝 High Agreement: Both pipelines produce identical results on 84.1% of questions
- 📊 When They Differ: VLM wins 10.8% of questions, Textract wins 5.1%
Image → AWS Textract (DetectDocumentText) → OCR Text → GPT-5 → Answer
Image → GPT-5 Vision (base64 encoded) → Answer
scripts/01_prepare_dataset.py
↓ (Image & Question Manifests)
├─→ scripts/02_run_textract.py
│ ↓ (OCR Results JSONL)
│ scripts/05_answer_questions_textract.py
│ ↓ (Answers JSONL)
└─→ scripts/05_answer_questions_vlm.py
↓ (Answers JSONL)
scripts/06_evaluate_qa.py → Evaluation Report (JSON)
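For orientation, here is a minimal sketch of the two core API calls behind these flows; prompt wording, parameters, and helper names are illustrative assumptions, not copied from the scripts:

```python
import base64

import boto3
from openai import OpenAI

def textract_ocr(image_path: str) -> str:
    """OCR one page with AWS Textract DetectDocumentText and return plain text."""
    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        resp = client.detect_document_text(Document={"Bytes": f.read()})
    # Keep LINE blocks only; WORD blocks repeat the same text.
    return "\n".join(b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE")

def answer_from_image(image_path: str, question: str, model: str = "gpt-5") -> str:
    """Answer a question directly from the page image with a vision-capable model."""
    client = OpenAI()
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Answer concisely: {question}"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

The Textract pipeline feeds the extracted text plus the question into the text model; the VLM pipeline skips OCR entirely and sends the image itself.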
# Clone the repository
git clone https://github.com/longhoag/docqa-ocr-vlm-v2.git
cd docqa-ocr-vlm-v2
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -
# Install dependencies
make install
# Configure environment variables
cp .env.example .env
# Edit .env and add:
# OPENAI_API_KEY=your_openai_api_key
# AWS_ACCESS_KEY_ID=your_aws_key (for Textract)
# AWS_SECRET_ACCESS_KEY=your_aws_secret (for Textract)

Step 1: Prepare Dataset (One-Time Setup)
# Downloads DocVQA validation split, extracts images, creates manifests
make prepare
# Output: 5,349 images + image/question manifests

Step 2a: Textract Pipeline
# Test on 5 samples first (recommended)
make textract-sample # ~$0.01, 30 seconds
make answer-textract-sample # ~$0.10, 1 minute
# Run full validation split
make textract-full # ~$8, 2-3 hours (5,349 images)
make answer-textract-full   # ~$8, 6-8 hours (5,349 questions)

Step 2b: VLM Pipeline
# Test on 5 samples first (recommended)
make vlm-sample # ~$0.10, 1 minute
# Run full validation split
make vlm-full   # ~$24, 18-24 hours (5,349 questions)

Step 3: Evaluate & Compare
# Generate comprehensive evaluation report
make evaluate
# Output: outputs/evaluation/validation_evaluation_report.json

# View evaluation summary
python -c "
import json
with open('outputs/evaluation/validation_evaluation_report.json') as f:
    data = json.load(f)
om = data['overall_metrics']
print(f\"Textract - EM: {om['textract']['avg_em']:.2%}, F1: {om['textract']['avg_f1']:.2%}\")
print(f\"VLM - EM: {om['vlm']['avg_em']:.2%}, F1: {om['vlm']['avg_f1']:.2%}\")
"Run make help to see all available commands, or use these directly:
Setup & Preparation:
- `make install` - Install all dependencies via Poetry
- `make prepare` - Download dataset, extract images, create manifests (one-time)
Textract Pipeline:
- `make textract-sample` - Process 5 validation images with Textract (test first!)
- `make textract-full` - Process all 5,349 images (~$8, 2-3 hours)
- `make answer-textract-sample` - Answer 5 questions using Textract + GPT-5
- `make answer-textract-full` - Answer all 5,349 questions (~$8, 6-8 hours)
VLM Pipeline:
- `make vlm-sample` - Process 5 questions with GPT-5 Vision (test first!)
- `make vlm-full` - Process all 5,349 questions (~$24, 18-24 hours)
Evaluation:
- `make evaluate` - Generate comprehensive comparison report
Cleanup:
- `make clean` - Remove all generated data and outputs
config/ # Centralized configuration
config.py # Settings, paths, API keys (from .env)
scripts/ # Pipeline scripts
01_prepare_dataset.py # Extract dataset → manifests + images
02_run_textract.py # AWS Textract OCR
05_answer_questions_textract.py # Textract + GPT-5 text model
05_answer_questions_vlm.py # GPT-5 Vision model
06_evaluate_qa.py # Compare pipelines, generate report
utils/ # Helper modules
logger.py # Loguru setup
cost_tracker.py # Cost/token tracking
manifest.py # JSONL read/write
data/
raw/images/validation/ # Extracted images (5,349 PNGs, git-ignored)
processed/manifests/ # Collaboration manifests (git-tracked)
image/validation.jsonl # Image paths + metadata
question/validation.jsonl # Questions + ground truth
outputs/ # Pipeline outputs (JSONL format)
textract/validation/
validation_textract_outputs.jsonl # All OCR results (7.7MB, git-tracked)
answers/textract/validation/
validation_textract_answers.jsonl # All answers (git-tracked)
answers/vlm/validation/
validation_vlm_answers.jsonl # All answers (git-tracked)
evaluation/
validation_evaluation_report.json # Metrics report (git-tracked)
All pipeline outputs use JSONL (JSON Lines) for efficient streaming and resumability:
{"questionId": "12345", "text": "extracted text...", "blocks": [...], "runtime": 1.23, "cost": 0.0015}{
"questionId": "12345",
"question": "What is the total?",
"answer": "$1,234.56",
"confidence": 0.95,
"ground_truth_answers": ["$1,234.56", "1234.56"],
"question_types": ["numeric", "extraction"],
"tokens": {"input": 247, "output": 156, "total": 403},
"latency_seconds": 2.45,
"pipeline": "textract",
"model": "gpt-5"
}

- Overall metrics: EM, F1, tokens, latency per pipeline
- Metrics by question type: breakdown by handwritten, form, figure/diagram, layout, others
- Consistent wins: question types where one pipeline dominates (>70% win rate)
- Example cases: significant wins, both correct, both wrong
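As a quick illustration of how the answer records above can be consumed, the sketch below loads both answer files and counts naive string agreement between the pipelines; the authoritative numbers come from scripts/06_evaluate_qa.py, which scores against the ground-truth lists:

```python
import json
from pathlib import Path

def load_answers(path: Path) -> dict[str, str]:
    """Map questionId -> normalized predicted answer from an answers JSONL file."""
    answers = {}
    with path.open() as f:
        for line in f:
            rec = json.loads(line)
            answers[rec["questionId"]] = rec["answer"].strip().lower()
    return answers

textract = load_answers(Path("outputs/answers/textract/validation/validation_textract_answers.jsonl"))
vlm = load_answers(Path("outputs/answers/vlm/validation/validation_vlm_answers.jsonl"))

shared = textract.keys() & vlm.keys()
agree = sum(textract[q] == vlm[q] for q in shared)
print(f"Agreement: {agree}/{len(shared)} ({agree / len(shared):.1%})")
```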
- Exact Match (EM): Binary score after normalization (lowercase, remove punctuation/articles)
- Token-level F1: Precision/recall based on token overlap (returns max F1 vs all ground truths)
- Numeric Tolerance: ±0.5% relative tolerance for numeric answers
- Average/median latency per question
- Token usage (input/output/total)
- Estimated cost per 1,000 questions
- Win distribution by question type
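A minimal sketch of how these metrics can be computed; normalization details and function names are illustrative and may differ from scripts/06_evaluate_qa.py:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, truths: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized ground truth."""
    return float(any(normalize(pred) == normalize(t) for t in truths))

def token_f1(pred: str, truth: str) -> float:
    """Token-overlap F1 between a prediction and one ground truth."""
    p, t = normalize(pred).split(), normalize(truth).split()
    common = sum(min(p.count(tok), t.count(tok)) for tok in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

def best_f1(pred: str, truths: list[str]) -> float:
    """Max F1 over all ground-truth variants."""
    return max(token_f1(pred, t) for t in truths)

def numeric_match(pred: str, truth: str, tol: float = 0.005) -> bool:
    """±0.5% relative tolerance for numeric answers like '$1,234.56'."""
    try:
        p = float(pred.replace(",", "").replace("$", "").strip())
        t = float(truth.replace(",", "").replace("$", "").strip())
    except ValueError:
        return False
    return abs(p - t) <= tol * max(abs(t), 1e-9)
```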
Overall Performance:
Textract (OCR → Text LLM):
• Exact Match: 85.66%
• F1 Score: 88.20%
• Avg Tokens: 772/question
• Avg Latency: 5.58s/question
• Total Cost: ~$16
VLM (Direct Vision):
• Exact Match: 90.41% (+4.75%)
• F1 Score: 92.30% (+4.09%)
• Avg Tokens: 1,199/question (1.6x more)
• Avg Latency: 13.10s/question (2.3x slower)
• Total Cost: ~$24 (1.5x more expensive)
Win Distribution:
- VLM wins: 576 questions (10.8%)
- Textract wins: 272 questions (5.1%)
- Ties: 4,501 questions (84.1%)
Key Insight: While VLM provides 4% better accuracy on average, both pipelines produce identical results in 84% of cases. When they differ, VLM outperforms Textract 2:1.
| Dimension | Textract Advantage | VLM Advantage |
|---|---|---|
| Accuracy | - | +4.09% F1, +4.75% EM |
| Speed | 2.3x faster (5.58s vs 13.10s) | - |
| Cost | 33% cheaper ($16 vs $24) | - |
| Token Efficiency | 1.6x fewer tokens (772 vs 1,199) | - |
Choose Textract (OCR → Text LLM) when:
- ✅ High-volume processing (>10K questions/day)
- ✅ Cost sensitivity is critical
- ✅ Response time matters (real-time applications)
- ✅ 88% F1 accuracy is acceptable
- ✅ Documents are primarily typed/printed text
Choose VLM (Direct Vision) when:
- ✅ Maximum accuracy is required (90%+ EM)
- ✅ Processing handwritten or complex layouts
- ✅ Handling forms, diagrams, or mixed content
- ✅ Cost and latency are secondary concerns
- ✅ Questions require visual understanding beyond text
Hybrid Approach (Recommended for Production):
- Use Textract as default (fast + cheap)
- Route to VLM when Textract confidence < 0.7
- Expected performance: ~89-90% F1 at ~$18-20 cost (optimal balance)
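A sketch of the confidence-based routing idea; the 0.7 threshold comes from the bullet above, and the two pipeline wrappers are hypothetical placeholders, not functions from this repo:

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.7  # threshold from the recommendation above

def route_question(
    image_path: str,
    question: str,
    textract_answer: Callable[[str, str], dict],  # hypothetical wrapper around the Textract pipeline
    vlm_answer: Callable[[str, str], dict],       # hypothetical wrapper around the VLM pipeline
) -> dict:
    """Use the cheap pipeline by default; escalate only when its confidence is low."""
    result = textract_answer(image_path, question)
    if result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return {**result, "pipeline": "textract"}
    # Low confidence: pay the extra cost and latency for the more accurate vision pipeline.
    return {**vlm_answer(image_path, question), "pipeline": "vlm"}
```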
- Streaming: Process large datasets line-by-line
- Resumability: Skip already-processed entries automatically
- Git-friendly: Line-based format for clean diffs
- Collaboration: Single consolidated file per split (not thousands of JSON files)
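The resumability pattern boils down to skipping IDs that already exist in the output file and appending new records one line at a time; a minimal sketch:

```python
import json
from pathlib import Path

def processed_ids(output_path: Path) -> set[str]:
    """Collect questionIds already written, so a rerun can skip them."""
    if not output_path.exists():
        return set()
    with output_path.open() as f:
        return {json.loads(line)["questionId"] for line in f}

def append_record(output_path: Path, record: dict) -> None:
    """Append one result as a single JSON line; safe to resume after an interruption."""
    with output_path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```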
The test split has null values for the `question_types` and `answers` fields, so it cannot be evaluated; only the validation split (5,349 samples) has proper ground truth.
- Textract pipeline: 4,000 max_completion_tokens (handles dense OCR text)
- VLM pipeline: 6,000 max_completion_tokens (complex visual analysis)
- Both include truncated JSON recovery logic
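One simple way to recover an answer from truncated JSON output, shown purely as an illustration of the idea (the scripts may use a different strategy):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse the model's JSON output, salvaging the answer if the output was cut off."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Truncated output: fall back to pulling the "answer" value out with a regex.
        m = re.search(r'"answer"\s*:\s*"([^"]*)"', raw)
        return {"answer": m.group(1) if m else "", "truncated": True}
```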
- Textract: $0.0015/page (logged per API call)
- GPT-5 Text: $1.25/M input tokens, $10/M output tokens
- GPT-5 Vision: Higher cost due to image tokens (tracked separately)
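Per-question cost can be estimated from these rates; a sketch using the prices listed above (the token split in the example is an assumption):

```python
# USD rates from the pricing notes above.
TEXTRACT_PER_PAGE = 0.0015
GPT5_TEXT_IN_PER_M = 1.25
GPT5_TEXT_OUT_PER_M = 10.0

def textract_question_cost(input_tokens: int, output_tokens: int, pages: int = 1) -> float:
    """Estimated cost of answering one question with the Textract + GPT-5 text pipeline."""
    llm_cost = (input_tokens / 1e6) * GPT5_TEXT_IN_PER_M + (output_tokens / 1e6) * GPT5_TEXT_OUT_PER_M
    return pages * TEXTRACT_PER_PAGE + llm_cost

# e.g. textract_question_cost(700, 70) ≈ $0.0031/question, roughly the ~$3 per 1K questions observed.
```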
All settings centralized in config/config.py:
- Paths (data, outputs, manifests)
- API keys (loaded from .env)
- Model names
- Dataset splits
- Sample sizes
Never hardcode values; always use `from config import config`.
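A minimal sketch of what such a config module might look like, assuming python-dotenv for .env loading; field names here are illustrative and may not match the real config/config.py:

```python
# config/config.py (illustrative sketch)
import os
from dataclasses import dataclass
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()  # pull OPENAI_API_KEY, AWS_* credentials, etc. from .env

@dataclass(frozen=True)
class Config:
    PROJECT_ROOT: Path = Path(__file__).resolve().parent.parent
    DATA_DIR: Path = PROJECT_ROOT / "data"
    OUTPUTS_DIR: Path = PROJECT_ROOT / "outputs"
    MANIFESTS_DIR: Path = DATA_DIR / "processed" / "manifests"
    OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
    TEXT_MODEL: str = "gpt-5"
    VLM_MODEL: str = "gpt-5"
    SPLIT: str = "validation"
    SAMPLE_SIZE: int = 5

config = Config()
```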
Managed via Poetry (pyproject.toml):
- Python 3.13.5
- OpenAI SDK (GPT-5)
- Boto3 (AWS Textract)
- Datasets (Hugging Face)
- Loguru (logging)
All scripts use Loguru logger (not print statements):
from loguru import logger
logger.info("Processing {}", image_path)
logger.error("API call failed: {}", error)- Retry logic with exponential backoff (Textract API)
- Truncated JSON recovery (LLM token limits)
- Empty response detection
- Graceful failure with detailed error messages
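A sketch of the retry-with-backoff pattern for Textract calls; the delay schedule and exception handling are illustrative:

```python
import time

import botocore.exceptions
from loguru import logger

def call_with_retries(fn, *args, max_attempts: int = 5, base_delay: float = 1.0, **kwargs):
    """Retry a flaky API call with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except botocore.exceptions.ClientError as err:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error
            delay = base_delay * 2 ** attempt
            logger.warning("Attempt {} failed ({}); retrying in {:.0f}s", attempt + 1, err, delay)
            time.sleep(delay)
```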
Actual Costs (Full Validation Split - 5,349 Questions):
| Pipeline | Component | Unit Cost | Actual Cost | Notes |
|---|---|---|---|---|
| Textract | OCR | $0.0015/page | ~$8.00 | One-time cost, 5,349 pages |
| Textract | GPT-5 Text | $1.25/M in, $10/M out | ~$8.00 | 4.1M total tokens |
| Textract Total | | | ~$16.00 | $2.99 per 1K questions |
| VLM | GPT-5 Vision | $2.50/M in, $10/M out | ~$24.42 | 6.4M total tokens |
| VLM Total | | | ~$24.42 | $4.57 per 1K questions |
Cost-Accuracy Analysis:
- VLM costs 53% more than Textract ($24.42 vs $16.00)
- VLM scores 4.09 F1 points higher (92.30% vs 88.20%)
- Cost per F1 point (total cost ÷ F1 score): Textract ≈ $0.18, VLM ≈ $0.26; VLM's extra 4.09 F1 points cost about $8.42 more (~$2.06 per additional point)
- For production at scale (100K questions): Textract = $299, VLM = $457
Recommendation:
- Use Textract for cost-sensitive, high-volume scenarios where 88% accuracy is acceptable
- Use VLM for critical applications requiring maximum accuracy (90%+ EM)
- Consider hybrid approach: Use VLM only for questions where Textract has low confidence
- ✅ Full Textract pipeline (5,349 questions)
- ✅ Full VLM pipeline (5,349 questions)
- ✅ Comprehensive evaluation and comparison
- Question Type Analysis: Deep dive into performance by document type (handwritten, forms, diagrams)
- Hybrid Pipeline: Combine both approaches using confidence-based routing
- Error Analysis: Review cases where both pipelines failed
- Cost Optimization: Experiment with smaller vision models (GPT-4 Vision, Claude)
- Production Deployment: API endpoint with intelligent pipeline selection
This is a collaborative project designed for distributed work:
- Manifests (image/question JSONL) ensure dataset consistency
- JSONL outputs are git-tracked for result sharing
- Cost reports are local-only (git-ignored)
- All paths relative to `config.PROJECT_ROOT`
MIT License