A comprehensive evaluation framework comparing two document question-answering pipelines on 5,349 real-world questions from the DocVQA dataset:
- Textract OCR → GPT-5 Text LLM: AWS Textract extracts text, then GPT-5 answers questions
- GPT-5 Vision (VLM): Direct visual analysis without OCR
Winner: VLM (by accuracy) | Textract (by cost/speed)
- VLM: 90.41% EM, 92.30% F1 | $24, 13.10s/question
- Textract: 85.66% EM, 88.20% F1 | $16, 5.58s/question
VLM is about 4 points more accurate (EM/F1) but 1.5x more expensive and 2.3x slower. Both pipelines agree on 84% of questions. For production, a hybrid approach (Textract by default, VLM for low-confidence cases) offers the best balance.
✅ Complete - Full Evaluation Finished
Both pipelines have been fully evaluated on the complete validation split:
- Dataset: `lmms-lab/DocVQA` validation split (5,349 questions with proper ground truth)
- Textract Pipeline: ✅ Complete (5,349/5,349 questions)
- VLM Pipeline: ✅ Complete (5,349/5,349 questions)
- Evaluation: ✅ Complete with comprehensive metrics
| Metric | Textract (OCR → Text LLM) | VLM (Direct Vision) | Winner |
|---|---|---|---|
| Exact Match (EM) | 85.66% | 90.41% | VLM (+4.75%) |
| F1 Score | 88.20% | 92.30% | VLM (+4.09%) |
| Avg Tokens/Question | 772 | 1,199 | Textract (1.6x fewer) |
| Avg Latency | 5.58s | 13.10s | Textract (2.3x faster) |
| Total Cost (5,349 questions) | ~$16 | ~$24 | Textract (33% cheaper) |
Key Findings:
- 🏆 VLM outperforms Textract by 4.09 F1 points overall
- 💰 Cost-Accuracy Trade-off: VLM provides better accuracy but costs 1.5x more and takes 2.3x longer
- 🤝 High Agreement: Both pipelines produce identical results on 84.1% of questions
- 📊 When They Differ: VLM wins 10.8% of questions, Textract wins 5.1%
Image → AWS Textract (DetectDocumentText) → OCR Text → GPT-5 → Answer
Image → GPT-5 Vision (base64 encoded) → Answer
scripts/01_prepare_dataset.py
↓ (Image & Question Manifests)
├─→ scripts/02_run_textract.py
│ ↓ (OCR Results JSONL)
│ scripts/05_answer_questions_textract.py
│ ↓ (Answers JSONL)
└─→ scripts/05_answer_questions_vlm.py
↓ (Answers JSONL)
scripts/06_evaluate_qa.py → Evaluation Report (JSON)
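For orientation, here is a minimal sketch of the two core API calls behind these flows; prompt wording, parameters, and helper names are illustrative assumptions, not copied from the scripts:

```python
import base64

import boto3
from openai import OpenAI

def textract_ocr(image_path: str) -> str:
    """OCR one page with AWS Textract DetectDocumentText and return plain text."""
    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        resp = client.detect_document_text(Document={"Bytes": f.read()})
    # Keep LINE blocks only; WORD blocks repeat the same text.
    return "\n".join(b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE")

def answer_from_image(image_path: str, question: str, model: str = "gpt-5") -> str:
    """Answer a question directly from the page image with a vision-capable model."""
    client = OpenAI()
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Answer concisely: {question}"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

The Textract pipeline feeds the extracted text plus the question into the text model; the VLM pipeline skips OCR entirely and sends the image itself.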
# Clone the repository
git clone https://github.com/longhoag/docqa-ocr-vlm-v2.git
cd docqa-ocr-vlm-v2
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -
# Install dependencies
make install
# Configure environment variables
cp .env.example .env
# Edit .env and add:
# OPENAI_API_KEY=your_openai_api_key
# AWS_ACCESS_KEY_ID=your_aws_key (for Textract)
# AWS_SECRET_ACCESS_KEY=your_aws_secret (for Textract)

Step 1: Prepare Dataset (One-Time Setup)
# Downloads DocVQA validation split, extracts images, creates manifests
make prepare
# Output: 5,349 images + image/question manifests

Step 2a: Textract Pipeline
# Test on 5 samples first (recommended)
make textract-sample # ~$0.01, 30 seconds
make answer-textract-sample # ~$0.10, 1 minute
# Run full validation split
make textract-full # ~$8, 2-3 hours (5,349 images)
make answer-textract-full   # ~$8, 6-8 hours (5,349 questions)

Step 2b: VLM Pipeline
# Test on 5 samples first (recommended)
make vlm-sample # ~$0.10, 1 minute
# Run full validation split
make vlm-full   # ~$24, 18-24 hours (5,349 questions)

Step 3: Evaluate & Compare
# Generate comprehensive evaluation report
make evaluate
# Output: outputs/evaluation/validation_evaluation_report.json

# View evaluation summary
python -c "
import json
with open('outputs/evaluation/validation_evaluation_report.json') as f:
    data = json.load(f)
om = data['overall_metrics']
print(f\"Textract - EM: {om['textract']['avg_em']:.2%}, F1: {om['textract']['avg_f1']:.2%}\")
print(f\"VLM - EM: {om['vlm']['avg_em']:.2%}, F1: {om['vlm']['avg_f1']:.2%}\")
"Run make help to see all available commands, or use these directly:
Setup & Preparation:
- `make install` - Install all dependencies via Poetry
- `make prepare` - Download dataset, extract images, create manifests (one-time)
Textract Pipeline:
- `make textract-sample` - Process 5 validation images with Textract (test first!)
- `make textract-full` - Process all 5,349 images (~$8, 2-3 hours)
- `make answer-textract-sample` - Answer 5 questions using Textract + GPT-5
- `make answer-textract-full` - Answer all 5,349 questions (~$8, 6-8 hours)
VLM Pipeline:
- `make vlm-sample` - Process 5 questions with GPT-5 Vision (test first!)
- `make vlm-full` - Process all 5,349 questions (~$24, 18-24 hours)
Evaluation:
- `make evaluate` - Generate comprehensive comparison report
Cleanup:
- `make clean` - Remove all generated data and outputs
config/ # Centralized configuration
config.py # Settings, paths, API keys (from .env)
scripts/ # Pipeline scripts
01_prepare_dataset.py # Extract dataset → manifests + images
02_run_textract.py # AWS Textract OCR
05_answer_questions_textract.py # Textract + GPT-5 text model
05_answer_questions_vlm.py # GPT-5 Vision model
06_evaluate_qa.py # Compare pipelines, generate report
utils/ # Helper modules
logger.py # Loguru setup
cost_tracker.py # Cost/token tracking
manifest.py # JSONL read/write
data/
raw/images/validation/ # Extracted images (5,349 PNGs, git-ignored)
processed/manifests/ # Collaboration manifests (git-tracked)
image/validation.jsonl # Image paths + metadata
question/validation.jsonl # Questions + ground truth
outputs/ # Pipeline outputs (JSONL format)
textract/validation/
validation_textract_outputs.jsonl # All OCR results (7.7MB, git-tracked)
answers/textract/validation/
validation_textract_answers.jsonl # All answers (git-tracked)
answers/vlm/validation/
validation_vlm_answers.jsonl # All answers (git-tracked)
evaluation/
validation_evaluation_report.json # Metrics report (git-tracked)
All pipeline outputs use JSONL (JSON Lines) for efficient streaming and resumability:
{"questionId": "12345", "text": "extracted text...", "blocks": [...], "runtime": 1.23, "cost": 0.0015}{
"questionId": "12345",
"question": "What is the total?",
"answer": "$1,234.56",
"confidence": 0.95,
"ground_truth_answers": ["$1,234.56", "1234.56"],
"question_types": ["numeric", "extraction"],
"tokens": {"input": 247, "output": 156, "total": 403},
"latency_seconds": 2.45,
"pipeline": "textract",
"model": "gpt-5"
}

- Overall metrics: EM, F1, tokens, latency per pipeline
- Metrics by question type: breakdown by handwritten, form, figure/diagram, layout, others
- Consistent wins: question types where one pipeline dominates (>70% win rate)
- Example cases: significant wins, both correct, both wrong
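As a quick illustration of how the answer records above can be consumed, the sketch below loads both answer files and counts naive string agreement between the pipelines; the authoritative numbers come from scripts/06_evaluate_qa.py, which scores against the ground-truth lists:

```python
import json
from pathlib import Path

def load_answers(path: Path) -> dict[str, str]:
    """Map questionId -> normalized predicted answer from an answers JSONL file."""
    answers = {}
    with path.open() as f:
        for line in f:
            rec = json.loads(line)
            answers[rec["questionId"]] = rec["answer"].strip().lower()
    return answers

textract = load_answers(Path("outputs/answers/textract/validation/validation_textract_answers.jsonl"))
vlm = load_answers(Path("outputs/answers/vlm/validation/validation_vlm_answers.jsonl"))

shared = textract.keys() & vlm.keys()
agree = sum(textract[q] == vlm[q] for q in shared)
print(f"Agreement: {agree}/{len(shared)} ({agree / len(shared):.1%})")
```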
- Exact Match (EM): Binary score after normalization (lowercase, remove punctuation/articles)
- Token-level F1: Precision/recall based on token overlap (returns max F1 vs all ground truths)
- Numeric Tolerance: ±0.5% relative tolerance for numeric answers
- Average/median latency per question
- Token usage (input/output/total)
- Estimated cost per 1,000 questions
- Win distribution by question type
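A minimal sketch of how these metrics can be computed; normalization details and function names are illustrative and may differ from scripts/06_evaluate_qa.py:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, truths: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized ground truth."""
    return float(any(normalize(pred) == normalize(t) for t in truths))

def token_f1(pred: str, truth: str) -> float:
    """Token-overlap F1 between a prediction and one ground truth."""
    p, t = normalize(pred).split(), normalize(truth).split()
    common = sum(min(p.count(tok), t.count(tok)) for tok in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

def best_f1(pred: str, truths: list[str]) -> float:
    """Max F1 over all ground-truth variants."""
    return max(token_f1(pred, t) for t in truths)

def numeric_match(pred: str, truth: str, tol: float = 0.005) -> bool:
    """±0.5% relative tolerance for numeric answers like '$1,234.56'."""
    try:
        p = float(pred.replace(",", "").replace("$", "").strip())
        t = float(truth.replace(",", "").replace("$", "").strip())
    except ValueError:
        return False
    return abs(p - t) <= tol * max(abs(t), 1e-9)
```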
Overall Performance:
Textract (OCR → Text LLM):
• Exact Match: 85.66%
• F1 Score: 88.20%
• Avg Tokens: 772/question
• Avg Latency: 5.58s/question
• Total Cost: ~$16
VLM (Direct Vision):
• Exact Match: 90.41% (+4.75%)
• F1 Score: 92.30% (+4.09%)
• Avg Tokens: 1,199/question (1.6x more)
• Avg Latency: 13.10s/question (2.3x slower)
• Total Cost: ~$24 (1.5x more expensive)
Win Distribution:
- VLM wins: 576 questions (10.8%)
- Textract wins: 272 questions (5.1%)
- Ties: 4,501 questions (84.1%)
Key Insight: While VLM provides 4% better accuracy on average, both pipelines produce identical results in 84% of cases. When they differ, VLM outperforms Textract 2:1.
| Dimension | Textract Advantage | VLM Advantage |
|---|---|---|
| Accuracy | - | +4.09% F1, +4.75% EM |
| Speed | 2.3x faster (5.58s vs 13.10s) | - |
| Cost | 33% cheaper ($16 vs $24) | - |
| Token Efficiency | 1.6x fewer tokens (772 vs 1,199) | - |
Choose Textract (OCR → Text LLM) when:
- ✅ High-volume processing (>10K questions/day)
- ✅ Cost sensitivity is critical
- ✅ Response time matters (real-time applications)
- ✅ 88% F1 accuracy is acceptable
- ✅ Documents are primarily typed/printed text
Choose VLM (Direct Vision) when:
- ✅ Maximum accuracy is required (90%+ EM)
- ✅ Processing handwritten or complex layouts
- ✅ Handling forms, diagrams, or mixed content
- ✅ Cost and latency are secondary concerns
- ✅ Questions require visual understanding beyond text
Hybrid Approach (Recommended for Production):
- Use Textract as default (fast + cheap)
- Route to VLM when Textract confidence < 0.7
- Expected performance: ~89-90% F1 at ~$18-20 cost (optimal balance)
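A sketch of the confidence-based routing idea; the 0.7 threshold comes from the bullet above, and the two pipeline wrappers are hypothetical placeholders, not functions from this repo:

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.7  # threshold from the recommendation above

def route_question(
    image_path: str,
    question: str,
    textract_answer: Callable[[str, str], dict],  # hypothetical wrapper around the Textract pipeline
    vlm_answer: Callable[[str, str], dict],       # hypothetical wrapper around the VLM pipeline
) -> dict:
    """Use the cheap pipeline by default; escalate only when its confidence is low."""
    result = textract_answer(image_path, question)
    if result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return {**result, "pipeline": "textract"}
    # Low confidence: pay the extra cost and latency for the more accurate vision pipeline.
    return {**vlm_answer(image_path, question), "pipeline": "vlm"}
```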
- Streaming: Process large datasets line-by-line
- Resumability: Skip already-processed entries automatically
- Git-friendly: Line-based format for clean diffs
- Collaboration: Single consolidated file per split (not thousands of JSON files)
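The resumability pattern boils down to skipping IDs that already exist in the output file and appending new records one line at a time; a minimal sketch:

```python
import json
from pathlib import Path

def processed_ids(output_path: Path) -> set[str]:
    """Collect questionIds already written, so a rerun can skip them."""
    if not output_path.exists():
        return set()
    with output_path.open() as f:
        return {json.loads(line)["questionId"] for line in f}

def append_record(output_path: Path, record: dict) -> None:
    """Append one result as a single JSON line; safe to resume after an interruption."""
    with output_path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```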
The test split has null values for the `question_types` and `answers` fields, so it cannot be evaluated; only the validation split (5,349 samples) has proper ground truth.
- Textract pipeline: 4,000 max_completion_tokens (handles dense OCR text)
- VLM pipeline: 6,000 max_completion_tokens (complex visual analysis)
- Both include truncated JSON recovery logic
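One simple way to recover an answer from truncated JSON output, shown purely as an illustration of the idea (the scripts may use a different strategy):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse the model's JSON output, salvaging the answer if the output was cut off."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Truncated output: fall back to pulling the "answer" value out with a regex.
        m = re.search(r'"answer"\s*:\s*"([^"]*)"', raw)
        return {"answer": m.group(1) if m else "", "truncated": True}
```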
- Textract: $0.0015/page (logged per API call)
- GPT-5 Text: $1.25/M input tokens, $10/M output tokens
- GPT-5 Vision: Higher cost due to image tokens (tracked separately)
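Per-question cost can be estimated from these rates; a sketch using the prices listed above (the token split in the example is an assumption):

```python
# USD rates from the pricing notes above.
TEXTRACT_PER_PAGE = 0.0015
GPT5_TEXT_IN_PER_M = 1.25
GPT5_TEXT_OUT_PER_M = 10.0

def textract_question_cost(input_tokens: int, output_tokens: int, pages: int = 1) -> float:
    """Estimated cost of answering one question with the Textract + GPT-5 text pipeline."""
    llm_cost = (input_tokens / 1e6) * GPT5_TEXT_IN_PER_M + (output_tokens / 1e6) * GPT5_TEXT_OUT_PER_M
    return pages * TEXTRACT_PER_PAGE + llm_cost

# e.g. textract_question_cost(700, 70) ≈ $0.0031/question, roughly the ~$3 per 1K questions observed.
```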
All settings centralized in config/config.py:
- Paths (data, outputs, manifests)
- API keys (loaded from .env)
- Model names
- Dataset splits
- Sample sizes
Never hardcode values; always use `from config import config`.
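A minimal sketch of what such a config module might look like, assuming python-dotenv for .env loading; field names here are illustrative and may not match the real config/config.py:

```python
# config/config.py (illustrative sketch)
import os
from dataclasses import dataclass
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()  # pull OPENAI_API_KEY, AWS_* credentials, etc. from .env

@dataclass(frozen=True)
class Config:
    PROJECT_ROOT: Path = Path(__file__).resolve().parent.parent
    DATA_DIR: Path = PROJECT_ROOT / "data"
    OUTPUTS_DIR: Path = PROJECT_ROOT / "outputs"
    MANIFESTS_DIR: Path = DATA_DIR / "processed" / "manifests"
    OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
    TEXT_MODEL: str = "gpt-5"
    VLM_MODEL: str = "gpt-5"
    SPLIT: str = "validation"
    SAMPLE_SIZE: int = 5

config = Config()
```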
Managed via Poetry (pyproject.toml):
- Python 3.13.5
- OpenAI SDK (GPT-5)
- Boto3 (AWS Textract)
- Datasets (Hugging Face)
- Loguru (logging)
All scripts use Loguru logger (not print statements):
from loguru import logger
logger.info("Processing {}", image_path)
logger.error("API call failed: {}", error)- Retry logic with exponential backoff (Textract API)
- Truncated JSON recovery (LLM token limits)
- Empty response detection
- Graceful failure with detailed error messages
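A sketch of the retry-with-backoff pattern for Textract calls; the delay schedule and exception handling are illustrative:

```python
import time

import botocore.exceptions
from loguru import logger

def call_with_retries(fn, *args, max_attempts: int = 5, base_delay: float = 1.0, **kwargs):
    """Retry a flaky API call with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except botocore.exceptions.ClientError as err:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error
            delay = base_delay * 2 ** attempt
            logger.warning("Attempt {} failed ({}); retrying in {:.0f}s", attempt + 1, err, delay)
            time.sleep(delay)
```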
Actual Costs (Full Validation Split - 5,349 Questions):
| Pipeline | Component | Unit Cost | Actual Cost | Notes |
|---|---|---|---|---|
| Textract | OCR | $0.0015/page | ~$8.00 | One-time cost, 5,349 pages |
| Textract | GPT-5 Text | $1.25/M in, $10/M out | ~$8.00 | 4.1M total tokens |
| Textract Total | | | ~$16.00 | $2.99 per 1K questions |
| VLM | GPT-5 Vision | $2.50/M in, $10/M out | ~$24.42 | 6.4M total tokens |
| VLM Total | | | ~$24.42 | $4.57 per 1K questions |
Cost-Accuracy Analysis:
- VLM costs 53% more than Textract ($24.42 vs $16.00)
- VLM scores 4.09 F1 points higher (92.30% vs 88.20%)
- Cost per F1 point (total cost ÷ F1 score): Textract ≈ $0.18, VLM ≈ $0.26; VLM's extra 4.09 F1 points cost about $8.42 more (~$2.06 per additional point)
- For production at scale (100K questions): Textract = $299, VLM = $457
Recommendation:
- Use Textract for cost-sensitive, high-volume scenarios where 88% accuracy is acceptable
- Use VLM for critical applications requiring maximum accuracy (90%+ EM)
- Consider hybrid approach: Use VLM only for questions where Textract has low confidence
- ✅ Full Textract pipeline (5,349 questions)
- ✅ Full VLM pipeline (5,349 questions)
- ✅ Comprehensive evaluation and comparison
- Question Type Analysis: Deep dive into performance by document type (handwritten, forms, diagrams)
- Hybrid Pipeline: Combine both approaches using confidence-based routing
- Error Analysis: Review cases where both pipelines failed
- Cost Optimization: Experiment with smaller vision models (GPT-4 Vision, Claude)
- Production Deployment: API endpoint with intelligent pipeline selection
This is a collaborative project designed for distributed work:
- Manifests (image/question JSONL) ensure dataset consistency
- JSONL outputs are git-tracked for result sharing
- Cost reports are local-only (git-ignored)
- All paths relative to `config.PROJECT_ROOT`
MIT License