DocQA OCR vs VLM Comparison

A comprehensive evaluation framework comparing two document question-answering pipelines on 5,349 real-world questions from the DocVQA dataset:

  1. Textract OCR → GPT-5 Text LLM: AWS Textract extracts text, then GPT-5 answers questions
  2. GPT-5 Vision (VLM): Direct visual analysis without OCR

TL;DR - Results

Winner: VLM (by accuracy) | Textract (by cost/speed)

  • VLM: 90.41% EM, 92.30% F1 | $24, 13.10s/question
  • Textract: 85.66% EM, 88.20% F1 | $16, 5.58s/question

VLM is about 4 percentage points more accurate (F1) but 1.5x more expensive and 2.3x slower. The two pipelines agree on 84% of questions. For production, a hybrid approach (Textract by default, VLM for low-confidence cases) offers the best balance.

Project Status

Complete - Full Evaluation Finished

Both pipelines have been fully evaluated on the complete validation split:

  • Dataset: lmms-lab/DocVQA validation split (5,349 questions with proper ground truth)
  • Textract Pipeline: ✅ Complete (5,349/5,349 questions)
  • VLM Pipeline: ✅ Complete (5,349/5,349 questions)
  • Evaluation: ✅ Complete with comprehensive metrics

Final Results Summary

| Metric | Textract (OCR → Text LLM) | VLM (Direct Vision) | Winner |
|---|---|---|---|
| Exact Match (EM) | 85.66% | 90.41% | VLM (+4.75%) |
| F1 Score | 88.20% | 92.30% | VLM (+4.09%) |
| Avg Tokens/Question | 772 | 1,199 | Textract (1.6x fewer) |
| Avg Latency | 5.58s | 13.10s | Textract (2.3x faster) |
| Total Cost (5,349 questions) | ~$16 | ~$24 | Textract (33% cheaper) |

Key Findings:

  • 🏆 VLM outperforms Textract by 4.09% F1 score overall
  • 💰 Cost-Accuracy Trade-off: VLM provides better accuracy but costs 1.5x more and takes 2.3x longer
  • 🤝 High Agreement: Both pipelines produce identical results on 84.1% of questions
  • 📊 When They Differ: VLM wins 10.8% of questions, Textract wins 5.1%

Architecture

Pipeline 1: Textract OCR → Text LLM

Image → AWS Textract (DetectDocumentText) → OCR Text → GPT-5 → Answer
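
A minimal sketch of the OCR step with boto3 is shown below; the region name is an assumption, and the real script (scripts/02_run_textract.py) adds retry logic and cost tracking on top of this call.

# OCR sketch: DetectDocumentText on local image bytes, joining LINE blocks into plain text.
# The region name is an assumption; the real script also records runtime and cost.
import boto3

textract = boto3.client("textract", region_name="us-east-1")

def ocr_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})
    lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
    return "\n".join(lines)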

Pipeline 2: Vision Language Model

Image → GPT-5 Vision (base64 encoded) → Answer
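
A minimal sketch of the vision call is shown below, assuming the OpenAI chat completions API with a base64 data URI; the prompt format and PNG media type are assumptions, and the model name comes from config in the real script (scripts/05_answer_questions_vlm.py).

# Vision-call sketch: send the question plus a base64-encoded image in one message.
# Model name and prompt format are assumptions; the real script adds structured-output handling.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_from_image(image_path: str, question: str, model: str = "gpt-5") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content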

Data Flow

scripts/01_prepare_dataset.py
    ↓ (Image & Question Manifests)
    ├─→ scripts/02_run_textract.py
    │       ↓ (OCR Results JSONL)
    │   scripts/05_answer_questions_textract.py
    │       ↓ (Answers JSONL)
    └─→ scripts/05_answer_questions_vlm.py
            ↓ (Answers JSONL)
scripts/06_evaluate_qa.py → Evaluation Report (JSON)

Quick Start

Prerequisites

# Clone the repository
git clone https://github.com/longhoag/docqa-ocr-vlm-v2.git
cd docqa-ocr-vlm-v2

# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies
make install

# Configure environment variables
cp .env.example .env
# Edit .env and add:
#   OPENAI_API_KEY=your_openai_api_key
#   AWS_ACCESS_KEY_ID=your_aws_key (for Textract)
#   AWS_SECRET_ACCESS_KEY=your_aws_secret (for Textract)

Running the Complete Pipeline

Step 1: Prepare Dataset (One-Time Setup)

# Downloads DocVQA validation split, extracts images, creates manifests
make prepare
# Output: 5,349 images + image/question manifests

Step 2a: Textract Pipeline

# Test on 5 samples first (recommended)
make textract-sample              # ~$0.01, 30 seconds
make answer-textract-sample       # ~$0.10, 1 minute

# Run full validation split
make textract-full                # ~$8, 2-3 hours (5,349 images)
make answer-textract-full         # ~$8, 6-8 hours (5,349 questions)

Step 2b: VLM Pipeline

# Test on 5 samples first (recommended)
make vlm-sample                   # ~$0.10, 1 minute

# Run full validation split
make vlm-full                     # ~$24, 18-24 hours (5,349 questions)

Step 3: Evaluate & Compare

# Generate comprehensive evaluation report
make evaluate
# Output: outputs/evaluation/validation_evaluation_report.json

View Results

# View evaluation summary
python -c "
import json
with open('outputs/evaluation/validation_evaluation_report.json') as f:
    data = json.load(f)
    om = data['overall_metrics']
    print(f\"Textract - EM: {om['textract']['avg_em']:.2%}, F1: {om['textract']['avg_f1']:.2%}\")
    print(f\"VLM      - EM: {om['vlm']['avg_em']:.2%}, F1: {om['vlm']['avg_f1']:.2%}\")
"

Available Make Commands

Run make help to see all available commands, or use these directly:

Setup & Preparation:

  • make install - Install all dependencies via Poetry
  • make prepare - Download dataset, extract images, create manifests (one-time)

Textract Pipeline:

  • make textract-sample - Process 5 validation images with Textract (test first!)
  • make textract-full - Process all 5,349 images (~$8, 2-3 hours)
  • make answer-textract-sample - Answer 5 questions using Textract+GPT-5
  • make answer-textract-full - Answer all 5,349 questions (~$8, 6-8 hours)

VLM Pipeline:

  • make vlm-sample - Process 5 questions with GPT-5 Vision (test first!)
  • make vlm-full - Process all 5,349 questions (~$24, 18-24 hours)

Evaluation:

  • make evaluate - Generate comprehensive comparison report

Cleanup:

  • make clean - Remove all generated data and outputs

Project Structure

config/                          # Centralized configuration
  config.py                      # Settings, paths, API keys (from .env)
scripts/                         # Pipeline scripts
  01_prepare_dataset.py          # Extract dataset → manifests + images
  02_run_textract.py             # AWS Textract OCR
  05_answer_questions_textract.py # Textract + GPT-5 text model
  05_answer_questions_vlm.py     # GPT-5 Vision model
  06_evaluate_qa.py              # Compare pipelines, generate report
utils/                           # Helper modules
  logger.py                      # Loguru setup
  cost_tracker.py                # Cost/token tracking
  manifest.py                    # JSONL read/write
data/
  raw/images/validation/         # Extracted images (5,349 PNGs, git-ignored)
  processed/manifests/           # Collaboration manifests (git-tracked)
    image/validation.jsonl       # Image paths + metadata
    question/validation.jsonl    # Questions + ground truth
outputs/                         # Pipeline outputs (JSONL format)
  textract/validation/
    validation_textract_outputs.jsonl  # All OCR results (7.7MB, git-tracked)
  answers/textract/validation/
    validation_textract_answers.jsonl  # All answers (git-tracked)
  answers/vlm/validation/
    validation_vlm_answers.jsonl       # All answers (git-tracked)
  evaluation/
    validation_evaluation_report.json  # Metrics report (git-tracked)

Output Formats

All pipeline outputs use JSONL (JSON Lines) for efficient streaming and resumability:

Textract Output (validation_textract_outputs.jsonl)

{"questionId": "12345", "text": "extracted text...", "blocks": [...], "runtime": 1.23, "cost": 0.0015}

Answer Output (validation_{pipeline}_answers.jsonl)

{
  "questionId": "12345",
  "question": "What is the total?",
  "answer": "$1,234.56",
  "confidence": 0.95,
  "ground_truth_answers": ["$1,234.56", "1234.56"],
  "question_types": ["numeric", "extraction"],
  "tokens": {"input": 247, "output": 156, "total": 403},
  "latency_seconds": 2.45,
  "pipeline": "textract",
  "model": "gpt-5"
}

Evaluation Report (validation_evaluation_report.json)

  • Overall metrics: EM, F1, tokens, latency per pipeline
  • Metrics by question type: breakdown by handwritten, form, figure/diagram, layout, others
  • Consistent wins: question types where one pipeline dominates (>70% win rate)
  • Example cases: significant wins, both correct, both wrong

Evaluation Metrics

QA Accuracy

  • Exact Match (EM): Binary score after normalization (lowercase, remove punctuation/articles)
  • Token-level F1: Precision/recall based on token overlap (returns max F1 vs all ground truths)
  • Numeric Tolerance: ±0.5% relative tolerance for numeric answers
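
A minimal sketch of how these three metrics can be computed is shown below; it illustrates the definitions above and is not the code in scripts/06_evaluate_qa.py, whose exact normalization rules may differ.

# Illustrative metric helpers: normalized EM, max token-F1 over all ground truths,
# and a ±0.5% relative-tolerance check for numeric answers.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, truths: list[str]) -> float:
    return float(any(normalize(pred) == normalize(t) for t in truths))

def token_f1(pred: str, truths: list[str]) -> float:
    best = 0.0
    pred_tokens = normalize(pred).split()
    for truth in truths:
        truth_tokens = normalize(truth).split()
        common = sum(min(pred_tokens.count(tok), truth_tokens.count(tok)) for tok in set(pred_tokens))
        if not pred_tokens or not truth_tokens or common == 0:
            continue
        precision, recall = common / len(pred_tokens), common / len(truth_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

def numeric_match(pred: float, truth: float, rel_tol: float = 0.005) -> bool:
    return abs(pred - truth) <= rel_tol * abs(truth)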

Operational Metrics

  • Average/median latency per question
  • Token usage (input/output/total)
  • Estimated cost per 1,000 questions
  • Win distribution by question type

Full Evaluation Results (5,349 Questions)

Overall Performance:

Textract (OCR → Text LLM):
  • Exact Match: 85.66%
  • F1 Score: 88.20%
  • Avg Tokens: 772/question
  • Avg Latency: 5.58s/question
  • Total Cost: ~$16

VLM (Direct Vision):
  • Exact Match: 90.41% (+4.75%)
  • F1 Score: 92.30% (+4.09%)
  • Avg Tokens: 1,199/question (1.6x more)
  • Avg Latency: 13.10s/question (2.3x slower)
  • Total Cost: ~$24 (1.5x more expensive)

Win Distribution:

  • VLM wins: 576 questions (10.8%)
  • Textract wins: 272 questions (5.1%)
  • Ties: 4,501 questions (84.1%)

Key Insight: While VLM scores about 4 percentage points higher on average, both pipelines produce identical results in 84% of cases. When they differ, VLM wins roughly twice as often as Textract.

Detailed Performance Comparison

Speed vs Accuracy Trade-off

| Dimension | Textract Advantage | VLM Advantage |
|---|---|---|
| Accuracy | - | +4.09% F1, +4.75% EM |
| Speed | 2.3x faster (5.58s vs 13.10s) | - |
| Cost | 33% cheaper ($16 vs $24) | - |
| Token Efficiency | 1.6x fewer tokens (772 vs 1,199) | - |

When to Use Each Pipeline

Choose Textract (OCR → Text LLM) when:

  • ✅ High-volume processing (>10K questions/day)
  • ✅ Cost sensitivity is critical
  • ✅ Response time matters (real-time applications)
  • ✅ 88% F1 accuracy is acceptable
  • ✅ Documents are primarily typed/printed text

Choose VLM (Direct Vision) when:

  • ✅ Maximum accuracy is required (90%+ EM)
  • ✅ Processing handwritten or complex layouts
  • ✅ Handling forms, diagrams, or mixed content
  • ✅ Cost and latency are secondary concerns
  • ✅ Questions require visual understanding beyond text

Hybrid Approach (Recommended for Production):

  1. Use Textract as default (fast + cheap)
  2. Route to VLM when Textract confidence < 0.7
  3. Expected performance: ~89-90% F1 at ~$18-20 cost (optimal balance)
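
A sketch of that routing rule appears below; it assumes each Textract answer record carries the confidence field shown in the answer output format above, and run_vlm is a hypothetical callable that re-answers the question with the VLM pipeline.

# Confidence-based routing sketch (threshold from the recommendation above; run_vlm is hypothetical).
CONFIDENCE_THRESHOLD = 0.7

def route(textract_record: dict, run_vlm) -> dict:
    # Keep the cheap Textract answer when it is confident; otherwise fall back to the VLM.
    if textract_record.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return textract_record
    return run_vlm(textract_record["questionId"], textract_record["question"])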

Key Design Decisions

Why JSONL Format?

  • Streaming: Process large datasets line-by-line
  • Resumability: Skip already-processed entries automatically
  • Git-friendly: Line-based format for clean diffs
  • Collaboration: Single consolidated file per split (not thousands of JSON files)
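
The resumability point above boils down to skipping questionIds that already appear in the output file; a minimal sketch, assuming the questionId field shown in the output formats:

# Resumable JSONL sketch: collect already-written questionIds, append new records line by line.
import json
from pathlib import Path

def load_done_ids(output_path: Path) -> set[str]:
    if not output_path.exists():
        return set()
    with output_path.open() as f:
        return {json.loads(line)["questionId"] for line in f if line.strip()}

def append_record(output_path: Path, record: dict) -> None:
    with output_path.open("a") as f:
        f.write(json.dumps(record) + "\n")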

Why Validation Split Only?

The test split has null values for the question_types and answers fields, making evaluation impossible. Only the validation split (5,349 samples) has proper ground truth.

Token Limits

  • Textract pipeline: 4,000 max_completion_tokens (handles dense OCR text)
  • VLM pipeline: 6,000 max_completion_tokens (complex visual analysis)
  • Both include truncated JSON recovery logic

Cost Tracking

  • Textract: $0.0015/page (logged per API call)
  • GPT-5 Text: $1.25/M input tokens, $10/M output tokens
  • GPT-5 Vision: Higher cost due to image tokens (tracked separately)
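
The per-call arithmetic behind these numbers is straightforward; a minimal sketch using the unit prices above (utils/cost_tracker.py is the actual implementation, and the functions below are only an illustration):

# Cost arithmetic sketch using the unit prices listed above (USD).
TEXTRACT_PER_PAGE = 0.0015            # DetectDocumentText
GPT5_TEXT_IN_PER_M = 1.25             # per million input tokens
GPT5_TEXT_OUT_PER_M = 10.0            # per million output tokens

def gpt5_text_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * GPT5_TEXT_IN_PER_M + output_tokens / 1e6 * GPT5_TEXT_OUT_PER_M

def textract_cost(pages: int) -> float:
    return pages * TEXTRACT_PER_PAGE  # e.g. 5,349 pages ≈ $8.02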

Configuration

All settings centralized in config/config.py:

  • Paths (data, outputs, manifests)
  • API keys (loaded from .env)
  • Model names
  • Dataset splits
  • Sample sizes

Never hardcode settings; always import them via from config import config.
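
A minimal sketch of what such a centralized config can look like is shown below; the field names and the python-dotenv loading are assumptions, so see config/config.py for the actual settings.

# Centralized-config sketch (field names are illustrative; .env loading via python-dotenv is an assumption).
import os
from dataclasses import dataclass
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()  # pulls OPENAI_API_KEY / AWS_* from .env into the environment

@dataclass(frozen=True)
class Config:
    PROJECT_ROOT: Path = Path(__file__).resolve().parent.parent
    OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
    TEXT_MODEL: str = "gpt-5"
    VLM_MODEL: str = "gpt-5"
    SPLIT: str = "validation"
    SAMPLE_SIZE: int = 5

config = Config()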

Development

Dependencies

Managed via Poetry (pyproject.toml):

  • Python 3.13.5
  • OpenAI SDK (GPT-5)
  • Boto3 (AWS Textract)
  • Datasets (Hugging Face)
  • Loguru (logging)

Logging

All scripts use the Loguru logger (no print statements):

from loguru import logger
logger.info("Processing {}", image_path)
logger.error("API call failed: {}", error)

Error Handling

  • Retry logic with exponential backoff (Textract API)
  • Truncated JSON recovery (LLM token limits)
  • Empty response detection
  • Graceful failure with detailed error messages
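
A sketch of the retry-with-exponential-backoff pattern listed above; the delays and the broad exception handling are assumptions, since the scripts catch specific Textract/OpenAI errors.

# Exponential-backoff sketch (illustrative; the scripts implement their own variant).
import time
from loguru import logger

def call_with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as error:
            if attempt == max_attempts:
                logger.error("API call failed after {} attempts: {}", attempt, error)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("Attempt {} failed ({}); retrying in {:.1f}s", attempt, error, delay)
            time.sleep(delay)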

Cost Estimates

Actual Costs (Full Validation Split - 5,349 Questions):

| Pipeline | Component | Unit Cost | Actual Cost | Notes |
|---|---|---|---|---|
| Textract | OCR | $0.0015/page | ~$8.00 | One-time cost, 5,349 pages |
| Textract | GPT-5 Text | $1.25/M in, $10/M out | ~$8.00 | 4.1M total tokens |
| Textract | Total | - | ~$16.00 | $2.99 per 1K questions |
| VLM | GPT-5 Vision | $2.50/M in, $10/M out | ~$24.42 | 6.4M total tokens |
| VLM | Total | - | ~$24.42 | $4.57 per 1K questions |

Cost-Accuracy Analysis:

  • VLM costs 53% more than Textract ($24.42 vs $16.00)
  • VLM provides 4.09% better F1 score (92.30% vs 88.20%)
  • Cost per F1 point: Textract ≈ $0.18 ($16.00 / 88.20), VLM ≈ $0.26 ($24.42 / 92.30); the extra 4.09 F1 points cost roughly $2 each
  • For production at scale (100K questions): Textract = $299, VLM = $457

Recommendation:

  • Use Textract for cost-sensitive, high-volume scenarios where 88% accuracy is acceptable
  • Use VLM for critical applications requiring maximum accuracy (90%+ EM)
  • Consider hybrid approach: Use VLM only for questions where Textract has low confidence

Next Steps

Completed ✅

  1. ✅ Full Textract pipeline (5,349 questions)
  2. ✅ Full VLM pipeline (5,349 questions)
  3. ✅ Comprehensive evaluation and comparison

Future Enhancements

  1. Question Type Analysis: Deep dive into performance by document type (handwritten, forms, diagrams)
  2. Hybrid Pipeline: Combine both approaches using confidence-based routing
  3. Error Analysis: Review cases where both pipelines failed
  4. Cost Optimization: Experiment with smaller vision models (GPT-4 Vision, Claude)
  5. Production Deployment: API endpoint with intelligent pipeline selection

Contributing

This is a collaborative project designed for distributed work:

  • Manifests (image/question JSONL) ensure dataset consistency
  • JSONL outputs are git-tracked for result sharing
  • Cost reports are local-only (git-ignored)
  • All paths relative to config.PROJECT_ROOT

License

MIT License
