Switch from pdfminer to paves to improve robustness and use multiple CPUs #4067

dhdaines · 2025-07-19T04:10:46Z

PLAYA-PDF is a fork of pdfminer.six with a focus on robustness and efficiency (full disclosure: it's my fork).

Unfortunately, PLAYA ain't a LAYout Analyzer, so it cannot replace pdfminer.six directly.

But, never fear, on top of PLAYA there is PAVÉS, which among other things implements the pdfminer.six layout analysis algorithms to the extent that it is mostly (but not entirely) a drop-in replacement. It is not actually faster than pdfminer.six for various reasons, but it does allow you to distribute PDF parsing across multiple CPUs, so that may help.
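To illustrate what "distribute PDF parsing across multiple CPUs" means in practice, here is a minimal standard-library sketch of the general pattern: per-page work is farmed out to a pool of worker processes and the results come back in page order. This is not the PAVÉS or PLAYA API; `map_pages` and `fake_extract` are hypothetical names, and `fake_extract` is only a stand-in for real per-page parsing.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Iterable, List, Type


def map_pages(
    extract: Callable[[int], str],
    page_numbers: Iterable[int],
    max_workers: int = 2,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> List[str]:
    """Run extract(page_number) for each page on a pool of workers.

    Hypothetical helper: results are returned in the same order as
    page_numbers, regardless of which worker finishes first.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(extract, page_numbers))


def fake_extract(page_number: int) -> str:
    # Stand-in for real per-page work (an assumption for this sketch);
    # a real worker would open the PDF and parse the requested page.
    return f"text of page {page_number}"
```

With a process pool, `extract` must be a picklable top-level function, and on platforms that spawn workers the call should sit under an `if __name__ == "__main__":` guard. Parallelism adds per-process startup and serialization overhead, so it pays off mainly on documents with many pages.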

Because I am a bit tired of having to pin versions of pdfminer.six due to bugs and parsing failures... here is a PR that does exactly that: PAVÉS dropped in to replace pdfminer.six. I also removed the pikepdf "repairing" code, since PAVÉS is in general much more robust, but perhaps you would like to put it back!

This allows us to also remove PDF repair and monkey patching.

David Huggins-Daines and others added 12 commits July 19, 2025 00:02

feat: switch from pdfminer to paves

a1b94cc

This allows us to also remove PDF repair and monkey patching.

fix: manually hack deps since who knows how they get generated

ac2b2e7

chore: black and ruff

6cd328d

fix(tests): repair no longer necessary

a5f00e5

fix: avoid importing pypdf just to count pages!

8ec45e0

fix: playa needs "" as default password not None

a489d29

fix: require playa-pdf 0.6.2 for colormap issue

318a954

fix: isort

2f87d89

fix(tests): playa/paves do not output (cid:N) droppings

e79845f

fix(tests): update indices since (cid:N) no longer occurs

afb1288

fix(tests): update markdown and html fixtures

ea36f10

fix(tests): fix missing or not missing newline for silly diff

e999734
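
Several of the test commits above exist because pdfminer.six writes a literal `(cid:N)` placeholder into the text for glyphs it cannot map to Unicode, while PLAYA/PAVÉS do not, which shifts text and element indices in the fixtures. A minimal sketch of how one might check a fixture for leftover placeholders; the helper name `find_cid_placeholders` is hypothetical and not part of this PR:

```python
import re
from typing import List

# Matches pdfminer.six-style placeholders such as "(cid:12)".
CID_PLACEHOLDER = re.compile(r"\(cid:\d+\)")


def find_cid_placeholders(text: str) -> List[str]:
    """Return every "(cid:N)" placeholder found in text, in order."""
    return CID_PLACEHOLDER.findall(text)
```

An empty result for a regenerated fixture suggests no placeholders remain in it.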
