Switch from pdfminer to paves to improve robustness and use multiple CPUs #4067

Open
wants to merge 12 commits into
base: main

dhdaines
Contributor

PLAYA-PDF is a fork of pdfminer.six with a focus on robustness and efficiency. (full disclosure: it's my fork of pdfminer.six)

Unfortunately, PLAYA ain't a LAYout Analyzer, so it cannot replace pdfminer.six directly.

But, never fear, on top of PLAYA there is PAVÉS, which among other things implements the pdfminer.six layout analysis algorithms to the extent that it is mostly (but not entirely) a drop-in replacement. It is not actually faster than pdfminer.six for various reasons, but it does allow you to distribute PDF parsing across multiple CPUs, so that may help.
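To illustrate the general idea of spreading per-page work across CPUs, here is a rough sketch using only the Python standard library. It does not use the PAVÉS or PLAYA API: `chunk_pages`, `analyze_chunk`, and `analyze_parallel` are hypothetical names, and `analyze_chunk` is a placeholder where real layout analysis would go.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import List, Type


def chunk_pages(n_pages: int, n_workers: int) -> List[range]:
    """Split page indices 0..n_pages-1 into contiguous, near-equal ranges."""
    if n_pages <= 0:
        return []
    # Never create more chunks than there are pages.
    n_chunks = max(1, min(n_workers, n_pages))
    size, extra = divmod(n_pages, n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional page.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def analyze_chunk(path: str, pages: range) -> List[str]:
    """Hypothetical stand-in for per-page layout analysis.

    A real worker would open the PDF at `path` itself and process only
    the pages in `pages`, so that no parser state crosses processes.
    """
    return [f"{path}: page {i + 1}" for i in pages]


def analyze_parallel(
    path: str,
    n_pages: int,
    n_workers: int = 2,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> List[str]:
    """Run analyze_chunk over page ranges in parallel, keeping page order."""
    chunks = chunk_pages(n_pages, n_workers)
    results: List[str] = []
    with executor_cls(max_workers=n_workers) as pool:
        futures = [pool.submit(analyze_chunk, path, chunk) for chunk in chunks]
        # Collect in submission order so the output stays in page order.
        for future in futures:
            results.extend(future.result())
    return results
```

Each worker receives only a path and a page range, both of which are cheap to pickle. With the default process pool, call `analyze_parallel` from under an `if __name__ == "__main__":` guard on platforms that spawn rather than fork worker processes.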

Because I am a bit tired of having to pin versions of pdfminer.six due to bugs and parsing failures, here is a PR that does exactly that: PAVÉS dropped in to replace pdfminer.six. I also removed the pikepdf "repairing" code, since in general PAVÉS is much more robust, but perhaps you would like to put it back!
