dhdaines
Contributor

PLAYA-PDF is a fork of pdfminer.six with a focus on robustness and efficiency. (full disclosure: it's my fork of pdfminer.six)

Unfortunately, PLAYA ain't a LAYout Analyzer - so it cannot replace pdfminer.six directly.

But, never fear, on top of PLAYA there is PAVÉS, which among other things implements the pdfminer.six layout analysis algorithms to the extent that it is mostly (but not entirely) a drop-in replacement. It is not actually faster than pdfminer.six for various reasons, but it does allow you to distribute PDF parsing across multiple CPUs, so that may help.

Because I am a bit tired of having to pin versions of pdfminer.six due to bugs and parsing failures... here is a PR with exactly that: PAVÉS dropped in to replace pdfminer.six. I also removed the pikepdf "repairing" code, since in general PAVÉS is much more robust, but perhaps you would like to put it back!
