Switch from pdfminer to paves to improve robustness and use multiple CPUs #4067
PLAYA-PDF is a fork of pdfminer.six with a focus on robustness and efficiency. (full disclosure: it's my fork of pdfminer.six)
Unfortunately, PLAYA ain't a LAYout Analyzer, so it cannot replace pdfminer.six directly.
But never fear: on top of PLAYA there is PAVÉS, which among other things implements the pdfminer.six layout analysis algorithms closely enough to be a mostly (but not entirely) drop-in replacement. It is not actually faster than pdfminer.six, for various reasons, but it does let you distribute PDF parsing across multiple CPUs, which may help.
Because I am a bit tired of having to pin versions of pdfminer.six due to bugs and parsing failures, here is a PR that does exactly that: PAVÉS dropped in to replace pdfminer.six. I also removed the pikepdf "repairing" code, since the new parser is generally much more robust, but perhaps you would like to put it back!
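
For anyone unfamiliar with what "distributing parsing across multiple CPUs" means in practice, here is a rough sketch of the general idea using only the Python standard library. This is not PAVÉS or PLAYA code and does not reflect their APIs: `map_pages` and `page_label` are hypothetical names made up for illustration, and `page_label` is a stand-in for whatever per-page parsing work would actually be done.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Type, TypeVar, Union

T = TypeVar("T")

ExecutorClass = Union[Type[ProcessPoolExecutor], Type[ThreadPoolExecutor]]


def page_label(page_number: int) -> str:
    """Hypothetical placeholder for real per-page work (parsing, layout analysis)."""
    return f"page {page_number}"


def map_pages(
    func: Callable[[int], T],
    page_numbers: Iterable[int],
    max_workers: Optional[int] = None,
    executor_cls: ExecutorClass = ProcessPoolExecutor,
) -> List[T]:
    """Apply func to each page number using a pool of workers.

    With the default ProcessPoolExecutor, each worker is a separate process,
    so CPU-bound work can run on several cores at once. Results come back
    in the same order as the input page numbers.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(func, page_numbers))
```

With a process pool, `func` and its return values must be picklable, and on platforms that spawn rather than fork worker processes the call should sit under an `if __name__ == "__main__":` guard. The per-process startup and pickling overhead is one reason parallel parsing only pays off on documents with enough pages.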