I very much like the new Page.getTextBlocks(images=True), thank you for adding that :). One of the most important uses for us is to calculate
- the % of the page area covered by image blocks, and
- the % of the page area covered by text blocks.
We need these derived values to decide (with a threshold) whether the page needs to be processed with OCR as part of our pipeline. We currently calculate the total union area of the block rectangles using packages with C bindings such as NumPy or Shapely. We would rather not have these dependencies, but doing this in pure Python could be much slower.
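To make the calculation concrete, here is a minimal pure-Python sketch of the union-area computation described above. It assumes each block rectangle is available as an `(x0, y0, x1, y1)` tuple (I am assuming these can be taken from the first four items of each block). The names `union_area` and `coverage` are hypothetical helpers for illustration, not part of PyMuPDF.

```python
def union_area(rects):
    """Area of the union of axis-aligned rectangles given as (x0, y0, x1, y1)."""
    # Ignore empty or inverted rectangles.
    rects = [r for r in rects if r[2] > r[0] and r[3] > r[1]]
    if not rects:
        return 0.0
    # Cut the plane into vertical strips at every distinct x coordinate.
    xs = sorted({x for r in rects for x in (r[0], r[2])})
    total = 0.0
    for left, right in zip(xs, xs[1:]):
        # y-intervals of all rectangles that span this strip, sorted by start.
        spans = sorted((r[1], r[3]) for r in rects if r[0] <= left and r[2] >= right)
        covered = 0.0
        cur_lo = cur_hi = None
        for lo, hi in spans:
            if cur_hi is None or lo > cur_hi:
                # Gap before this interval: close the merged interval so far.
                if cur_hi is not None:
                    covered += cur_hi - cur_lo
                cur_lo, cur_hi = lo, hi
            else:
                # Overlap: extend the merged interval.
                cur_hi = max(cur_hi, hi)
        if cur_hi is not None:
            covered += cur_hi - cur_lo
        total += covered * (right - left)
    return total


def coverage(rects, page_rect):
    """Fraction (0..1) of page_rect covered by the union of rects."""
    px0, py0, px1, py1 = page_rect
    page_area = (px1 - px0) * (py1 - py0)
    if page_area <= 0:
        return 0.0
    # Clip every rectangle to the page so overhanging blocks do not inflate the result.
    clipped = [
        (max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1))
        for x0, y0, x1, y1 in rects
    ]
    return union_area(clipped) / page_area
```

This runs in roughly O(n² log n) for n rectangles, which is fine for a few hundred blocks per page but is exactly the kind of thing a native implementation would do much faster.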
This is a general feature request: it would be nice to have metrics like this (with a highly performant implementation) as part of the page model.