@mjbryant

Previously, when an object parsed from an object stream referenced an indirect object, the parser resolved that reference to whatever version of the target was current at parse time. So if an object stream declares updated versions of two objects and the first references the second, the first object ends up pointing at the stale version of the second. For example, suppose the content of an object stream looks something like this (formatted for clarity, and with offsets that are probably incorrect):

1 0 2 40 
<</Count 3 /Kids [2 0 R] /Type /Pages>>
<</Count 3 /Kids [4 0 R 5 0 R 6 0 R] /Parent 1 0 R /Type /Pages>>

The object stream here defines both objects (1, 0) and (2, 0). If this is an incremental update for (2, 0), the previous code would make /Kids for (1, 0) point to the previous version of (2, 0). In several PDFs we found in the wild, this showed up as incorrect page counts: the PDFs had added pages in incremental updates, and the old /Pages objects with outdated /Kids were still being used.
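
To make the failure mode concrete, here is a toy model of it. It is only a sketch: plain dicts stand in for PDF objects, `Ref`, `load_eager`, and `load_two_pass` are hypothetical names rather than pdfrw's API, the old revision's `Count` of 2 is an assumed value, and `/Parent` is left out to keep it short. The two-pass loader is just one way to avoid the stale reference and is not necessarily how this branch does it.

```python
# Toy model only: dicts stand in for PDF objects. None of this is pdfrw's API.

class Ref:
    """Hypothetical stand-in for an indirect reference such as `2 0 R`."""
    def __init__(self, objnum, gen=0):
        self.key = (objnum, gen)


def resolve(value, table):
    """Replace Ref instances (also inside lists) with the objects in `table`."""
    if isinstance(value, Ref):
        return table[value.key]
    if isinstance(value, list):
        return [resolve(item, table) for item in value]
    return value


def load_eager(stream, table):
    """Resolve references while each object is parsed (the old behaviour)."""
    for key, raw in stream:
        table[key] = {name: resolve(value, table) for name, value in raw.items()}


def load_two_pass(stream, table):
    """Register every object in the stream first, then resolve references."""
    for key, raw in stream:
        table[key] = dict(raw)
    for key, _ in stream:
        obj = table[key]
        for name, value in list(obj.items()):
            obj[name] = resolve(value, table)


def previous_revision():
    """Object table before the update: (2, 0) lists only two kids.

    The page objects (4, 5, 6) are preloaded to keep the example short.
    """
    pages = {(n, 0): {"Type": "Page"} for n in (4, 5, 6)}
    old_kids = {"Type": "Pages", "Count": 2, "Kids": [pages[(4, 0)], pages[(5, 0)]]}
    old_root = {"Type": "Pages", "Count": 2, "Kids": [old_kids]}
    return {(1, 0): old_root, (2, 0): old_kids, **pages}


# The update from the example: new versions of (1, 0) and (2, 0), in that order.
update = [
    ((1, 0), {"Type": "Pages", "Count": 3, "Kids": [Ref(2)]}),
    ((2, 0), {"Type": "Pages", "Count": 3, "Kids": [Ref(4), Ref(5), Ref(6)]}),
]

eager = previous_revision()
load_eager(update, eager)
stale_count = eager[(1, 0)]["Kids"][0]["Count"]   # old (2, 0) is still referenced

fixed = previous_revision()
load_two_pass(update, fixed)
fresh_count = fixed[(1, 0)]["Kids"][0]["Count"]   # new (2, 0) is referenced
```

With eager resolution, (1, 0) keeps the two-page version of (2, 0) even though the table itself holds the three-page version, which is the mismatch behind the wrong page counts.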

I've run this branch against all pdfrw tests and they all still pass. This includes roundtrips for lots of existing PDFs, so I'm fairly confident that it's not going to break the status quo. It also fixes several of the PDFs that broke for us on pdfrw master.

* Load object streams starting from latest, and don't clobber later versions of objects from object streams

* Ignore pyenv's local file
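
The idea in the first commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this branch: `merge_object_streams` is a hypothetical helper, each object stream is modelled as a dict keyed by (object number, generation), and the strings are placeholders for parsed objects.

```python
def merge_object_streams(streams_oldest_first):
    """Merge object streams so the newest definition of each object wins.

    Hypothetical helper: each stream is a dict mapping (objnum, gen) to an
    already-parsed object, listed in the order the revisions were written.
    """
    objects = {}
    # Walk the streams newest first...
    for stream in reversed(streams_oldest_first):
        for key, obj in stream.items():
            # ...and never overwrite a version that a later stream provided.
            objects.setdefault(key, obj)
    return objects


original = {(1, 0): "pages v1", (2, 0): "kids v1"}
incremental_update = {(2, 0): "kids v2"}

merged = merge_object_streams([original, incremental_update])
```

Here `merged` keeps `"pages v1"` for (1, 0), which was never updated, and `"kids v2"` for (2, 0), because the older stream is visited last and cannot replace an entry that already exists.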
@pmaupin
Owner

pmaupin commented Jun 29, 2019 via email
