Use latest version of objects from object streams (#1) #169

mjbryant · 2019-06-29T21:06:37Z

Previously, when an object was parsed from an object stream and it referenced an indirect object, it'd pull the current version of that object at parse time. This means if you have an object stream that declares updated versions of two objects, the first of which references the second, the first object will have the incorrect old value for the second object. For example, if the content of an object stream is something like (formatted for clarity, and with probably incorrect offsets):

1 0 2 40 
<</Count 3 /Kids [2 0 R] /Type /Pages>>
<</Count 3 /Kids [4 0 R 5 0 R 6 0 R] /Parent 1 0 R /Type /Pages>>

The object stream here defines both objects (1, 0) and (2, 0). If this is an incremental update for (2, 0), the previous version of the code would make /Kids for (1, 0) the previous version of (2, 0). This was manifesting in several PDFs we found in the wild as incorrect page counts. The PDFs had added additional pages in incremental updates, and the old /Pages objects with incorrect kids were getting used.
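The failure mode described above can be reproduced with a small, self-contained model. This is only an illustrative sketch: `Ref`, `load_eager`, `load_deferred`, and the dictionary-based object table are hypothetical names invented here, not pdfrw's actual data structures, and the "old" versions of the two objects are assumed for the sake of the example. The deferred variant is one generic way to avoid the stale reference; it is not claimed to be the exact change made in this branch.

```python
# Hypothetical, simplified model of an object table; not pdfrw's real API.

class Ref:
    """Stand-in for an indirect reference such as `2 0 R`."""
    def __init__(self, objnum, gen=0):
        self.key = (objnum, gen)

# Assumed state before the incremental update: old (1, 0) and (2, 0).
old_table = {
    (1, 0): {"Type": "Pages", "Count": 1, "Kids": [Ref(2)]},
    (2, 0): {"Type": "Pages", "Count": 1, "Kids": ["page 4"]},
}

# Updated versions carried by the object stream, in stream order.
update = [
    ((1, 0), {"Type": "Pages", "Count": 3, "Kids": [Ref(2)]}),
    ((2, 0), {"Type": "Pages", "Count": 3,
              "Kids": ["page 4", "page 5", "page 6"]}),
]

def resolve_kids(obj, table):
    """Replace every Ref in /Kids with whatever the table holds right now."""
    return [table[k.key] if isinstance(k, Ref) else k for k in obj["Kids"]]

def load_eager(table, stream):
    """Resolve references while parsing: (1, 0) captures the old (2, 0)."""
    table = dict(table)
    for key, obj in stream:
        obj = dict(obj)
        obj["Kids"] = resolve_kids(obj, table)
        table[key] = obj
    return table

def load_deferred(table, stream):
    """Register every object from the stream first, then resolve references."""
    table = dict(table)
    for key, obj in stream:
        table[key] = dict(obj)
    for key, _ in stream:
        table[key]["Kids"] = resolve_kids(table[key], table)
    return table

eager = load_eager(old_table, update)
deferred = load_deferred(old_table, update)

# Number of pages reachable through object (1, 0) in each case.
eager_pages = len(eager[(1, 0)]["Kids"][0]["Kids"])
deferred_pages = len(deferred[(1, 0)]["Kids"][0]["Kids"])
```

With eager resolution, object (1, 0) ends up pointing at the old one-page version of (2, 0), which is the kind of wrong page count described above; resolving after the whole stream is registered yields the three-page version.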

I've run this branch against all pdfrw tests and they all still pass. This includes roundtrips for lots of existing PDFs, so I'm fairly confident that it's not going to break the status quo. It also fixes several of the PDFs that broke for us on pdfrw master.

* Load object streams starting from latest, and don't clobber later versions of objects from object streams
* Ignore pyenv's local file
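The "start from latest, don't clobber" strategy in the first bullet can be sketched roughly as follows. This is a minimal illustration under assumptions: `merge_object_streams` is a hypothetical helper, and each object stream is modeled as a plain dict keyed by (object number, generation), which is not how pdfrw/pdfreader.py represents them.

```python
def merge_object_streams(streams_oldest_first):
    """Build an object table in which the newest definition of each key wins.

    `streams_oldest_first` is a list of dicts mapping (objnum, gen) to a
    parsed object, ordered from the original file to the latest update.
    """
    table = {}
    # Walk the updates newest first ...
    for stream in reversed(streams_oldest_first):
        for key, obj in stream.items():
            # ... and never overwrite an entry that a newer stream already set.
            table.setdefault(key, obj)
    return table

original = {(1, 0): "pages v1", (2, 0): "kids v1", (3, 0): "catalog v1"}
update = {(2, 0): "kids v2"}

table = merge_object_streams([original, update])
```

Because the newest stream is visited first, any later attempt to register an older version of the same object is a no-op, so references resolved against `table` see the updated object.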

pmaupin · 2019-06-29T21:18:13Z

Thank you. I will have some time to look at this late next month.

On Sat, Jun 29, 2019 at 4:06 PM Michael Bryant ***@***.***> wrote:

Commit Summary
- Use latest version of objects from object streams (#1)

File Changes
- M .gitignore (6)
- M pdfrw/pdfreader.py (42)

Patch Links:
- https://github.com/pmaupin/pdfrw/pull/169.patch
- https://github.com/pmaupin/pdfrw/pull/169.diff


Use latest version of objects from object streams (#1)

938fe34

* Load object streams starting from latest, and don't clobber later versions of objects from object streams
* Ignore pyenv's local file

mjbryant and others added 4 commits July 2, 2019 08:24

Fix trailer update in alternate object resolution order (#2)

b1f336c

SUB-1024 - Remove token cache (#3)

e0b32cf

SP-3760 Guard against infinite loop page trees

491a361

Merge pull request #4 from plangrid/sp-3760-fix-infinite-loop

faa9c2d
