So you want to parse a PDF?
Suppose you have an appetite for tilting at windmills. Let's say you love pain. Well then, why not write a PDF parser today?
The ideal world: how the specification should work
Conceptually, parsing a PDF is fairly simple:
First, locate the version header comment at the start of the file
Next, locate the pointer to the cross-reference table
Then you can find all object offsets
Finally, locate and build the trailer dictionary, which points to the catalog dictionary
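The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the ideal case only, not a real parser: it assumes a classic plain-text cross-reference table (not a cross-reference stream), a single revision with no incremental updates, and a trailer simple enough to search with a regular expression. All function names here, including build_sample, are hypothetical helpers made up for the example.

```python
import re


def find_header_version(data: bytes) -> str:
    """Step 1: read the version from the '%PDF-x.y' header comment."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("no %PDF header at start of file")
    return match.group(1).decode("ascii")


def find_startxref(data: bytes) -> int:
    """Step 2: read the offset that follows the last 'startxref' keyword."""
    tail_start = max(0, len(data) - 1024)  # assumption: pointer is near the end
    idx = data.rfind(b"startxref", tail_start)
    if idx == -1:
        raise ValueError("no startxref keyword near end of file")
    match = re.match(rb"startxref\s+(\d+)", data[idx:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


def read_xref_table(data: bytes, xref_offset: int) -> dict[int, int]:
    """Step 3: map object number -> byte offset for in-use ('n') entries."""
    if not data.startswith(b"xref", xref_offset):
        raise ValueError("expected 'xref' keyword (streams are not handled)")
    lines = data[xref_offset:].splitlines()
    offsets: dict[int, int] = {}
    i = 1  # skip the 'xref' line itself
    while i < len(lines) and not lines[i].startswith(b"trailer"):
        parts = lines[i].split()
        if not parts:  # tolerate blank lines
            i += 1
            continue
        if len(parts) != 2:
            raise ValueError("malformed subsection header")
        first, count = int(parts[0]), int(parts[1])
        for n in range(count):
            offset, _generation, kind = lines[i + 1 + n].split()
            if kind == b"n":
                offsets[first + n] = int(offset)
        i += 1 + count
    return offsets


def find_root_object_number(data: bytes, xref_offset: int) -> int:
    """Step 4: pull the catalog reference (/Root N G R) out of the trailer."""
    match = re.search(rb"trailer.*?/Root\s+(\d+)\s+\d+\s+R", data[xref_offset:], re.S)
    if match is None:
        raise ValueError("trailer has no /Root entry")
    return int(match.group(1))


def build_sample() -> bytes:
    """Build a tiny, well-formed PDF skeleton with correct offsets for the demo."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for obj in objects:
        offsets.append(len(out))
        out += obj
    xref_offset = len(out)
    out += b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    pdf = build_sample()
    xref_at = find_startxref(pdf)
    table = read_xref_table(pdf, xref_at)
    root = find_root_object_number(pdf, xref_at)
    print(find_header_version(pdf), table, root)
```

The sketch only works because build_sample writes a file that follows the rules exactly. Each function raises as soon as the input departs from the ideal layout, which is the point where a real parser would need fallback strategies.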
Introduction to PDF object...
Read more at eliot-jones.com