pdf

PDF parser/serialiser written in golang

See the project

Building a PDF Parser/Serializer Library in Golang

Originally, I created this project as a small tool to bypass PDF DRM restrictions. However, as the project grew, it became difficult to tweak because I had written it as fast as possible. At some point I decided to take a step back and separate DRM removal from parsing/serialising (as I should’ve done from the start). The result is a library with no dependencies that (mostly) conforms to the PDF spec and doesn’t care about any security measures (passwords, encryption, etc.) that may be in place. It’ll be tripped up by some of the more obscure features of the PDF spec, but for my purposes, it works well enough.

This library isn’t a replacement for a full-featured PDF library like pdfcpu, but unlike some other libraries, it won’t refuse to parse PDFs with encrypted data streams.

PDF parsing

Parsing PDFs is a task notorious for its complexity. The PDF specification, spanning multiple revisions, is extensive. The format’s flexibility, which benefits end users, presents challenges to developers due to the numerous edge cases that must be addressed.

For instance, consider tokenizing a stream object and finding the endstream keyword that terminates it. In some instances, the bytes of this keyword also appear inside the stream data itself, so scanning for the first occurrence can cut the stream short. Parsing without consulting the stream dictionary, where the stream length is usually specified, can lead to unexpected complications.
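To make the endstream problem concrete, here is a minimal sketch in Go. It is not this library's code: the function names are made up for illustration, and it assumes the stream length has already been read from the stream dictionary and resolved to a plain integer (in real files it can be an indirect reference). It compares a naive keyword search with a length-based read.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// readStreamNaive (hypothetical helper) returns everything up to the first
// occurrence of "endstream". It is wrong whenever the payload itself
// contains those bytes.
func readStreamNaive(body []byte) []byte {
	i := bytes.Index(body, []byte("endstream"))
	if i < 0 {
		return nil
	}
	return body[:i]
}

// readStreamByLength (hypothetical helper) takes exactly length bytes, as
// given by the stream dictionary, and then checks that the endstream
// keyword follows the data.
func readStreamByLength(body []byte, length int) ([]byte, error) {
	if length < 0 || length > len(body) {
		return nil, errors.New("stream length out of range")
	}
	data, rest := body[:length], body[length:]
	// An end-of-line marker may sit between the data and the keyword.
	rest = bytes.TrimLeft(rest, "\r\n")
	if !bytes.HasPrefix(rest, []byte("endstream")) {
		return nil, errors.New("endstream keyword not found after stream data")
	}
	return data, nil
}

func main() {
	// body is what follows the "stream" keyword and its end-of-line marker.
	payload := []byte("the text endstream appears here")
	body := append(append([]byte{}, payload...), []byte("\nendstream\nendobj\n")...)

	fmt.Printf("naive:  %q\n", readStreamNaive(body))

	data, err := readStreamByLength(body, len(payload))
	fmt.Printf("length: %q (err: %v)\n", data, err)
}
```

The naive version stops at the keyword inside the payload, while the length-based version returns the full data. A tolerant parser would probably still want a fallback for files whose declared length is wrong, which this sketch does not attempt.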

PDF serialisation

Serialisation is somewhat easier than parsing: once the PDF is in a data structure (e.g. an AST), it’s just a matter of traversing the nodes and writing them out to a buffer or file. However, the PDF spec allows for “incremental updates”, which change the document structure by appending more body, xref and trailer sections. During parsing, you won’t realise this until you reach the appended sections. You therefore need to decide whether to reproduce the document perfectly or to compress it all into a single “simple” document.
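The "traverse the nodes and write them out" idea can be sketched roughly as follows. The node types here (Integer, Name, Array, Dict) and the Serialise function are hypothetical and cover only a small subset of PDF object types; they are not the library's actual API. Dictionary keys are sorted only so that the output is deterministic, since Go map iteration order is not.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Object is a hypothetical AST node that knows how to write itself out.
type Object interface {
	serialise(buf *bytes.Buffer)
}

type (
	Integer int
	Name    string
	Array   []Object
	Dict    map[Name]Object
)

func (i Integer) serialise(buf *bytes.Buffer) { fmt.Fprintf(buf, "%d", int(i)) }

// Escaping of special characters in names is omitted in this sketch.
func (n Name) serialise(buf *bytes.Buffer) { buf.WriteString("/" + string(n)) }

func (a Array) serialise(buf *bytes.Buffer) {
	buf.WriteByte('[')
	for i, elem := range a {
		if i > 0 {
			buf.WriteByte(' ')
		}
		elem.serialise(buf)
	}
	buf.WriteByte(']')
}

func (d Dict) serialise(buf *bytes.Buffer) {
	// Sort the keys so the same dictionary always serialises the same way.
	keys := make([]string, 0, len(d))
	for k := range d {
		keys = append(keys, string(k))
	}
	sort.Strings(keys)

	buf.WriteString("<<")
	for _, k := range keys {
		buf.WriteByte(' ')
		Name(k).serialise(buf)
		buf.WriteByte(' ')
		d[Name(k)].serialise(buf)
	}
	buf.WriteString(" >>")
}

// Serialise walks the tree rooted at o and returns its textual form.
func Serialise(o Object) string {
	var buf bytes.Buffer
	o.serialise(&buf)
	return buf.String()
}

func main() {
	pages := Dict{
		"Type":     Name("Pages"),
		"Count":    Integer(1),
		"MediaBox": Array{Integer(0), Integer(0), Integer(612), Integer(792)},
	}
	fmt.Println(Serialise(pages))
}
```

A full serialiser would also have to emit the header, indirect object wrappers, the xref table and the trailer, and this is where the choice between reproducing incremental updates and flattening them comes in.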

Future plans

I’d like to expand further in the parser/compiler space and write a parser combinator library in Golang to make it easier to write parsers for other formats (and perhaps rewrite this library to use it).