Working with PDF Streams

A pikepdf.Stream object works like a PDF dictionary with some encoded bytes attached. The dictionary is metadata that describes how the stream is encoded. PDF can, and regularly does, use a variety of encoding filters. A stream can be encoded with one or more filters. Images are a type of stream object.
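To make the idea of a dictionary plus filtered bytes concrete, here is a minimal sketch that uses only the Python standard library, not pikepdf. It models a stream as a plain dict and a bytes object, and undoes two common filters in the order the dictionary lists them. The helper names `ascii_hex_decode` and `decode_stream` are hypothetical and exist only for this illustration; real PDF libraries support many more filters and parameters.

```python
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """Undo /ASCIIHexDecode: hex digits, optional whitespace, '>' ends the data."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:
        # An odd number of digits is treated as if a trailing 0 were present.
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


# Hypothetical lookup table covering only the two filters used in this sketch.
DECODERS = {
    "/FlateDecode": zlib.decompress,
    "/ASCIIHexDecode": ascii_hex_decode,
}


def decode_stream(dictionary: dict, raw: bytes) -> bytes:
    """Apply each filter named in /Filter, in listed order, to the raw bytes."""
    filters = dictionary.get("/Filter", [])
    if isinstance(filters, str):
        filters = [filters]  # a single filter may be given without a list
    data = raw
    for name in filters:
        data = DECODERS[name](data)
    return data


payload = b"BT /F1 12 Tf (Hello) Tj ET"
# Encoding runs in the reverse order: compress first, then hex-encode.
raw = zlib.compress(payload).hex().encode("ascii") + b">"
stream_dict = {"/Filter": ["/ASCIIHexDecode", "/FlateDecode"]}

decoded = decode_stream(stream_dict, raw)
```

The dictionary tells a reader which filters to undo; the bytes alone are not self-describing.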

Most of the interesting content in a PDF (images and content streams) is inside page objects.

Because the PDF specification unfortunately defines several terms that involve the word stream, let’s attempt to clarify:

stream object
A PDF object that contains binary data and a metadata dictionary that describes it, represented as pikepdf.Stream. In HTML this is equivalent to an <img> with inline image data.
object stream
A stream object (not a typo; an object stream really is a type of stream object) that contains a number of other PDF objects, grouped together for better compression. pikepdf has an option to save PDFs with this feature enabled. Otherwise, this is just a detail about how PDF files are encoded.
content stream
A stream object that contains some instructions to draw graphics and text on a page, or inside a Form XObject. In HTML this is equivalent to the HTML file itself. Content streams do not cross pages.
Form XObject
A group of images, text and drawing commands that can be rendered elsewhere in a PDF as a group. This is often used when a group of objects is needed at different scales or on multiple pages. In HTML this is like an <svg>.

Reading stream objects

Fortunately, pikepdf.Stream.read_bytes() will apply all filters and return the decoded, uncompressed bytes, or raise an error if this is not possible. pikepdf.Stream.read_raw_bytes() provides access to the compressed bytes exactly as they are stored in the file.

For example, we can read the XMP metadata, however it is encoded, from a PDF with the following:

>>> xmp = example.root.Metadata.read_bytes()
>>> type(xmp)
bytes
>>> print(xmp.decode())
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters esc="CRLF"?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='' xmlns:pdf='http://ns.adobe.com/pdf/1.3/' pdf:Producer='GPL Ghostscript 9.21'/>
<rdf:Description rdf:about='' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2017-09-11T13:27:48-07:00</xmp:ModifyDate>
<xmp:CreateDate>2017-09-11T13:27:48-07:00</xmp:CreateDate>
<xmp:CreatorTool>ocrmypdf 5.3.3 / Tesseract OCR-PDF 3.05.01</xmp:CreatorTool></rdf:Description>
<rdf:Description rdf:about='' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='uuid:39bce560-cf4c-11f2-0000-61a4fb67ccb7'/>
<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Untitled</rdf:li></rdf:Alt></dc:title></rdf:Description>
<rdf:Description rdf:about='' xmlns:pdfaid='http://www.aiim.org/pdfa/ns/id/' pdfaid:part='2' pdfaid:conformance='B'/></rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>

That lets us see a few facts about this file. It was created by OCRmyPDF and Tesseract OCR’s PDF generator. Ghostscript was used to convert it to PDF/A (indicated by the xmlns:pdfaid namespace).

Of course, it would be far more convenient to use the pikepdf PDF Metadata interface than to manually parse this XML object. It just so happens this is a human-readable object found in most PDFs.

Parsing content streams

When a stream object is a content stream, you probably want to parse the content stream to interpret it.

pikepdf provides a C++ optimized content stream parser.

>>> import pikepdf
>>> pdf = pikepdf.open(input_pdf)
>>> page = pdf.pages[0]
>>> for operands, command in pikepdf.parse_content_stream(page):
...     print(command)