<h1>Optimizing XSane's scanned PDFs (also: PDF internals)</h1>
<p>By rohieb. Published 2013-11-17, updated 2023-06-09. Licensed under CC-BY-SA 3.0. Original post: <a href="https://rohieb.name/blag/post/optimizing-xsane-s-scanned-pdfs/">https://rohieb.name/blag/post/optimizing-xsane-s-scanned-pdfs/</a></p>
<h2 id="problem">Problem</h2>
<p>I use <a href="http://www.xsane.org/" title="XSane homepage">XSane</a> to scan documents for my digital archive. I want them to be
in PDF format and have a reasonable resolution (at least 200 dpi, so I
can try OCRing them afterwards). However, the PDFs created by XSane’s multipage
mode are too large, about 250 MB for a 20-page document scanned at
200 dpi.</p>
<table class="img"><caption>XSane’s Multipage mode</caption><tr><td><a href="https://rohieb.name/blag/post/optimizing-xsane-s-scanned-pdfs/xsane-multipage-mode.png"><img src="https://rohieb.name/blag/post/optimizing-xsane-s-scanned-pdfs/x200-xsane-multipage-mode.png" width="223" height="200" class="img" /></a></td></tr></table>
<h2 id="firstnon-optimalsolution">First (non-optimal) solution</h2>
<p>At first, I tried to optimize the PDF using <a href="http://ghostscript.com" title="Ghostscript homepage">Ghostscript</a>. I
<a href="https://rohieb.name/blag/post/use-ghostscript-to-convert-pdf-files/">already wrote</a> about how Ghostscript’s
<code>-dPDFSETTINGS</code> option can be used to shrink PDFs by rendering the pictures at
a lower resolution. In fact, there are <a href="http://milan.kupcevic.net/ghostscript-ps-pdf/#refs" title="Ghostscript PDF Reference & Tips">multiple rendering modes</a>
(<code>screen</code> for 72 dpi, <code>ebook</code> for 150 dpi, <code>printer</code> for 300 dpi,
and <code>prepress</code> for color-preserving 300 dpi), but they are pre-defined, and
for my 200 dpi images, <code>ebook</code> was not enough (I would lose resolution),
while <code>printer</code> was too high and would only enlarge the PDF.</p>
<h2 id="interlude:pdfinternals">Interlude: PDF Internals</h2>
<p>The best thing to do was to find out how the images were embedded in the PDF.
Since most PDF files are also partly human-readable, I opened my file with vim.
(Also, I was surprised that <a href="https://rohieb.name/blag/tag/file_formats/vim-syntax-highlighting.png">vim has syntax highlighting for
PDF</a>.) Before we continue, I'll give a short
introduction to the PDF file format (for the long version, see <a href="http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf" title="Adobe Portable Document Format, Version 1.4">Adobe’s PDF
reference</a>).</p>
<h3 id="buildingblocks">Building Blocks</h3>
<p>Every PDF file starts with the <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files" title="Wikipedia: Magic numbers in files">magic string</a> that identifies the version
of the standard which the document conforms to, like <code>%PDF-1.4</code>. After that, a
PDF document is made up of the following objects:</p>
<dl>
<dt>Boolean values</dt>
<dd>
<code>true</code> and <code>false</code>
</dd>
<dt>Integers and floating-point numbers</dt>
<dd>
for example, <code>1337</code>, <code>-23.42</code> and <code>.1415</code>
</dd>
<dt>Strings</dt>
<dd>
<ul>
<li>interpreted as literal characters when enclosed in parentheses: <code>(This
is a string.)</code> These can contain escaped characters, particularly
escaped closing parentheses and control characters: <code>(This string contains a
literal \) and some\n newlines.\n)</code>.</li>
<li>interpreted as hexadecimal data when enclosed in angle brackets:
<code><53 61 6D 70 6C 65></code> equals <code>(Sample)</code>.</li>
</ul>
</dd>
<dt>Names</dt>
<dd>
starting with a forward slash, like <code>/Type</code>. You can think of them like
identifiers in programming languages.
</dd>
<dt>Arrays</dt>
<dd>
enclosed in square brackets:
<code>[ -1 4 6 (A String) /AName [ (strings in arrays in arrays!) ] ]</code>
</dd>
<dt>Dictionaries</dt>
<dd>
key-value stores, which are enclosed in double angle brackets. The key must
be a name, the value can be any object. Keys and values alternate,
beginning with the first key:
<code><< /FirstKey (First Value) /SecondKey 3.14 /ThirdKey /ANameAsValue >></code>
Usually, the first key is <code>/Type</code> and defines what the dictionary actually
describes.
</dd>
<dt>Stream Objects</dt>
<dd>
a collection of bytes. In contrast to strings, stream objects are usually
used for large amounts of data which may not be read entirely, while strings
are always read as a whole. For example, streams can be used to embed images
or metadata.
</dd>
<dd>
Streams consist of a dictionary, followed by the keyword <code>stream</code>, the raw
content of the stream, and the keyword <code>endstream</code>. The dictionary describes
the stream’s length and the filters that have been applied to it, which
basically define the encoding the data is stored in. For example, data
streams can be compressed with various algorithms.
</dd>
<dt>The Null Object</dt>
<dd>
Represented by the literal string <code>null</code>.
</dd>
<dt>Indirect Objects</dt>
<dd>
Every object in a PDF document can also be stored as an indirect object,
which means that it is given a label and can be used multiple times in the
document. The label consists of two numbers, a positive <em>object number</em>
(which makes the object unique) and a non-negative <em>generation number</em>
(which allows objects to be updated incrementally by appending to the file).
</dd>
<dd>
Indirect objects are defined by their object number, followed by their
generation number, the keyword <code>obj</code>, the contents of the object, and the
keyword <code>endobj</code>. Example: <code>1 0 obj (I'm an object!) endobj</code> defines the
indirect object with object number 1 and generation number 0, which consists
only of the string “I'm an object!”. Likewise, more complex data structures
can be labeled with indirect objects.
</dd>
<dd>
Referencing an indirect object works by giving the object and generation
number, followed by an uppercase R: <code>1 0 R</code> references the object created
above. References can be used anywhere a (direct) object could be used.
</dd>
</dl>
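<p>To make two of these building blocks a bit more tangible, here is a minimal
Python sketch (my own illustration, not a real PDF parser). It decodes the hex
string from above and picks apart the example indirect object with a naive
regular expression. The names <code>INDIRECT_OBJECT</code> and <code>parse_indirect_object</code> are
made up for this example, and the regex assumes that the word <code>endobj</code> does
not occur inside the object body, which real files with binary streams do not guarantee.</p>

```python
import re

# Hex strings: whitespace between the digits is ignored, so the example
# from the list decodes to the same bytes as the literal string (Sample).
hex_string = "<53 61 6D 70 6C 65>"
decoded = bytes.fromhex(hex_string.strip("<>"))

# Indirect objects: "<object number> <generation number> obj ... endobj".
# Naive assumption: "endobj" never appears inside the body itself.
INDIRECT_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*(.*?)\s*endobj", re.DOTALL)


def parse_indirect_object(data):
    """Return (object number, generation number, raw body), or None."""
    match = INDIRECT_OBJECT.search(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


example = parse_indirect_object(b"1 0 obj (I'm an object!) endobj")
print(decoded)
print(example)
```

<p>The body comes back as raw bytes, still including the parentheses, because
the sketch does not interpret the object any further.</p>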
<p>Using these objects, a PDF document builds up a tree structure, starting from the
root object, which is a dictionary with the value <code>/Catalog</code> assigned to the key
<code>/Type</code> (in the files examined here, it has the object number 1). The other values of this dictionary
point to the objects describing the outlines and pages of the document, which in
turn reference other objects describing single pages, which point to objects
describing drawing operations or text blocks, etc.</p>
<h3 id="dissectingthepdfscreatedbyxsane">Dissecting the PDFs created by XSane</h3>
<p>Now that we know what a PDF document looks like, we can go back to our initial
problem and try to find out why my PDF file was so huge. I will walk you through
the PDF object by object.</p>
<div class="highlight-pdf"><pre class="hl"><span class="hl ppc">%PDF-1.4</span>
<span class="hl kwa">1 0 obj</span>
<span class="hl kwb"><<</span> <span class="hl kwc">/Type /Catalog</span>
<span class="hl kwc">/Outlines</span> <span class="hl kwa">2 0 R</span>
<span class="hl kwc">/Pages</span> <span class="hl kwa">3 0 R</span>
<span class="hl kwb">>></span>
<span class="hl kwa">endobj</span>
</pre></div>
<p>This is just the magic string declaring the document as PDF-1.4, and the root
object with object number 1, which references objects number 2 for Outlines and
number 3 for Pages. We're not interested in outlines, let's look at the pages.</p>
<div class="highlight-pdf"><pre class="hl"><span class="hl kwa">3 0 obj</span>
<span class="hl kwb"><<</span> <span class="hl kwc">/Type /Pages</span>
<span class="hl kwc">/Kids</span> <span class="hl kwb">[</span>
<span class="hl kwa">6 0 R</span>
<span class="hl kwa">8 0 R</span>
<span class="hl kwa">10 0 R</span>
<span class="hl kwa">12 0 R</span>
<span class="hl kwb">]</span>
<span class="hl kwc">/Count</span> <span class="hl num">4</span>
<span class="hl kwb">>></span>
<span class="hl kwa">endobj</span>
</pre></div>
<p>OK, apparently this document has four pages, which are referenced by objects
number 6, 8, 10 and 12. This makes sense, since I scanned four pages ;-)</p>
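<p>Following references by hand gets tedious quickly, so here is a rough Python
sketch of the same walk from the catalog to the page list. It works on the two
objects shown above (with the syntax highlighting stripped) and uses plain
regular expressions. The helper names <code>index_objects</code>, <code>single_reference</code> and
<code>kids</code> are hypothetical, and the approach assumes uncompressed, text-only
objects like the ones XSane writes.</p>

```python
import re

# The catalog and the page tree node from above, as plain bytes.
PDF_SNIPPET = b"""\
1 0 obj
<< /Type /Catalog
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj
3 0 obj
<< /Type /Pages
/Kids [
6 0 R
8 0 R
10 0 R
12 0 R
]
/Count 4
>>
endobj
"""

OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*(.*?)\s*endobj", re.DOTALL)
REFERENCE_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def index_objects(data):
    """Map (object number, generation number) to the raw object body."""
    return {(int(n), int(g)): body for n, g, body in OBJECT_RE.findall(data)}


def single_reference(body, key):
    """Return the reference stored under /key, e.g. /Pages 3 0 R -> (3, 0)."""
    match = re.search(rb"/" + key + rb"\s+(\d+)\s+(\d+)\s+R\b", body)
    return int(match.group(1)), int(match.group(2))


def kids(body):
    """Return all references listed in the /Kids array of a /Pages node."""
    array = re.search(rb"/Kids\s*\[(.*?)\]", body, re.DOTALL).group(1)
    return [(int(n), int(g)) for n, g in REFERENCE_RE.findall(array)]


objects = index_objects(PDF_SNIPPET)
pages_ref = single_reference(objects[(1, 0)], b"Pages")
page_refs = kids(objects[pages_ref])
print(pages_ref)
print(page_refs)
```

<p>The result is the same list of page objects we just read off by eye, one
(object number, generation number) pair per scanned page.</p>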
<p>Let's start with object number 6:</p>
<div class="highlight-pdf"><pre class="hl"><span class="hl kwa">6 0 obj</span>
<span class="hl kwb"><<</span> <span class="hl kwc">/Type /Page</span>
<span class="hl kwc">/Parent</span> <span class="hl kwa">3 0 R</span>
<span class="hl kwc">/MediaBox</span> <span class="hl kwb">[</span><span class="hl num">0 0 596 842</span><span class="hl kwb">]</span>
<span class="hl kwc">/Contents</span> <span class="hl kwa">7 0 R</span>
<span class="hl kwc">/Resources</span> <span class="hl kwb"><<</span> <span class="hl kwc">/ProcSet</span> <span class="hl kwa">8 0 R</span> <span class="hl kwb">>></span>
<span class="hl kwb">>></span>
<span class="hl kwa">endobj</span>
</pre></div>
<p>We see that object number 6 is a page object, and the actual content is in
object number 7. More redirection, yay!</p>
<div class="highlight-pdf"><pre class="hl"><span class="hl kwa">7 0 obj</span>
<span class="hl kwb"><<</span> <span class="hl kwc">/Length</span> <span class="hl num">2678332</span> <span class="hl kwb">>></span>
<span class="hl str">stream</span>
<span class="hl str">q</span>
<span class="hl str">1 0 0 1 0 0 cm</span>
<span class="hl str">1.000000 0.000000 -0.000000 1.000000 0 0 cm</span>
<span class="hl str">595.080017 0 0 841.679993 0 0 cm</span>
<span class="hl str">BI</span>
<span class="hl str"> /W 1653</span>
<span class="hl str"> /H 2338</span>
<span class="hl str"> /CS /G</span>
<span class="hl str"> /BPC 8</span>
<span class="hl str"> /F /FlateDecode</span>
<span class="hl str">ID</span>
<span class="hl str">x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]</span>
<span class="hl str">EI</span>
<span class="hl str">Q</span>
<span class="hl str">endstream</span>
<span class="hl kwa">endobj</span>
</pre></div>
<p>Aha, here is where the magic happens. Object number 7 is a stream object of
2,678,332 bytes (about 2.6 MB) and contains drawing operations! After skipping
around a bit in Adobe’s PDF reference (chapters 3 and 4), here is the annotated
version of the stream content:</p>
<div class="highlight-pdf"><pre class="hl">q <span class="hl slc">% Save drawing context</span>
<span class="hl num">1 0 0 1 0 0</span> cm <span class="hl slc">% Set up coordinate space for image</span>
<span class="hl num">1.000000 0.000000 -0.000000 1.000000 0 0</span> cm
<span class="hl num">595.080017 0 0 841.679993 0 0</span> cm
BI <span class="hl slc">% Begin Image</span>
<span class="hl kwc">/W</span> <span class="hl num">1653</span> <span class="hl slc">% Image width is 1653 pixels</span>
<span class="hl kwc">/H</span> <span class="hl num">2338</span> <span class="hl slc">% Image height is 2338 pixels</span>
<span class="hl kwc">/CS /G</span> <span class="hl slc">% Color space is Gray</span>
<span class="hl kwc">/BPC</span> <span class="hl num">8</span> <span class="hl slc">% 8 bits per component (one gray value per pixel)</span>
<span class="hl kwc">/F /FlateDecode</span> <span class="hl slc">% Filters: data is Deflate-compressed</span>
ID <span class="hl slc">% Image Data follows:</span>
x$¼<span class="hl kwb">[</span>$;¾åù!fú¥¡aæátq<span class="hl num">.4</span>§ <span class="hl kwb">[</span> ...byte stream shortened... <span class="hl kwb">]</span>
EI <span class="hl slc">% End Image</span>
Q <span class="hl slc">% Restore drawing context</span>
</pre></div>
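<p>The numbers in this stream can be cross-checked with a bit of arithmetic.
The following Python sketch uses only values copied from object 7 and assumes
the default PDF unit of 1/72 inch. Note that the stream length also covers the
handful of operators around the image, so the ratio is a slight overestimate.</p>

```python
# Values copied from object 7 above.
width_px, height_px = 1653, 2338                   # /W and /H
bits_per_component = 8                             # /BPC, one gray component per pixel
stream_length = 2678332                            # /Length of the whole content stream
scale_x_pt, scale_y_pt = 595.080017, 841.679993    # last cm operator

POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch

# Size of the image if it were stored without any compression.
raw_bytes = width_px * height_px * bits_per_component // 8

# How much of that is left after Deflate compression.
ratio = stream_length / raw_bytes

# The image is stretched to scale_x_pt by scale_y_pt points on the page,
# which gives the effective resolution in dots per inch.
dpi_x = width_px / (scale_x_pt / POINTS_PER_INCH)
dpi_y = height_px / (scale_y_pt / POINTS_PER_INCH)

print(f"uncompressed: {raw_bytes} bytes")
print(f"compressed to {ratio:.0%} of the raw size")
print(f"effective resolution: {dpi_x:.1f} x {dpi_y:.1f} dpi")
```

<p>In other words, the page is drawn at the 200 dpi I scanned it with, and
Deflate only shaves off roughly a third of the raw pixel data.</p>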
<p>So now we know why the PDF was so huge: the line <code>/F /FlateDecode</code> tells us that
the image data is stored losslessly with <a href="https://en.wikipedia.org/wiki/DEFLATE" title="Wikipedia: DEFLATE algorithm">Deflate</a> compression (this is
basically what PNG uses). However, scanned images, as well as photographed
pictures, have the tendency to become very big when stored losslessly, due to the
fact that image sensors always add noise from the universe and lossless
compression also has to take account of this noise. In contrast, lossy
compression like JPEG, which uses <a href="http://en.wikipedia.org/wiki/Discrete_cosine_transform" title="Wikipedia: Discrete cosine transform">discrete cosine transform</a>, only has to
approximate the image (and therefore the noise from the sensor) to a certain
degree, therefore reducing the space needed to save the image. And the PDF
standard also allows image data to be DCT-compressed, by adding <code>/DCTDecode</code> to
the filters.</p>
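<p>The effect of noise on lossless compression is easy to reproduce. The
following Python sketch is a toy experiment, not a model of a real scanner: it
builds a small synthetic gray gradient, adds a few gray levels of random noise
to a copy, and compresses both with <code>zlib</code> (which implements Deflate). The
image size, the noise amplitude and the seed are arbitrary assumptions.</p>

```python
import random
import zlib

WIDTH, HEIGHT = 200, 200
rng = random.Random(42)  # fixed seed so the run is repeatable

# A smooth horizontal gray gradient, one byte per pixel.
clean = bytes(x * 255 // (WIDTH - 1) for _ in range(HEIGHT) for x in range(WIDTH))

# The same gradient with up to four gray levels of random noise per pixel,
# clamped to the valid range 0..255.
noisy = bytes(min(255, max(0, value + rng.randint(-4, 4))) for value in clean)

clean_size = len(zlib.compress(clean, 9))
noisy_size = len(zlib.compress(noisy, 9))

print(f"raw: {len(clean)} bytes")
print(f"clean gradient, deflated: {clean_size} bytes")
print(f"noisy gradient, deflated: {noisy_size} bytes")
```

<p>Both images look practically the same to the eye, but the noisy one needs
many times more space after compression, because every random pixel value has
to be reproduced exactly.</p>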
<h2 id="secondsolution:useabettercompressionalgorithm">Second solution: use a (better) compression algorithm</h2>
<p>Now that I knew where the problem was, I could try to create PDFs with DCT
compression. I still had the original, uncompressed <a href="https://en.wikipedia.org/wiki/Netpbm_format" title="Wikipedia: Netpbm format">PNM</a> files that fell out
of XSane’s multipage mode (just look in the multipage project folder), so I
started to play around a bit with <a href="http://www.imagemagick.org" title="ImageMagick homepage">ImageMagick’s</a> <code>convert</code> tool, which can
also convert images to PDF.</p>
<h3 id="convertingpnmtopdf">Converting PNM to PDF</h3>
<p>First, I tried converting the uncompressed PNM to PDF:</p>
<pre><code>$ convert image*.pnm document.pdf
</code></pre>
<p><code>convert</code> generally takes parameters of the form <code>inputfile outputfile</code>, but it
also allows us to specify more than one input file (which is not really
documented in the <a href="http://manpages.debian.net/cgi-bin/man.cgi?query=convert" title="man convert(1)">man page</a>). In that case it tries to create
multi-page documents, if possible. With PDF as output format, this results in
one page per input file.</p>
<p>The embedded image objects looked somewhat like the following:</p>
<div class="highlight-pdf"><pre class="hl"><span class="hl kwa">8 0 obj</span>
<span class="hl kwb"><<</span>
<span class="hl kwc">/Type /XObject</span>
<span class="hl kwc">/Subtype /Image</span>
<span class="hl kwc">/Name /Im0</span>
<span class="hl kwc">/Filter</span> <span class="hl kwb">[</span> <span class="hl kwc">/RunLengthDecode</span> <span class="hl kwb">]</span>
<span class="hl kwc">/Width</span> <span class="hl num">1653</span>
<span class="hl kwc">/Height</span> <span class="hl num">2338</span>
<span class="hl kwc">/ColorSpace</span> <span class="hl kwa">10 0 R</span>
<span class="hl kwc">/BitsPerComponent</span> <span class="hl num">8</span>
<span class="hl kwc">/Length</span> <span class="hl kwa">9 0 R</span>
<span class="hl kwb">>></span>
<span class="hl str">stream</span>
<span class="hl str">% [ raw byte data ]</span>
<span class="hl str">endstream</span>
</pre></div>
<p>The filter <code>/RunLengthDecode</code> indicates that the stream data is compressed with
<a href="https://en.wikipedia.org/wiki/Run-length_encoding" title="Wikipedia: Run-length encoding">Run-length encoding</a>, another simple lossless compression. Not what I
wanted. (Apart from that, <code>convert</code> embeds images as XObjects, but these are not
much different from the inline images described above.)</p>
<h3 id="convertingpnmtojpgthentopdf">Converting PNM to JPG, then to PDF</h3>
<p>Next, I converted the PNMs to JPG, then to PDF.</p>
<pre><code>$ convert image*.pnm image.jpg
$ convert image*jpg document.pdf
</code></pre>
<p>(The first command creates the output files <code>image-1.jpg</code>, <code>image-2.jpg</code>, etc.,
since JPG does not support multiple pages in one file.)</p>
<p>When looking at the PDF, we see that we now have DCT-compressed images inside
the PDF:</p>
<div class="highlight-pdf"><pre class="hl"><span class="hl kwa">8 0 obj</span>
<span class="hl kwb"><<</span>
<span class="hl kwc">/Type /XObject</span>
<span class="hl kwc">/Subtype /Image</span>
<span class="hl kwc">/Name /Im0</span>
<span class="hl kwc">/Filter</span> <span class="hl kwb">[</span> <span class="hl kwc">/DCTDecode</span> <span class="hl kwb">]</span>
<span class="hl kwc">/Width</span> <span class="hl num">1653</span>
<span class="hl kwc">/Height</span> <span class="hl num">2338</span>
<span class="hl kwc">/ColorSpace</span> <span class="hl kwa">10 0 R</span>
<span class="hl kwc">/BitsPerComponent</span> <span class="hl num">8</span>
<span class="hl kwc">/Length</span> <span class="hl kwa">9 0 R</span>
<span class="hl kwb">>></span>
<span class="hl str">stream</span>
<span class="hl str">% [ raw byte data ]</span>
<span class="hl str">endstream</span>
</pre></div>
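<p>Instead of opening every PDF in vim, the filters can also be counted with a
few lines of Python. This is only a rough sketch: the function name
<code>count_filters</code> is made up, and the regular expression simply searches the raw
bytes for <code>/Filter</code> (or the inline image abbreviation <code>/F</code>) followed by a
name ending in <code>Decode</code>, so binary stream data could in principle produce
false positives.</p>

```python
import re
from collections import Counter

# Matches "/Filter /Name", "/Filter [ /Name ]" and the inline form "/F /Name".
FILTER_RE = re.compile(rb"/(?:Filter|F)\s*\[?\s*/(\w+Decode)\b")


def count_filters(pdf_bytes):
    """Count how often each stream filter name occurs in the raw PDF bytes."""
    names = (m.group(1).decode("ascii") for m in FILTER_RE.finditer(pdf_bytes))
    return Counter(names)


# Two fragments in the style of the objects shown in this article.
sample = b"""\
<< /Type /XObject /Subtype /Image /Filter [ /DCTDecode ] /Width 1653 >>
BI /W 1653 /H 2338 /CS /G /BPC 8 /F /FlateDecode ID
"""

filters = count_filters(sample)
print(dict(filters))
```

<p>For a real file, something like
<code>count_filters(open("document.pdf", "rb").read())</code> should show at a glance
whether the pages ended up as <code>DCTDecode</code> or as one of the lossless filters.</p>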
<h3 id="convertingpnmtojpgthentopdfandfixpagesize">Converting PNM to JPG, then to PDF, and fix page size</h3>
<p>However, the pages in <code>document.pdf</code> are 82.47×58.31 cm, which corresponds to
about 72 dpi with respect to the size of the original images. But <code>convert</code>
also allows us to specify the pixel density, so we'll set that to 200 dpi
in X and Y direction, which was the resolution at which the images were scanned:</p>
<pre><code>$ convert image*jpg -density 200x200 document.pdf
</code></pre>
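<p>Where the odd page size comes from is quickly checked with a small
calculation. The Python sketch below (the function name <code>page_size_cm</code> is just
for illustration) converts the pixel dimensions of my scans into centimeters
for a given pixel density:</p>

```python
CM_PER_INCH = 2.54


def page_size_cm(width_px, height_px, dpi):
    """Physical size in cm of an image shown at the given pixel density."""
    return (width_px / dpi * CM_PER_INCH, height_px / dpi * CM_PER_INCH)


# 1653 x 2338 pixels is the size of the scanned images from above.
for dpi in (72, 200):
    width_cm, height_cm = page_size_cm(1653, 2338, dpi)
    print(f"{dpi:>3} dpi: {width_cm:.2f} x {height_cm:.2f} cm")
```

<p>At 72 dpi this gives the poster-sized pages from above, while at 200 dpi
the result is within a millimeter of A4 (21.0×29.7 cm).</p>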
<p><em>Update:</em> You can also use the <a href="http://www.imagemagick.org/script/command-line-options.php#page" title="ImageMagick: Command-line Options"><code>-page</code> parameter</a> to set the page size
directly. It takes a multitude of predefined paper formats (see link) and will
do the pixel density calculation for you, as well as adding any necessary
offset if the image ratio is not quite exact:</p>
<pre><code>$ convert image*jpg -page A4 document.pdf
</code></pre>
<p>With that approach, I could reduce the size of my PDF from 250 MB with
losslessly compressed images to 38 MB with DCT compression.</p>
<p><em>Another update (2023):</em> Marcus notified me that it is possible to use
ImageMagick's <code>-compress jpeg</code> option; this way we can leave out the
intermediate step and convert PNM to PDF directly:</p>
<pre><code>$ convert image*.pnm -compress jpeg -quality 85 output.pdf
</code></pre>
<p>You can also play around with the <code>-quality</code> parameter to set the JPEG
compression level (100% makes almost pristine, but huge images; 1% makes very
small, very blocky images); 85% should still be readable for most documents
at that resolution.</p>
<h2 id="toolongdidntread">Too long, didn’t read</h2>
<p>Here’s the gist for you:</p>
<ul>
<li>Read the article above, it’s very comprehensive :P</li>
<li><p>Use <code>convert</code> on XSane’s multipage images and specify your
scanning resolution:</p>
<pre><code>$ convert image*.pnm image.jpg
$ convert image*jpg -density 200x200 document.pdf
</code></pre></li>
</ul>
<h2 id="furtherreading">Further reading</h2>
<p>There is probably software out there which does those things for you, with a
shiny user interface, but I could not find one quickly. What I did find though,
was <a href="http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/" title="Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A">this detailed article</a>, which describes how to get
high-resolution scans with OCR information in PDF/A and DjVu format, using
<code>scantailor</code> and <code>unpaper</code>.</p>
<p>Also, Didier Stevens helped me understand stream objects in his
<a href="http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/" title="Didier Stevens: PDF Stream Objects">illustrated blogpost</a>. He seems to write about PDF more
often, and it was fun to poke around in his blog. There is also a nice script,
<a href="http://blog.didierstevens.com/programs/pdf-tools/" title="Didier Stevens: PDF Tools"><code>pdf-parser</code></a>, which helps you visualize the structure of a PDF
document.</p>