OCR'ing and Optimising Scanned PDF Documents

Introduction
Say that you are faced with a large collection of scanned documents that has been growing over years, and the disk is now full.

Those scanned files have been created with a mixture of simple scanners, old software, and more advanced, proprietary scanning software. Some sheets of paper have been scanned as plain JPEG or TIFF bitmaps. Some are PDF files with one such large bitmap per page, others use the Mixed Raster Content compression technique. The scan resolution is different too: 300 DPI, 200 DPI, 150 DPI... Some PDFs have an OCR text overlay, so that you can copy text out of the images, or search for text across many PDF documents.

This is a typical situation you may encounter in a small business or non-profit organisation. Buying a new disk may be cheaper, but nowhere near as satisfying as OCR'ing and optimising the size of all existing scanned documents. And it shouldn't be hard, because you cannot be the only person facing this problem, can you?

If you search the Internet for OCR and PDF optimisation software, you will find that many are commercial offerings. In the free software world, your only practical options are Ghostscript and OCRmyPDF. All other projects seem to have major limitations and/or seem to be no longer actively maintained, at least as of January 2022.

Ghostscript
Ghostscript is often recommended to reduce the size of PDF files (to "optimise" them), but it is not actually a PDF optimiser. Ghostscript can convert one PDF to another with its pdfwrite device, but during the rewrite process, some content can be transformed or even lost. Such changes or losses are normally not a problem with PDFs coming from document scanners though.

Documents are often scanned with a high resolution for OCR purposes. Afterwards, you may want to reduce the resolutions of all embedded images, in order to reduce the disk space requirements for long-term archival. And that is where Ghostscript can help:

gs \
  -sDEVICE=pdfwrite \
  -dNOPAUSE \
  -dBATCH \
  -dPDFSETTINGS=/ebook \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -sOutputFile=OutputFile.pdf \
  InputFile.pdf

The 'ebook' setting presets a myriad of other options designed to generate output with 150 DPI resolution for colour and grayscale images, and 300 DPI for monochrome images. However, the preset for ColorConversionStrategy may cause an unwanted colour space conversion and increase the size of the resulting images, so we correct it by overriding that preset value with LeaveColorUnchanged.

Normally, settings like ColorImageDownsampleThreshold are preset to 1.5, which means that colour images of up to 225 DPI (1.5 times the 150 DPI target) will be passed through without re-encoding. Images with a higher resolution will be downsampled and re-encoded.
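If you would rather not depend on the /ebook preset, you can set the downsampling options explicitly. The following is just a minimal sketch using documented pdfwrite options; the resolution values are only examples, so adjust them to your needs:

# Downsample colour and grayscale images to 150 DPI, and monochrome images to 300 DPI.
gs \
  -sDEVICE=pdfwrite \
  -dNOPAUSE \
  -dBATCH \
  -dDownsampleColorImages=true \
  -dColorImageResolution=150 \
  -dColorImageDownsampleThreshold=1.0 \
  -dDownsampleGrayImages=true \
  -dGrayImageResolution=150 \
  -dDownsampleMonoImages=true \
  -dMonoImageResolution=300 \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -sOutputFile=OutputFile.pdf \
  InputFile.pdf

A threshold of 1.0 means that any image above the target resolution gets downsampled, instead of only those 1.5 times over it.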

If you want to inspect all option values that Ghostscript will be using, append the following to the gs command above:

-c "currentpagedevice {exch ==only print === } forall" | sort

The biggest problems with Ghostscript, as of version 9.55.0 from 2021-09-27, are:

JBIG2 Always Downgraded to CCITT
Ghostscript implements pass-through for embedded images that it is not going to alter, see options PassThroughJPEGImages and PassThroughJPXImages. Whether an image is going to be altered depends on options like ColorImageDownsampleThreshold.

The trouble is, pass-through is only implemented for JPEG and JPEG 2000 images, and not for other types. Black-and-white images, very common when scanning text documents, are often compressed with JBIG2 (which has a lossless mode too), and there is no pass-through for that kind.

Ghostscript only supports decompressing JBIG2, and not compressing it. Therefore, it will always decompress JBIG2 and recompress as CCITT. Effectively, Ghostscript will always downgrade all JBIG2 images to CCITT.
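You can verify this with the pdfimages tool from Poppler, assuming it is installed and your input actually contains JBIG2 images. Its "enc" column shows each embedded image's encoding:

# The "enc" column will show "jbig2" for the input file,
# but only "ccitt" for Ghostscript's output file.
pdfimages -list InputFile.pdf
pdfimages -list OutputFile.pdf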

This is not just a shortcoming when optimising existing PDFs: Ghostscript will simply miss this optimisation opportunity when creating new PDF files. With my very limited testing, I have seen a typical size reduction between 20% and 30%, but some sources report bigger gains.

Second Deflate Image Filter Dropped
The PDF standard allows an image stream to be processed by 2 filters. When scanning text documents, the combination /FlateDecode/DCTDecode (zip+JPEG) is pretty common.

You would not normally expect a compressor like zip to further reduce the size of a JPEG image, but sometimes it does. I have seen it myself with a small PDF test document I created with a commercial OCR program. With my very limited testing, I have seen a typical reduction of about 10%, and some sources report compression gains of up to 15%. One test case that yields 17% is here. The following article explains the matter:

https://kb.itextpdf.com/home/it7kb/faq/how-to-add-jpeg-images-that-are-multi-stage-filtered-dctdecode-and-flatedecode
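If you want a rough idea of the potential gain for your own files, you can extract the embedded JPEG images and compress them with gzip, which uses the same Deflate algorithm as /FlateDecode. This is only an approximation (gzip adds a small header of its own), and it assumes that Poppler's pdfimages tool is installed:

# Extract the embedded images in their native formats;
# DCTDecode images come out as img-nnn.jpg files.
pdfimages -all InputFile.pdf img

# Compare each JPEG's raw size with its Deflate-compressed size.
for f in img-*.jpg; do
  raw=$(wc -c <"$f")
  deflated=$(gzip -9 -c "$f" | wc -c)
  echo "$f: $raw bytes raw, $deflated bytes after Deflate"
done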

Ghostscript does not support using a second Deflate image filter, so:
 * If it is encoding a JPEG image (for example, after resampling), it will miss that size optimisation opportunity.
 * If it is passing a JPEG image through, it will effectively remove that size optimisation.

Again, this is not just a shortcoming when optimising existing PDFs. Ghostscript will simply miss this optimisation opportunity when creating new PDF files.

Disappointment
I am disappointed with Ghostscript. The lack of a) JBIG2 compression, and b) a second Deflate image filter for JPEG, means bigger file sizes than competing commercial products for no good reason, and not just when optimising existing PDFs.

A pass-through for images encoded with JBIG2 is not really hard to implement.

Ghostscript is pretty well documented at a detailed technical level, but its documentation does not really tell you what you need to know in a concise, understandable way for an end user. There is not even a "known problems" section to summarise the most important shortcomings, which is always a bad sign.

The Ghostscript team is aware that this unexpected increase in PDF size can cause head scratching, and that their current implementation is suboptimal, as discussed in this mailing list thread:

 * Questions about PDF file size optimisation

OCRmyPDF
OCRmyPDF is a very versatile tool designed for scanned documents: you can do all of the following operations at once, or just some of them (see the example invocations after this list).
 * It can create a PDF from a scanned picture.
 * It can add an OCR text overlay to an existing PDF.
 * It can optimise the size of an existing PDF file.
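The command line reflects that flexibility. Here are a few typical invocations as a sketch; check "ocrmypdf --help" for the full option list:

# Create a searchable PDF from a scanned picture
# (the scan resolution must be supplied for plain bitmaps).
ocrmypdf --image-dpi 300 Scan.jpg Output.pdf

# Add an OCR text overlay to an existing PDF,
# leaving pages that already have text alone.
ocrmypdf --skip-text Input.pdf Output.pdf

# Only optimise the file size as aggressively as possible, without redoing any OCR.
ocrmypdf --tesseract-timeout=0 --optimize 3 --skip-text Input.pdf Output.pdf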

OCRmyPDF uses Tesseract among many other tools to do the heavy lifting.

OCRmyPDF is robust in the face of problematic PDF files. It supports JBIG2 encoding and usually does a better job optimising PDF file size than Ghostscript alone. And it has sensible defaults.

Its documentation is also pretty good. But it is missing a "known problems" section, which is always a bad sign, and it does not tell you about the most common shortcomings that you will probably encounter straight away. And there are plenty of them, as of version 13.3.0 from January 2022.

Cannot Reduce the Resolution of Pictures
Normally, you want to scan with 300 DPI (or perhaps more), do OCR, and then reduce the resolution to 150 DPI or less. This gives you optimal OCR accuracy and file size.

Ghostscript can be told to downsample images over a certain resolution threshold, but OCRmyPDF does not support such an obvious feature.

Increasing the JPEG compression level can be better than reducing resolution, but OCRmyPDF does not provide enough flexibility in this respect. So 300 DPI pictures will remain bigger than their 150 DPI downsampled versions. See for example GitHub Issue Feature request: additional post-processing options, and this comment.
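Until that changes, one possible workaround is to let OCRmyPDF do the OCR at the full scan resolution, and then run the result through Ghostscript to downsample the images. This is just a sketch, and it inherits the Ghostscript limitations described above, such as the JBIG2 downgrade:

# Step 1: OCR at the full scan resolution for the best accuracy.
ocrmypdf --optimize 0 Input.pdf Ocr.pdf

# Step 2: downsample the images afterwards with Ghostscript.
gs \
  -sDEVICE=pdfwrite \
  -dNOPAUSE \
  -dBATCH \
  -dPDFSETTINGS=/ebook \
  -dColorConversionStrategy=/LeaveColorUnchanged \
  -sOutputFile=Final.pdf \
  Ocr.pdf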

Cannot Optimise Images That Use 2 Filters
Later note: This has been fixed in OCRmyPDF version 13.4.0.

The PDF standard allows an image stream to be processed by 2 filters. When scanning text documents, the combination /FlateDecode/DCTDecode (zip+JPEG) is pretty common.

Such images are OCR'ed correctly. However, OCRmyPDF's optimiser silently skips them. Well, you do get a warning if you pass option '--verbose', but then the log output is too noisy to be useful, so you are bound to miss this shortcoming, as I did initially.

See GitHub Issue Support multiple image filters, or at least the common combination /FlateDecode/DCTDecode.

Cannot Optimise with a Second Deflate Filter after JPEG
For information about how much optimisation potential is being missed here, see the similar Ghostscript issue above.

Other Quality Issues
OCRmyPDF has other quality issues:
 * Even if you disable OCR with --tesseract-timeout=0, it still runs Tesseract to query its features. See GitHub Issue Add command to skip all processing related to OCR.
 * The script has a high start-up time. 300 ms have been observed in GitHub Issue Allow reusing the Docker container.
 * The script can only process 1 PDF at a time, which can make its high start-up time problematic (a batch workaround is sketched after this list). See feature request Process several files in a single invocation.
 * The provided Docker container image has an even higher start-up time. An additional overhead of 600 ms has been observed in GitHub Issue Allow reusing the Docker container. Later note: The container instance can be reused with a workaround that I have described in the GitHub Issue.
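As a workaround for the one-file-at-a-time limitation, you can drive OCRmyPDF with GNU parallel, so that the start-up overhead at least overlaps across several files. A sketch, assuming that the output files should live next to the inputs with an added .ocr.pdf suffix:

# OCR every PDF under the current directory, several files at a time.
# {.} is GNU parallel's placeholder for the input filename without its extension.
find . -name '*.pdf' -print0 |
  parallel -0 --bar ocrmypdf --skip-text {} {.}.ocr.pdf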

No Stable Free Software for Mixed Raster Content
Mixed Raster Content is a technique that can greatly increase compression efficiency for documents that contain large portions of text. During my very limited testing, I have seen scanned documents reduced to 1/3 of their original size.

Commercial scanning and OCR software normally implements MRC, but free software such as Ghostscript and OCRmyPDF does not support it.

The Internet Archive PDF tools project does provide a free-software implementation for PDF documents, but it is focused on creating new PDF files. "Recoding" an existing PDF is marked as a "not well tested feature" and has limitations.

The didjvu project provides another free-software implementation, but it is not geared towards PDF files.