OCR'ing and Optimising Scanned PDF Documents

Say you are faced with a large collection of scanned documents that has been growing for years, and the disk is now full.

Those scanned files were created with a mixture of simple scanners, old software, and more advanced proprietary scanning software. Some sheets of paper were scanned as plain JPEG or TIFF bitmaps. Some are PDF files with one such large bitmap per page; others use the Mixed Raster Content compression technique. The scan resolution varies too: 300 DPI, 200 DPI, 150 DPI... Some PDFs have an OCR text overlay, so that you can copy text out of the images or search for text across many PDF documents.

This is a typical situation you may encounter in a small business or non-profit organisation. Buying a new disk may be cheaper, but it is nowhere near as satisfying as OCR'ing and optimising the size of all existing scanned documents. And it shouldn't be hard, because you cannot be the only person facing this problem, can you?

If you search the Internet for OCR and PDF optimisation software, you will find that many of the results are commercial offerings. In the free software world, your only practical options are Ghostscript and OCRmyPDF. All other projects seem to have major limitations and/or no longer seem to be actively maintained, at least as of January 2022.
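To give a first impression of what driving these two tools looks like, here is a minimal Python sketch that only builds and optionally runs their command lines. The function names, the file names and the choice of options are illustrative assumptions, not a recommended configuration: it assumes the "ocrmypdf" and "gs" executables are on the PATH, and the right settings depend on your documents.

```python
import shutil
import subprocess
from pathlib import Path


def ocrmypdf_command(src: Path, dst: Path) -> list[str]:
    """Build an OCRmyPDF command line that adds an OCR text layer to src."""
    return [
        "ocrmypdf",
        "--skip-text",      # do not OCR pages that already contain text
        "--optimize", "1",  # lossless optimisation (OCRmyPDF's default level)
        str(src),
        str(dst),
    ]


def ghostscript_command(src: Path, dst: Path) -> list[str]:
    """Build a Ghostscript command line that rewrites src as a smaller PDF."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/ebook",  # preset that downsamples images to about 150 DPI
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={dst}",
        str(src),
    ]


def run_if_available(command: list[str]) -> bool:
    """Run the command if its executable is installed; report whether it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(ocrmypdf_command(Path("scan.pdf"), Path("scan-ocr.pdf"))))
    print(" ".join(ghostscript_command(Path("scan.pdf"), Path("scan-small.pdf"))))
```

Keeping the command construction separate from the execution makes it easy to print and review the exact command lines before letting them loose on a whole archive.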

[more content to be written later on]