Tuesday, September 14, 2010

The problem of making searchable PDFs

It has been a very laborious process trying to find a free (or at least cheap) solution for making image-based PDFs searchable using software that runs on either Mac OS X or Linux.  This is a particularly pressing need for me given the number of books and other paper-based documents that I scan on a regular basis.  Interestingly, the packaged software that came with my flatbed scanner, a CanoScan LiDE 70, was able to effortlessly add a text layer to my scans under Windows XP.  However, since I changed computers and operating systems, I have been using VueScan as my scanning app.  While the version of this software (8.6.23) that I have been using can OCR text and write the output to a TXT file, it cannot produce searchable PDFs.  (I just noticed that a newer version (8.6.33), released this past May, actually does add support for creating searchable PDFs.  I will definitely download this.  I should also note in passing that VueScan adds functionality that Canon's packaged drivers and software lacked, such as the ability to run continuously through a multi-page scan, eliminating the need to constantly hit the scan button.)

In any case, I need a solution for converting the numerous files that I have already produced that are simply image-based.  My goal has been to find a way out of buying an expensive OCR and PDF creation suite, such as OmniPage Pro or ABBYY FineReader, which can create searchable PDFs.  Most of the free software that I have been able to find through Googling has been designed to work from the Linux command line.  I am willing to use this software as a solution because I have an older laptop on which I have installed Ubuntu 9.10, and I am not against shuffling PDFs between my MacBook Pro and that machine in order to post-process my scans.  (This workflow also seems to be the engineering solution of choice, especially in larger networked settings, since there is a Live-CD based Linux distro designed just for handling this task.)

The first software that I tried was pdfocr.  I was able to successfully install all the necessary packages.  I was initially encouraged that the software processed the first PDF that I fed it page by page without balking.  However, the script constantly complained that each page image was not at its anticipated resolution of 300 DPI, and there does not seem to be a command-line option for changing this setting.  (Most of the book scans that I have done are at 150 DPI, mostly because this resolution is usable for screen reading and it speeds up the scanning process.  At resolutions of 300 DPI and above, the scanning head on my scanner simply crawls.)  The final output was disappointing.  Though pdfocr successfully added an OCR layer to each page, the underlying text was set at far too large a point size and thus out of all proportion to the image text.  This layer is not at all usable, either for highlighting with PDF annotation software or for searching to find where a word or phrase specifically occurs.
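One workaround I have not yet tried (a sketch only, assuming the standard poppler and ImageMagick tools, and with file names invented for illustration) would be to re-render each page at the 300 DPI the script expects before feeding the PDF back to it:

```shell
# Re-render a scanned, image-only PDF at 300 DPI so that a
# DPI-sensitive script such as pdfocr sees the resolution it expects.
# Requires poppler-utils (pdftoppm) and ImageMagick (convert).

# Render every page of the scan to 300 DPI grayscale images.
pdftoppm -r 300 -gray input.pdf page

# Reassemble the rendered pages into a new PDF, tagging them as 300 DPI.
# (Check that the page-*.pgm files glob in page order for long documents.)
convert -density 300 -units PixelsPerInch page-*.pgm input-300dpi.pdf
```

Upsampling a 150 DPI scan obviously adds no real detail, but it may be enough to keep the script from mis-scaling the text layer.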

The second command-line based software that I tried, a custom bash script described in this blog post, suffered from the same problem.  This script also uses both the same OCR engine, Cuneiform, and the same OCR data format, hOCR, as the first software I tried.  This tells me that, whatever its OCR accuracy, Cuneiform with hOCR may not be suitable for this application as currently packaged.  At the very least, a programmer with more knowledge than I have would need to add more robust options before these scripts can handle a set of files like mine.
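For what it is worth, the core of such a Cuneiform/hOCR pipeline is short enough to adjust by hand.  Here is a sketch of the per-page loop (assuming page images already extracted as page-*.pgm, and assuming ExactImage's hocr2pdf, whose -r option tells it the scan's true resolution; if I understand the problem correctly, that is exactly the knob whose absence produces the oversized text layer):

```shell
# Per-page OCR loop using Cuneiform and ExactImage's hocr2pdf.
# Passing -r 150 tells hocr2pdf the true resolution of the scan, which
# should keep the hidden text layer in proportion to the page image
# instead of assuming 300 DPI.
for page in page-*.pgm; do
    cuneiform -f hocr -o "${page%.pgm}.hocr" "$page"
    hocr2pdf -i "$page" -r 150 -o "${page%.pgm}.pdf" < "${page%.pgm}.hocr"
done

# Merge the per-page PDFs back into one document (requires pdftk).
pdftk page-*.pdf cat output searchable.pdf
```

I have not verified this against my own files, but it suggests the fix is a small scripting change rather than a different OCR engine.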

Given that Google Book search is able to use its Tesseract OCR software to produce accurate (and accurately placed) text data for page scans, it should not be that difficult to find a free and efficient solution to use on my own computer.
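As a point of comparison, releases of Tesseract from version 3.03 onward (so, after this post) can emit a searchable PDF directly, which would collapse the whole pipeline to something like this (file names are illustrative):

```shell
# With Tesseract 3.03 or later, a searchable PDF can be produced in one
# step; the trailing "pdf" argument selects the PDF output renderer.
tesseract page.tif page pdf

# For a multi-page scan, render the PDF to images first (poppler-utils),
# then hand Tesseract a text file listing the page images.
pdftoppm -r 300 -gray input.pdf page
ls page-*.pgm > pages.txt
tesseract pages.txt searchable pdf
```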
