In any case, I need a way to convert the many image-only files that I have already produced. My goal has been to avoid buying an expensive OCR and PDF creation suite, such as OmniPage Pro or ABBYY FineReader, which can create searchable PDFs. Most of the free software that I have found through Googling is designed to run from the Linux command line. I am willing to use this software as a solution because I have an older laptop running Ubuntu 9.10, and I do not mind shuffling PDFs between my MacBook Pro and that machine in order to post-process my scans. (This workflow also seems to be the engineering solution of choice, especially in larger networked settings, since there is a Live-CD based Linux distro designed just for this task.)
The first software that I tried was pdfocr. I was able to install all the necessary packages, and I was initially encouraged that the software processed the first PDF that I fed it page by page without balking. However, the script constantly complained that each page image was not at the anticipated resolution of 300 DPI, and there does not seem to be a command-line option that allows this value to be changed. (Most of the book scans that I have done are at 150 DPI, mostly because this resolution is usable for screen reading and it speeds up the scanning process. At resolutions of 300 DPI and above, the scanning head on my scanner simply crawls.) The final output was disappointing. Though pdfocr successfully added an OCR layer to each page, the text in that layer was set at far too large a point size and was thus out of all proportion to the text in the image. The layer is not usable either for highlighting with PDF annotation software or for searching to find where a specific word or phrase occurs.
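To illustrate why the resolution assumption matters, here is a minimal sketch of the arithmetic involved in placing OCR text over a page image. It is not taken from pdfocr, and I have not verified what pdfocr actually does internally; the function name `bbox_to_points` and the sample word box are hypothetical. The only fixed fact is that a PDF point is 1/72 of an inch, so a pixel coordinate converts to points as `pixels * 72 / dpi`.

```python
def bbox_to_points(bbox, dpi):
    """Convert a pixel bounding box (x0, y0, x1, y1) to PDF points.

    One PDF point is 1/72 inch, so the conversion depends entirely
    on the resolution (DPI) that the scan is assumed to have.
    """
    return tuple(round(v * 72.0 / dpi, 2) for v in bbox)


# Hypothetical word box, in pixels, from a page scanned at 150 DPI.
word_box = (300, 450, 600, 480)

# Converted with the true scan resolution.
at_150 = bbox_to_points(word_box, 150)

# Converted with a hard-coded 300 DPI assumption instead.
at_300 = bbox_to_points(word_box, 300)

print(at_150)  # (144.0, 216.0, 288.0, 230.4)
print(at_300)  # (72.0, 108.0, 144.0, 115.2)
```

The two results differ by a factor of two in every coordinate. If the page image and the text layer are scaled under different resolution assumptions, the text ends up out of proportion to the image by that same factor, in one direction or the other.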
The second command-line tool that I tried, a custom bash script described in this blog post, suffered from the same problem. This script uses the same OCR engine, Cuneiform, and the same OCR data format, hOCR, as the first software I tried. This suggests that, whatever its OCR accuracy, the combination of Cuneiform and hOCR may not be suitable for this application. At the very least, a programmer with more knowledge than I have needs to create more robust options in order to work with my set of files.
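For readers unfamiliar with hOCR: it is an HTML-based format in which each recognized element carries its position on the page as a `bbox x0 y0 x1 y1` entry, in pixels, inside the element's `title` attribute. The sketch below pulls those boxes out with a regular expression. The sample line is made up for illustration and is not real Cuneiform output, and `extract_bboxes` is a hypothetical helper; a real script would more likely use an HTML parser.

```python
import re

# Matches the "bbox x0 y0 x1 y1" entry that hOCR stores in title attributes.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def extract_bboxes(hocr_text):
    """Return every pixel bounding box found in an hOCR string."""
    return [tuple(int(n) for n in m.groups())
            for m in BBOX_RE.finditer(hocr_text)]


# Made-up hOCR fragment, for illustration only.
sample = '<span class="ocr_line" title="bbox 300 450 600 480">example text</span>'

boxes = extract_bboxes(sample)
print(boxes)  # [(300, 450, 600, 480)]
```

Because these coordinates are in pixels, any tool that builds a PDF text layer from hOCR has to know the scan resolution in order to place and size the text correctly.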
Given that Google Book Search is able to use its Tesseract OCR software to produce accurate (and accurately placed) text data for page scans, it should not be that difficult to find a free and efficient solution that I can use on my own computer.