Saturday, September 25, 2010

The TeX engine as a solution for dynamically typesetting ebooks

During the last month I watched a couple of conference presentations (namely, William Cheswick on TeX and the iPad and Kaveh Bazargan on TeX as an ebook reader) that discussed the possibility of using the TeX typesetting system on the current generation of e-book reading devices, and in particular on the iPad.  LaTeX has traditionally been used to typeset mathematical, scientific and technical publications for electronic and print distribution.  (TeX, the base engine for which LaTeX is a front-end markup language, was invented by Donald Knuth over thirty years ago to address the problem of typesetting the equations in The Art of Computer Programming.)  But the language more than adequately handles typesetting for books in the humanities and social sciences, and many reviewers believe that TeX kerns and spaces text better than Adobe PageMaker/InDesign or QuarkXPress.  (Notably, you can see from the copyright page that Cambridge University Press uses LaTeX to typeset many of its more recently published books.)

What I find really interesting about Cheswick's and Bazargan's proposals is that they try to solve one of the fundamental problems that has confronted publishers of electronic texts.  Unlike the static PDF files that InDesign and Quark produce, which fix a document's pagination and fonts forever, TeX is capable of dynamically re-typesetting attractively set text to fit a different orientation of an ebook reader's screen, or to accommodate a reader's preference for a larger font size (which means, in essence, that TeX instantly generates a new DVI or PDF file as needed).  Of course, one of the traditional strengths of electronic texts (such as the plain text ebooks that one can download from Project Gutenberg) has been this kind of plasticity: it is easy to open a TXT file in a word processor and customize it to one's heart's content.  But as anyone who has tried to read a very long TXT document on their computer knows, these texts are not very pretty.  The standard kerning and tracking between characters, especially in a basic monospace font, are very crude.  Plain ASCII text also has no support for a host of typographical conventions that have informed how we have read the codex book for the last five centuries, including footnotes, sidenotes, glosses and various textual ornaments.  This is why typeset PDFs are preferable in many ways for electronic reading.  But these texts have never been very plastic; even zooming in on a page to make the font bigger entails constant panning from left to right and up and down.  This can be particularly tedious if you are reading on a small screen, such as that of an iPhone or iPod Touch.
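To make the idea concrete, here is a minimal sketch of what "re-typesetting on demand" could look like from the command line.  The file name book.tex, the macro names and the dimensions are my own invented examples, not anything taken from Cheswick's or Bazargan's implementations:

    #!/bin/sh
    # Hypothetical sketch: one LaTeX source re-typeset for two "screens".
    # book.tex is assumed to read \ScreenWidth and \ScreenHeight (for
    # example via the geometry package) to set its page dimensions.

    # Portrait orientation
    pdflatex -jobname=book-portrait \
      "\def\ScreenWidth{90mm}\def\ScreenHeight{120mm}\input{book}"

    # The reader rotates the device: run the engine again with new
    # dimensions, and TeX rebreaks every line and page to fit.
    pdflatex -jobname=book-landscape \
      "\def\ScreenWidth{120mm}\def\ScreenHeight{90mm}\input{book}"

An ebook reader built on the engine would presumably do this internally rather than shelling out to pdflatex, but the principle is the same: the source stays fixed and the engine reflows it for whatever dimensions it is given.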

It is great that there are researchers who are thinking about how to achieve the best of both worlds.  In a way this use of TeX is an extension of HTML, since that markup language has also, in a more limited way, balanced page design features (such as tables, different fonts or block quotes) with the ability to dynamically reflow text to fit the width of a web-browser window.  (Incidentally, I have learned in my own experience formatting ebooks for the Amazon Kindle that the best results are achieved by submitting a file in HTML for conversion.  I would thus not be surprised if Amazon's proprietary AZW format were using something very close to HTML for formatting its books.  This would also explain why the Kindle presents a fairly decent typographical reading experience: HTML is a more capable format for reading than plain text.  But it also explains why Kindle books are not as pretty as the text in, say, a Folio Society book.)
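As a rough illustration, this is the kind of bare-bones HTML I have in mind; the file and chapter names are invented, and the final kindlegen step assumes Amazon's free command-line converter rather than the web upload form:

    cat > book.html <<'EOF'
    <html>
      <head><title>Sample Book</title></head>
      <body>
        <h1>Chapter One</h1>
        <p>Ordinary paragraphs reflow to fit the screen and the
        reader's chosen font size.</p>
        <blockquote>Block quotes, emphasis and other basic typography
        survive the conversion, unlike plain ASCII text.</blockquote>
      </body>
    </html>
    EOF
    # Optional: convert locally instead of uploading the HTML to Amazon
    kindlegen book.html -o book.mobi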

I found it interesting that Bazargan said that his implementation of TeX on the iPhone/iPod Touch did not support pagination.  It is clear from the presentation that the software typesets a document as one long page, so that the reader can freely scroll up and down the document.  This design decision sidesteps the problems implicit in letting software, no matter how smart, automatically break text, figures or equations across two pages.  From my experience using InDesign, I can say that it takes a human eye to decide how best to set text around a page break.

Tuesday, September 14, 2010

The problem of making searchable PDFs

It has been a very laborious process trying to discover a free (or at least cheap) solution for making image-based PDFs searchable using software that runs on either Mac OS X or Linux.  This is a particularly pressing need for me given the number of books and other paper-based documents that I scan on a regular basis.  Interestingly, the packaged software that came with my flatbed scanner, a CanoScan LiDE 70, was able to effortlessly add a text layer to my scans under Windows XP.  However, since I changed computers and operating systems, I have been using VueScan as my scanning app.  While the version of this software (8.6.23) that I have been using can OCR text and write the output to a TXT file, it cannot produce searchable PDFs.  (I just noticed that a newer version, 8.6.33, released this past May, actually does add support for creating searchable PDFs.  I will definitely download it.  I should also note in passing that VueScan adds functionality that Canon's packaged drivers and software lacked, such as the ability to run continuously through a multi-page scan, eliminating the need to constantly hit the scan button.)

In any case, I need a solution for converting the numerous files that I have already produced that are simply image-based.  My goal has been to avoid buying an expensive OCR and PDF creation suite, such as OmniPage Pro or ABBYY FineReader, just to create searchable PDFs.  Most of the free software that I have been able to find through Googling is designed to work from the Linux command line.  I am willing to use this kind of software because I have an older laptop on which I have installed Ubuntu 9.10, and I am not against shuffling PDFs between my MacBook Pro and that machine in order to post-process my scans.  (This workflow also seems to be the engineering solution of choice, especially in larger networked settings, since there is a Live-CD-based Linux distro designed just for handling this task.)
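For what it is worth, the shuffling itself is trivial; something like the following is all I have in mind, with the host name and the ocr-pdf.sh script standing in for whatever OCR step ends up working:

    # Copy a scan to the Ubuntu laptop, OCR it there, and copy it back.
    scp scan.pdf ubuntu-laptop:inbox/
    ssh ubuntu-laptop './ocr-pdf.sh inbox/scan.pdf outbox/scan-ocr.pdf'
    scp ubuntu-laptop:outbox/scan-ocr.pdf .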

The first software that I tried was pdfocr.  I was able to successfully install all the necessary packages.  I was initially encouraged that the software processed the first PDF that I fed it page by page without balking.  However, the script constantly complained that each page image was not at its anticipated resolution of 300 DPI, and there does not seem to be a command-line option for changing that assumption.  (Most of the book scans that I have done are at 150 DPI, mostly because this resolution is usable for screen reading and it speeds up the scanning process; at 300 DPI and above the scanning head on my scanner simply crawls.)  The final output was disappointing.  Though pdfocr successfully added an OCR layer to each page, the underlying text was set at far too large a point size and was thus out of all proportion to the image of the text.  This layer is not usable either for highlighting with PDF annotation software or for searching to find where a word or phrase specifically occurs.
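One possible workaround, which I have not yet tried, would be to re-render the 150 DPI scans at 300 DPI before handing them to the OCR step.  The sketch below assumes pdftoppm from Ubuntu's poppler-utils package and uses made-up file names:

    # Rasterize every page of the scan at 300 DPI so the OCR tool sees
    # images at the resolution it assumes.
    pdftoppm -r 300 -gray scan.pdf page
    # This produces page-1.pgm, page-2.pgm and so on (the exact numbering
    # varies by version), which can then be OCRed page by page.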

The second command-line tool that I tried, a custom bash script described in this blog post, suffered from the same problem.  This script uses the same OCR engine (Cuneiform) and the same OCR data format (hOCR) as the first tool I tried.  This tells me that, whatever its accuracy, the combination of Cuneiform and hOCR may not be suitable for this application.  At the very least, a programmer with more knowledge than I have needs to create more robust options in order to work with my set of files.
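For anyone unfamiliar with these tools, the per-page pipeline that such scripts run looks roughly like the following; this is a sketch with invented file names, and hocr2pdf comes from the ExactImage package:

    # OCR one page image and ask Cuneiform for hOCR output, which records
    # the bounding box of every recognized word.
    cuneiform -f hocr -o page-1.hocr page-1.bmp
    # Lay the recognized text invisibly over the page image.  This is the
    # step where the text has to be sized and positioned to match the
    # image, and where my results have gone wrong.
    hocr2pdf -i page-1.bmp -o page-1.pdf < page-1.hocr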

Given that Google Book Search is able to use its Tesseract OCR software to produce accurate (and accurately placed) text data for page scans, it should not be that difficult to find a free and efficient solution to use on my own computer.
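If a recent enough Tesseract build is available, I suspect the same hOCR-plus-hocr2pdf recipe could be tried with Tesseract standing in for Cuneiform; the following is untested guesswork on my part:

    # Tesseract 3.x can emit hOCR when run with the "hocr" config,
    # producing page-1.html with word bounding boxes.
    tesseract page-1.tif page-1 hocr
    hocr2pdf -i page-1.tif -o page-1.pdf < page-1.html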