Sunday, July 24, 2016

Some impressions from having finished reading Finn Brunton's Spam: A Shadow History of the Internet (see also the New Books Network interview):

I suspect that there is a strong difference between early e-mail spam, more recent forms of content spam, and even Philip Parker's book spam (which Brunton footnotes in the conclusion and mentions in the NBN interview) on the one hand, and DDoS attacks, electronic warfare, and malware on the other. Though this latter group of threats may be spread by spam, it can in no way be interpreted as providing any benefit to the gullible or less educated computer users who constituted spammers' traditional audience.

After all, as Brunton emphasizes, during the earlier years of spam, people who bought into scams often received a product or service in exchange for their payment, however poor or degraded in quality it was. People who buy automatically generated books drawn from information on the surface web also receive some kind of information for their purchase. However, the only beneficiaries of malware or electronic warfare are the parties who commission the attacks. By its nature this type of spam has more in common with 419 scams, in that the recipients can only be victims. Thus, we need to distinguish between spam as shady marketing of substandard products and services and spam as pure scam and crime. Can we use the same word for both of these things?

I should also note that I suspect e-mail spam is still being categorized using rules other than Bayesian filters. For example, I work as a Russian translator with a number of legitimate companies in Russia and eastern Europe. However, I have to set up filters in both Gmail and my professional e-mail (Fastmail) to avoid having certain messages tagged as spam just because they come from a .ru address or contain Cyrillic characters. There appears to be nothing else about these messages that would justify classifying them as spam.
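To make the contrast concrete, here is a toy sketch in Python of the difference between a crude origin-based rule and token-based Bayesian scoring (entirely hypothetical, not any provider's actual logic; the token probabilities are invented for illustration):

```python
import math
import re

# A crude rule-based check of the kind my mail seems to be hitting:
# flag anything from a .ru address or containing Cyrillic characters.
def crude_rule_filter(sender: str, body: str) -> bool:
    return sender.endswith(".ru") or bool(re.search(r"[\u0400-\u04FF]", body))

# A (toy) Bayesian filter instead weighs the evidence of every token.
# These per-token spam probabilities are made up for the example.
SPAM_PROB = {"viagra": 0.99, "invoice": 0.40, "translation": 0.10}

def bayes_spam_score(body: str) -> float:
    # Combine per-token probabilities in log-odds space (roughly the
    # approach popularized by Paul Graham's "A Plan for Spam").
    log_odds = 0.0
    for token in body.lower().split():
        p = SPAM_PROB.get(token, 0.5)  # unknown tokens are neutral
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))

# A legitimate job offer from a Russian agency fails the crude rule,
# even though its content gives a Bayesian filter no reason to object.
print(crude_rule_filter("pm@agency.ru", "translation invoice attached"))  # True
print(bayes_spam_score("translation invoice attached"))                   # ~0.07
```

The point of the sketch is simply that an origin-based rule condemns a message before its content is ever weighed, which is exactly the behavior I keep having to override with manual filters.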

Wednesday, October 31, 2012

CrashPlan+ Online Backup Review


For almost a year now I have been using CrashPlan+ as my online/offsite backup solution. One of the key advantages of CrashPlan as compared to many other online backup services is that it offers unlimited backup for a relatively modest fee. Mozy, another service that I once used, used to offer unlimited backup for a flat fee as well, but it discontinued this plan in early 2011. With CrashPlan+ I am able to store upwards of 3 TB of photos (including scans of film and slides belonging to my grandparents and going back decades), home movies, PDFs, documents and other data for $119.00 per year. By contrast, storing the same amount of information with Amazon S3 (a storage service where you pay by the gigabyte) would cost $4,792.32 per year (according to this calculator), which is somewhat beyond my budget, to say the least.
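As a sanity check on that figure, the per-gigabyte rate it implies can be back-computed from the numbers above (a rough calculation; actual S3 pricing is tiered and changes over time):

```python
# Back-of-envelope comparison of the two price points quoted above.
crashplan_per_year = 119.00        # flat fee, unlimited storage
s3_per_year = 4792.32              # the calculator's estimate for my data
data_gb = 3 * 1024                 # ~3 TB expressed in GB

implied_rate = s3_per_year / data_gb / 12      # ~$0.13 per GB-month
print(f"Implied S3 rate: ${implied_rate:.3f}/GB-month")
print(f"S3 costs ~{s3_per_year / crashplan_per_year:.0f}x more for this workload")
```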

Of course, Amazon S3 is a professional-grade service, whereas CrashPlan+ is intended for consumers, so it may not be fair to compare them directly. One of the sacrifices of using a consumer service is being restricted to slow upload speeds. Indeed, though I was able to physically ship the first 1 TB of my backup to CrashPlan on a LaCie external hard drive for a small additional fee at the end of last year, I have spent the last 9 months finishing the upload of the remaining 2 TB over the Internet. Granted, the slow, gradual initial upload is not solely CrashPlan's fault. I am using a fairly average Cox cable broadband subscription that permits me to transfer only 200 GB per month. (However, I should note that the author of another review had a much fancier Internet connection allowing upload speeds of up to 100 Mbps, and yet CrashPlan restricted his uploads to no more than 3 Mbps.)
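A back-of-envelope calculation from the figures above shows why the remainder has taken most of a year (ignoring overhead, retries and downtime):

```python
# Rough upload-time estimates from the figures mentioned above.
remaining_gb = 2 * 1024                  # ~2 TB left after the seeded drive

# Constraint 1: Cox's 200 GB/month transfer cap.
print(remaining_gb / 200)                # ~10.2 months at the cap

# Constraint 2: the ~3 Mbps throttle another reviewer observed.
mbps = 3
gb_per_day = mbps / 8 * 86400 / 1024     # Mb/s -> MB/s -> MB/day -> GB/day
print(remaining_gb / gb_per_day)         # ~65 days of nonstop uploading
```

Either constraint alone puts the transfer in the months range, so the two together make the 9-month slog unsurprising.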

I have generally been happy with CrashPlan+'s performance when restoring files. Though I have not had occasion to request large chunks of my data back (I rely on on-site Time Machine backups on a Drobo for that), I have been able to restore individual files from the cloud with ease. CrashPlan claims to keep unlimited versions of my files going back to the first backup, and I have used this feature to find older edits of files that my local Time Machine archive seems to have missed. This has been immensely convenient. However, I do notice that from time to time I cannot access CrashPlan's servers on demand. Sometimes I have had to wait an hour or so before being able to log in and find my files. Apparently, around-the-clock access to data is not a guaranteed feature.

I have also been concerned by the accusations of others online that CrashPlan has lost their files. Note, for example, this review that shows up on the first page of Google results for "CrashPlan". The person here lost data due to "human error" at one of the company's data centers in Minneapolis. However, apparently this was an isolated incident, and now measures are in place to prevent anything similar from happening in the future.

I would hope this is true, because currently nothing matches CrashPlan's price point and unlimited backup. I also take some assurance from the fact that even if re-downloading all of my data would be impractical should I lose all local copies, I can pay to have the data sent back to me on hard drives (the "Restore to your Door" service).

In any case, CrashPlan is far from being my only backup solution, since I also have local backup. When it comes to data backup, there is security in the number of copies you make and the diversity of backup solutions you use. Indeed, in the world of library science there is an offsite digital preservation platform called LOCKSS, or "Lots of Copies Keep Stuff Safe", a principle certainly borne out by historical experience. Information kept in one library or one place has been much more likely to perish from man-made or natural causes (note the fire that destroyed the Library of Alexandria), whereas information disseminated to multiple places has a much higher chance of survival (note what the printing press did for the dissemination of Martin Luther's translation of the Bible, which, had it existed in only one copy, could easily have been seized by the Catholic Church and destroyed).
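The arithmetic behind that principle is simple: if copies fail independently, the chance of losing everything shrinks exponentially with each additional copy. A toy model (real-world failures are rarely fully independent, so treat this as an upper bound on optimism):

```python
# Probability of total loss with n independent copies, each with a
# hypothetical 1% chance of being destroyed in a given year.
p_loss_one = 0.01
for n in (1, 2, 3):
    print(n, p_loss_one ** n)   # 0.01, 0.0001, 0.000001
```

Three independently kept copies turn a one-in-a-hundred annual risk into one in a million, which is why "local Time Machine plus offsite cloud" beats either alone.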

Monday, January 24, 2011

Hope for all-purpose document organization software?

I have recently been considering the enormous issue of whether there are any programs that can effectively organize and store metadata for the whole range of born-digital (e.g., Word documents) and digitized (scanned) documents that I manage, or whether it is a necessary evil to use different software for the different genres of information. Scholarly articles and books stored as PDFs seem to be best handled and organized by software (e.g., Zotero and Mendeley) designed to support all the metadata required for the specialized citation formats that researchers use. However, a completely different type of software (e.g., EverNote) has arisen to address the organization of general "personal" or "business" information (tax records, notebooks, receipts, and other records). Unlike with reference management software, it is much less clear here what the metadata fields should be.

For example, EverNote offers the user only two basic fields with which to annotate a "note" (to which PDFs and other documents can be attached), namely a custom tag and a URL. Zotero and Mendeley offer a plethora of fields, and researchers frequently clamor for the addition of yet more fields to support their target venue for publication. I read this contrast to mean that information meant for private consumption is organized according to the idiosyncratic preferences of the individual user: no one else needs to understand my tagging system if I am ultimately the only person who will use this information. Information meant for public consumption, by contrast, needs to follow interchangeable standards.

However, at the same time it is also clear that personal collections of information can become more organized and even more meaningful by adopting public standards. Clearly many people enjoy organizing their private libraries using LibraryThing, which can copy the Library of Congress cataloging associated with a given book when users add it to their library, while at the same time allowing users to add their personal tags. My hope is that a piece of software like EverNote could adopt more fields, if only the 15 generalized ones agreed upon in the Dublin Core metadata standard (e.g., "Creator," "Title," "Subject," etc.), so that it could better fulfill the need for an all-purpose document organization tool. If I could consolidate my documents this way, I could take better advantage of the nifty cloud-computing features built into this kind of software. For example, both Mendeley and EverNote publish iPad and iPhone apps that allow for viewing one's entire synchronized library of documents. However, it would be rather unwieldy (both space- and bandwidth-wise) to synchronize two different databases organized by two different programs.
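To illustrate, here is a sketch (hypothetical, not any product's actual data model) of what an EverNote-style note extended with the fifteen Dublin Core elements might look like:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Note:
    # Roughly the user-facing fields EverNote offers today:
    tags: list[str] = field(default_factory=list)
    url: Optional[str] = None

    # The fifteen Dublin Core elements a richer tool could adopt:
    title: Optional[str] = None
    creator: Optional[str] = None
    subject: Optional[str] = None
    description: Optional[str] = None
    publisher: Optional[str] = None
    contributor: Optional[str] = None
    date: Optional[str] = None
    type: Optional[str] = None
    format: Optional[str] = None
    identifier: Optional[str] = None
    source: Optional[str] = None
    language: Optional[str] = None
    relation: Optional[str] = None
    coverage: Optional[str] = None
    rights: Optional[str] = None

# A scanned receipt and a scholarly PDF can share the same schema:
receipt = Note(tags=["taxes-2010"], title="Hardware invoice",
               creator="Some Vendor Inc.", date="2010-11-03",
               type="receipt", language="en")
```

The appeal is that the same generic record works for a tax receipt, a journal article, or a scanned letter, which is exactly what an all-purpose organizer would need.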

Saturday, September 25, 2010

The TeX engine as a solution for dynamically typesetting ebooks

During the last month I watched a couple of conference presentations (namely, William Cheswick on TeX and the iPad and Kaveh Bazargan on TeX as an ebook reader) that discussed the possibility of using the TeX typesetting system on the current generation of e-book reading devices, and in particular on the iPad. LaTeX in particular has traditionally been used to typeset mathematical, scientific and technical publications for electronic and print distribution. (TeX, the base engine for which LaTeX is a front-end markup language, was invented by Donald Knuth over thirty years ago to address the problem of typesetting equations for his The Art of Computer Programming.) But the language more than adequately handles typesetting for books in the humanities and social sciences, and many reviewers believe that TeX kerns and justifies text better than Adobe PageMaker/InDesign or QuarkXPress. (Notably, you can see from the copyright page that Cambridge University Press uses LaTeX to typeset many of its more recently published books.)

What I find really interesting about Cheswick's and Bazargan's proposals is that they try to solve one of the fundamental problems confronting publishers of electronic texts. Unlike the static PDF files that InDesign and Quark produce, which fix a document's pagination and fonts forever, TeX is capable of dynamically re-typesetting text to fit different orientations of an ebook reader, or to accommodate a reader's preference for a larger font size (which means, in essence, that TeX instantly generates a new DVI or PDF file as needed). Of course, one of the traditional strengths of electronic texts (such as the plain text ebooks that one can download from Project Gutenberg) has been this kind of plasticity: it is easy to open a TXT file in a word processor and customize it to one's heart's content. But as anyone who has tried to read a very long TXT document on their computer knows, these texts are not very pretty. The standard kerning and tracking between characters, especially in a basic monospace font, is very crude. Plain ASCII text also has no support for a host of typographical conventions that have informed how we have read the codex book for the last five centuries, including footnotes, sidenotes, glosses and various textual ornaments. This is why typeset PDFs are preferable in many ways for electronic reading. But these texts have never been very plastic; even zooming in on a page to make the font bigger entails constant panning from left to right and up and down. This can be particularly tedious if you are reading on a small screen, such as that of an iPhone or iPod Touch.
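To see how little machinery the re-typesetting idea requires, here is a rough sketch in Python (assuming a local pdflatex installation; the readers described in the talks embed the TeX engine directly rather than shelling out like this) of regenerating a PDF to fit a new "screen" size or font size:

```python
import pathlib
import subprocess
import tempfile

TEMPLATE = r"""
\documentclass[%dpt]{article}
\usepackage[paperwidth=%fin,paperheight=%fin,margin=0.2in]{geometry}
\begin{document}
%s
\end{document}
"""

def retypeset(body_tex: str, width_in: float, height_in: float,
              pt: int = 10) -> bytes:
    """Re-run TeX to fit the same text to a new page and font size."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "doc.tex"
        src.write_text(TEMPLATE % (pt, width_in, height_in, body_tex))
        subprocess.run(["pdflatex", "-interaction=batchmode",
                        "-output-directory", tmp, str(src)], check=True)
        return (pathlib.Path(tmp) / "doc.pdf").read_bytes()

text = "It is a truth universally acknowledged..."
portrait = retypeset(text, 3.5, 4.8)          # small-screen portrait
landscape = retypeset(text, 4.8, 3.5)         # same text, rotated
large_print = retypeset(text, 3.5, 4.8, 12)   # bigger font, new line breaks
```

Each call produces a freshly paginated, fully justified PDF, which is exactly the "new PDF file as needed" behavior described above, just without the instant-response engineering.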

It is great that there are researchers thinking about how to achieve the best of both worlds. In a way this use of TeX is an extension of HTML, since that markup language has also supported, in a more limited way, balancing page-design features (such as tables, different fonts or block quotes) with the ability to dynamically reflow text to fit within the boundaries of a web-browser window. (Incidentally, I have learned from my own experience formatting ebooks for the Amazon Kindle that the best results are achieved by submitting a file in HTML for conversion. I would thus not be surprised if Amazon's proprietary AZW format were using something very close to HTML to format its books. This would also explain why the Kindle presents a fairly decent typographical reading experience: HTML is more capable for reading than plain text. But it also explains why Kindle books are not as pretty as the text, say, in a Folio Society book.)

I found it interesting that Bazargan said that his implementation of TeX on the iPhone/iPod Touch does not support pagination. It is clear from the presentation that the software typesets a document as one long page, so that the reader can freely scroll up and down the document. This design decision sidesteps the problems implicit in letting software, no matter how smart, automatically break text, figures or equations across two pages. From my experience using InDesign, I can say that it takes a human eye to decide how best to set text around a page break.

Tuesday, September 14, 2010

The problem of making searchable PDFs

It has been a very laborious process trying to discover a free (or at least cheap) solution for making image-based PDFs searchable using software that runs in either Mac OS X or Linux. This is a particularly pressing need for me given the number of books and other paper-based documents that I scan on a regular basis. Interestingly, the packaged software that came with my flatbed scanner, a CanoScan LiDE 70, was able to effortlessly add a text layer to my scans under Windows XP. However, since I changed computers and operating systems, I have been using VueScan as my scanning app. While the version of this software (8.6.23) that I have been using can OCR text and write the output to a TXT file, it cannot produce searchable PDFs. (I just noticed that a newer version (8.6.33) released this past May actually does add support for creating searchable PDFs. I will definitely download this. I should also note in passing that VueScan adds functionality that Canon's packaged drivers and software lacked, such as the ability to operate continuously through a multi-page scan, eliminating the need to constantly hit the scan button.)

In any case, I need a solution for converting the numerous files I have already produced that are simply image-based. My goal has been to avoid buying an expensive OCR and PDF creation suite, such as OmniPage Pro or ABBYY FineReader, which can create searchable PDFs. Most of the free software that I have been able to find through Googling has been designed to work from the Linux command line. I am willing to use this software as a solution because I have an older laptop on which I have installed Ubuntu 9.10, and I am not against shuffling PDFs between my MacBook Pro and that machine in order to post-process my scans. (This workflow seems also to be the engineering solution of choice, especially in larger networked settings, since there is a live-CD-based Linux distro designed just for handling this task.)

The first program that I tried was pdfocr. I was able to successfully install all the necessary packages, and I was initially encouraged that the software processed the first PDF I fed it page by page without balking. However, the script constantly complained that each page image was not at the anticipated resolution of 300 DPI, and there does not seem to be a command-line option that allows this value to be changed. (Most of the book scans that I have done are at 150 DPI, mostly because this resolution is usable for screen reading and it speeds up the scanning process. At resolutions of 300 DPI and above the scanning head on my scanner simply crawls.) The final output was disappointing. Though pdfocr successfully added an OCR layer to each page, the underlying text was set at far too large a point size and thus out of all proportion to the image text. This layer is not usable either for highlighting with PDF annotation software or for searching to find where a word or phrase specifically occurs.

The second command-line program that I tried, a custom bash script described in this blog post, suffered from the same problem. This script uses the same OCR engine (Cuneiform) and OCR data format (hOCR) as the first program I tried. This tells me that, whatever their OCR accuracy, Cuneiform and hOCR may not be suitable for this application. At the very least, a programmer with more knowledge than I have needs to create more robust options for working with my set of files.

Given that Google Book Search is able to use its Tesseract OCR software to produce accurate (and accurately placed) text data for page scans, it should not be that difficult to find a free and efficient solution to use on my own computer.
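For what it is worth, the building blocks for such a pipeline do exist as free command-line tools. Here is a sketch of one way to assemble them (assuming poppler-utils and a Tesseract build that supports the "pdf" output option; exact flags vary by version):

```python
import glob
import subprocess
import sys

def make_searchable(in_pdf: str, out_pdf: str, dpi: int = 150) -> None:
    """Rasterize a PDF, OCR each page into a one-page searchable PDF,
    then stitch the pages back together."""
    # 1. Render pages to images (pdftoppm zero-pads page numbers,
    #    so a lexical sort keeps them in order).
    subprocess.run(["pdftoppm", "-r", str(dpi), "-png", in_pdf, "page"],
                   check=True)
    # 2. OCR each page; the 'pdf' config tells Tesseract to overlay an
    #    invisible, correctly positioned text layer on the page image.
    pages = sorted(glob.glob("page-*.png"))
    for img in pages:
        subprocess.run(["tesseract", img, img[:-4], "pdf"], check=True)
    # 3. Merge the per-page PDFs (pdfunite ships with poppler-utils).
    subprocess.run(["pdfunite", *[p[:-4] + ".pdf" for p in pages], out_pdf],
                   check=True)

if __name__ == "__main__":
    make_searchable(sys.argv[1], sys.argv[2])
```

Because the text layer is placed by the OCR engine itself rather than guessed at afterward, this approach should avoid the mis-sized, mis-positioned text that pdfocr produced.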

Friday, August 20, 2010

Review of my first Kindle, the Kindle 1


In 2007 I received a first-generation Kindle.  In summary, I would say that the Kindle matched the functionality of the Rocket eBook while adding a host of technological improvements and one key feature that NuvoMedia could never muster (namely, seamless integration with an online bookstore).  In fact, the free, always-on cellular connection gave the device access not only to the Kindle Store but also to the World Wide Web.  Granted, the browser in the device was very crude -- suitable for displaying text-based websites only.  This made the Kindle a very good Wikipedia reader, for example.  The programmers included shortcuts in the search system that made using the web browser in this way easier.  For example, prefacing a search with the term "@wiki" would search Wikipedia for the specified term and automatically load the most relevant article.  Similarly, "@web" allowed for quick Google searches.

The Kindle was most special for its use of E-Ink display technology.  The screen went a long way toward relieving eyestrain by mimicking the properties of a printed page.  Unlike CRT or LCD computer monitors, which project light out at you, an E-Ink display is illuminated by ambient light.  This is why a Kindle reads very well in direct sunlight or under a reading light.  Of course, in darkness it can be a hassle to always need a reading light.  Perhaps one advantage of an older e-reader like the Rocket eBook, or even a laptop, is that it provides its own backlight illumination.

The Kindle was also the first e-reader that I was able to finagle into displaying foreign-language texts.  Mind you, this was not because the Kindle came with any native support for foreign alphabets (the Kindle 1 only supported the ISO 8859-1 (Latin-1) character set).  I was only able to read Anna Karenina in Russian on my Kindle because the device has a hidden image-viewing application that can be used to display page images.  Follow these directions to reproduce my workflow for preparing a text: 

1. Download a foreign-language text in HTML or plain text. 
2. Typeset it in a modern word processor (I use OpenOffice) using the custom page dimensions 3.5" x 5" (which approximates the size of the Kindle display). The margins on all sides should be 0.1".
3. Export a PDF of the document.
4. Use an application like PDF2PNG to create a batch set of image files from the PDF representing each page of the text (see the scripted sketch after this list). These files should be placed inside a folder labeled with the title of the work. This will be the title that displays in the Kindle's main menu.
5. Drag this folder to a "pictures" folder on the Kindle.
6. Press the keys "Alt-Z" while at the home screen to make the book you added appear in the list of available reading matter.
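For anyone who wants to script steps 4 through 6, poppler's pdftoppm can stand in for a GUI converter like PDF2PNG. A minimal sketch, assuming pdftoppm is installed and that the Kindle 1 screen is roughly 600 x 800 pixels:

```python
import pathlib
import subprocess

def pdf_to_kindle_folder(pdf_path: str, title: str) -> None:
    """Render each PDF page as a PNG sized for the Kindle 1 screen,
    into a folder named after the book (the name shown in the menu)."""
    out_dir = pathlib.Path(title)
    out_dir.mkdir(exist_ok=True)
    # -scale-to fits the long edge of each page to the given pixel count,
    # so a 3.5" x 5" portrait page comes out about 800 pixels tall.
    subprocess.run(["pdftoppm", "-png", "-scale-to", "800",
                    pdf_path, str(out_dir / "page")], check=True)

pdf_to_kindle_folder("anna_karenina.pdf", "Anna Karenina")
# Then drag the "Anna Karenina" folder into the Kindle's "pictures"
# folder and press Alt-Z at the home screen (steps 5-6 above).
```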

Unlike the Rocket eBook, the Kindle made it easy to extract your textual annotations to your computer for use in other applications.  All annotations were collected into a plain text file that could easily be copied to the computer when the Kindle was attached via USB.  As of last year it also became possible to sync and view these annotations online at Amazon's website.  Of course, this is not the same as being able to transfer text and annotations together and, in turn, view them together outside of the device.  I do not think that these ways of recording and presenting notes compare favorably with what is possible with good PDF annotation software on a computer (see my earlier post).

I should also mention that the Kindle was a much flimsier device than the Rocket eBook.  In fact, I broke the screen of my first Kindle within weeks of receiving it (I had mistakenly placed the device under a heavy book, which cracked the screen).  Thankfully, Amazon replaced the device free of charge.  The second device, which I was sent in early 2008, has lasted to the present.  However, I have had to replace the battery once, and most recently the modem has started to work only intermittently, forcing me to connect the e-reader via USB if I want to reliably transfer documents and books.

I no longer use this Kindle as my primary e-book reader, having purchased a Kindle 2 late last year.  However, I will not discuss that device separately, since it has many of the same features and much of the same functionality as the Kindle 1.

Saturday, August 14, 2010

Remembering the Rocket eBook, the true pioneer of eBook readers


I was thinking that this blog would be an appropriate venue to discuss eBook readers, especially since in recent years they have really started to come into their own as separate appliances.  Certainly it could be argued that these devices have reached a tipping point in the mass consciousness.  I have actually used an eBook reader of one sort or another on and off for the last ten years.  For much of that time I used a Rocket eBook 1000.  This device was by no measure common and never gained a wide following.  It appears that the page I linked to is an advertisement from circa 2000 (I would link to a Wikipedia article, but there is none).  It is amusing that the page boasts that the now-defunct NuvoMedia had sold "tens of thousands" of the reader.  By comparison, Amazon has sold three or four million Kindles, and the Kindle is also supposedly a niche device for serious readers.

In late high school and early college I used the reader to take advantage of Project Gutenberg public domain texts.  Especially at that time, reading a whole book on a curved CRT monitor was a much more daunting prospect than reading on a modern, flat, high-resolution LCD screen.  The reader's low-resolution black-on-green display was only as good as a Palm Pilot's, and yet the screen was large enough (as large as the Kindle's, in fact) to read comfortably for hours at a time.

My Rocket eBook was the way in which I read all of the Constance Garnett translations of Russian literature, including War and Peace, Anna Karenina, The Gambler, Crime and Punishment and Dead Souls.  I made many annotations and underlined just as many passages from these works.  The only problem was that at the end of the reader's life it was difficult to transfer this information back to my computer.  For that matter, it was difficult getting any information, including the actual books themselves, off the device.  Naturally, the reader was not very good for any kind of reading where one could expect to incorporate annotations into a Word document on a computer, for example.

The device could display only ASCII text, which meant that trying to use the reader for anything but English-language texts was nigh impossible.  After I started learning Russian, I racked my brains trying to figure out a way to trick the device into displaying Cyrillic.  (Since the reader could display GIF images, I even experimented with converting pages of Russian text into small image files.  This, alas, did not really work very well.  I will talk about how I implemented this solution on my first-generation Kindle in my next post.)

The Rocket eBook anticipated Apple's current generation of mobile devices by basing the whole interface around a touch screen.  You selected text with the stylus to make underlines, and tapped an on-screen keyboard to enter notes.  (The handwriting recognition, like the Palm's, was truly awful.)  And like the iPad, iPhone or iPod Touch, the device could display text in either portrait or landscape mode.

For its time the Rocket eBook was a very nice appliance.  It was built using hard plastics that I do not see in many consumer electronics today.  The fact that it survived from 2000 to 2007 through near daily use speaks to the quality of its construction.  (The fact that the screen showed nary a scratch after seven years of tapping and dragging with the stylus is perhaps more impressive).  I only retired it because I received a Kindle for Christmas 2007.  I fetched a handsome price for the Rocket eBook when I sold it on eBay (the reader does indeed have a small following of devoted fans), and the lady who won the auction wrote me an email afterwards describing how much she loved her first Rocket eBook.