Google and the “great digitization”

I’ve been quietly reading about the Google deal with the libraries of Stanford, the University of Michigan, Harvard, and Oxford, and with the New York Public Library, and the resultant debates/rants on various blogs. I didn’t really want to go off on a half-cocked rant of my own, so I’ve spent the last few days thinking about what Google’s digitization could mean and what objections, if any, I would have to it. I’ve seen many glowing and positive comments about the deal and many negative and apocalyptic comments. Mostly, I’ve seen cautiously optimistic but somewhat skeptical comments from the blogs I read.

To get background on the story, I’d suggest the Chronicle of Higher Education, the New York Times, and Search Engine Watch (this one will probably be the most interesting for librarians). The gist is that Google is going to digitize works from these five libraries. Google is going to do the scanning at each library, though the library itself will be making the selection decisions. The libraries will each get their own copy of their digitized works for use at their own library. They will primarily be digitizing books that are out of copyright; copyrighted works will also be digitized, but only excerpts will be shown, along with links to where the books can be purchased. This is an extension of Google Print, and the books will be searchable through the regular Google search interface (though the digitized works will somehow be separated from other search results).

Digitization is something I’ve been interested in since before I decided to go to library school. In fact, the last paper of my academic career was on the subject of digitization. In it, I brought up many of the concerns that came to mind when I learned about Google’s digitization project.

Selection – While digitizing every book in the University of Michigan Libraries sounds fantastic, is it really necessary? Not every book is worth digitizing. Some books may not have much research value and other books may be so unpopular that they never circulate. If there is little chance that a digitized book is going to be utilized, it is certainly not worth the expense to digitize it. It just feels piggy of the Michigan Library to digitize everything without any consideration of selection criteria. There may be valuable holdings in other libraries that would make far more sense to digitize.

Additionally, Harvard says it is going to digitize 40,000 works from the Harvard Depository. These are books that are placed in storage because they are not often needed or are too fragile to be used by the public. Obviously for a trial run, Harvard is not going to give Google their most fragile books, so it seems that they are going to digitize works that people aren’t really that interested in reading. I understand that it’s a trial, but, considering the cost of digitization, wouldn’t it make more sense to digitize works people might actually be interested in using?

For a great discussion of the cost of digitization, see this article from RLG DigiNews.

Preservation – this is a biggie, and something I haven’t seen mentioned by Google, the libraries or the media. The digital medium is by no means a preservation medium. The average lifespan of any file format is approximately three to five years. After that time, it may be difficult or even impossible to access the files on contemporary machines unless some preservation measures are taken. There are a variety of methods for digital preservation, but all are expensive or time consuming, and all are incomplete solutions. Without some form of digital preservation, however, the files will quickly become obsolete and inaccessible. Preservation can end up costing far more than the original expense of digitizing the work. Who is going to be responsible for this? Google? The libraries? I can’t imagine the libraries didn’t think of this when dealing with Google, but I’d like to know how it’s going to be handled.

Funding – I have been reading a lot of blog entries about how horrible it is for a corporate entity to be funding and conducting such a large-scale digitization project. Mark Rosenzweig in particular had a scathing review of the deal (see his entire diatribe at Free Range Librarian):

There is something absolutely mind-boggling about the ability of a single, for-profit company being able to shape, to radically re-direct, the future of a whole sphere of life. Even more so when it enlists the cooperation of the public stewards of that sphere in what amounts to a relinquishment of key elements of responsibility to an unabashedly profit-driven mega-corporation.

Dorothea at Caveat Lector had a nice rebuttal:

We leave digitization to the moneyed interests one way or another. Database aggregators (who all too often do a truly craptastic job of it). Publishers (ditto). Grants. If Google had handed the money over to the universities, would that solve Rosenzweig’s angst? (I bet it would, even though it’s the same damn money. My guess is that his real problem is feeling useless because somebody’s doing text digitization who isn’t a librarian. To which I say, I didn’t learn to do it as a librarian. Join the real world where lots of people are doing it, a few as effectively or more so than librarians.)

I can’t say I have no concerns about a mega-corporation holding all this content, but it seems like the libraries have done their homework and have not blindly made a deal with the devil. They have thought about how advertisements will be handled, who will own the content, etc. But I am still concerned about what this all will look like in the future. Sure, it would be great if universities could afford to do all this digitization on their own, but most cannot, and most have had to make deals with one devil or another to get things done. This is the reality of our situation.

Privacy – the Internet is not like a library. There isn’t a pretty little code of ethics which values the patron’s right to privacy. The Internet collects information on what you read. Though this is usually for market research and advertising purposes, it has also been used to find out who is reading about bomb making and Jihads. What would stop Google or the government from using their digital library as a sting operation? How will the privacy of patrons be preserved? I think the answer is, it won’t. Sure, Google could make it so that the IP address of each user is not recorded, but I sincerely doubt they will. Rory Litwin brings up these privacy concerns (and many others) in a well-reasoned Library Juice article today.

Technology – this is where having Google involved is going to be a huge benefit. Google has the money and technological know-how to greatly improve the technologies and practices used in digitization. Google has apparently already developed new scanners, and perhaps they will help to develop more technologies that can be used by other universities in the future. Perhaps they will help to create better file format and encoding standards (right now, libraries do not all use the same standards in digitization). What Google learns from this project could be valuable to all libraries. I certainly hope that Google is willing to share those experiences and technologies. Dorothea also expressed some concerns about Google’s willingness to share with the other children.

All in all, I think anything that brings improved free information access to the world is a good thing. I’m quite excited about this. But I do think we need to keep our eyes wide open to what Google is doing. In spite of their noble proclamations, they are not doing this out of charity, and we should always have some skepticism when a for-profit company is involved in anything related to libraries (not just Google).

If you want to see more commentary on this development, here are some links:

For: lbr, Caveat Lector, and Jon Udell.

Against or with serious reservations: Mark Rosenzweig via Free Range Librarian, Rory Litwin at Library Juice, and Library Law Blog.

On the fence: Resource Shelf, Research Buzz, and Blake at LISNews.