Mass Digitization of Books

Overview

The mass digitization of books refers to the ambitious large-scale project, begun early in the twenty-first century, to convert the entire published works of humanity since antiquity into digital form, that is, to move away from bound books, periodicals, and manuscripts and to make that immense body of information available worldwide through user-friendly digital devices. The idea behind mass digitization is both accessibility and preservation; after all, printed books and brick-and-mortar libraries are subject to a kind of absolute destruction that digital formats could resist, and making these artifacts available through digital platforms would give reading a far wider reach, in essence democratizing it. Anyone with a digital device could theoretically have any book ever published at their fingertips. The mass digitization of books, if and when it can be accomplished, would in principle create something unprecedented in human history: an authentic global culture.

The daunting task (to optically scan and copy into a reader-friendly format every book, every article, every essay, every poem, every play, every novel ever published in any language and in any era) has been, since 2004, under the auspices of a loosely related (and at times competing) group of international book services, publishers, online providers, and libraries. Concerns over the legal ramifications of digitization, in particular difficult questions of copyright protection in an era of easy reproduction and dissemination, as well as thorny questions about how to select texts for inclusion, have created significant controversies around the very idea of the project. The project has also prompted questions about the place of books in a culture's sense of itself and its history. In the United States, for example, books that defended Jim Crow legislation or that voiced hate rhetoric toward minority populations, women, immigrants, or Native Americans are, for better or worse, part of the national identity, part of the cultural legacy and cultural burden; a vetting process would have to take into account incendiary texts and consider them for inclusion or exclusion. An argument could be made to relegate problematic texts to a sort of shadow library. The digitization process, in short, is hardly neutral. Nanna Bonde Thylstrup, distinguished professor of cultural studies at the University of Copenhagen, argued in her 2018 study that mass digitization is not, indeed cannot be, neutral, and that digitization introduces a subtle and new way for agencies (and governments) to shape cultural memory. Thus, the availability of the process far outstrips the ability to consider the long-term ramifications of the mass digitization project.

The availability of the Internet as a resource for information and entertainment prompted early predictions that physical libraries themselves would become museums. Indeed, the long-term digitization of books could make obsolete the concept of a library as a valued sanctuary and privileged public place, a central community forum wherein information is shelved and made available to patrons. Bookstores, a billion-dollar industry worldwide, could become nostalgic throwbacks, quaint antique and curiosity shops, reminders of how business used to be done. In addition, the mass digitization of books will inevitably reshape the concept of publishing itself and how publishers will make their profits. It may entirely restructure the relationship between author and editor and ultimately between author and reader.

It is difficult to comprehend the immensity of the digitization challenge. Although measurements are difficult because of the nature of book publishing, it was estimated that between 500,000 and 1 million books were published in the United States in 2021, with an additional 2.3 million titles that were self-published. The volume of self-published titles was made possible by the introduction of e-book and print-on-demand publishing, which enables writers to upload their works directly to online retail sites, circumventing traditional publishing houses. Projects that aim to digitize books in order to make these resources more accessible and more convenient to use have existed nearly as long as digital technology itself. In the early 1970s, Project Gutenberg began its mission to provide electronic texts of great works that were in the public domain, that is, works that were no longer covered by copyright protections. The effort, staffed largely by volunteers and funded by public and private contributions, was initially very modest. By 2023, however, the digital library offered more than 70,000 individual volumes free online.

With the introduction of e-book technologies, most notably Kindle readers in the first decade of the twenty-first century, publishers used the technology as a way to reach a much wider market than could be reached through conventional bookstores or even online purchases. As with music downloading a decade before, the arrival of e-readers raised new issues created by the ease of pirating copies of digitized texts; new books could be downloaded for private use and then shared without compensation provided to the publisher and, by extension, to the author. Nevertheless, e-reading revolutionized the reading industry and positioned books to begin an unprecedented era of popularity. By 2023, data indicated that Amazon's Kindle Store offered more than four million books. While Kindle is by far the largest provider of e-books in the United States, its competitors, including Kobo, Barnes & Noble, and Apple's iBooks, also fielded a vast number of commercial and self-published titles.

Retail numbers, however, pale in comparison with the target set by the first (and most important) of the mass digitization projects, Google Books, which debuted in 2004. The brainchild of Google co-founder Larry Page, a giant in digital technology who is nevertheless an admitted and committed bibliophile, Google Books was launched with much fanfare as a project-in-process that would provide low-cost access to an ever-widening database of books. Predictably, the process itself, done manually in so-called scanning centers, was initially slow. An operator would feed a typical 200-page book through the scanner a page at a time, adjusting each page to center the text, and the book would take just under thirty minutes to scan and store. Google assured participating libraries that books would be carefully treated. Unlike photocopying, which requires that the text be flattened to copy, a digital scanner would rest the text in a sort of V-shaped cradle to avoid damaging the spine. The scale of the project made headlines: Google would partner with five major libraries, namely the university collections of Harvard, Oxford, Michigan, and Stanford as well as the entire collection of the New York Public Library. Collectively, those libraries would provide access to just over 60 percent of the entire published works of humanity. That was a staggering volume of material.

By 2023, Google Books estimated it had indexed more than 40 million books. Google Books provided a variety of levels of access and service to its users. If a key term, for instance, were fed into the system, Google Books would supply access to books that had that key term in their pages, with the term highlighted in yellow. In addition, Google Books provided previews of the books, that is, full and free access to a limited number of pages of the book, to give the user a sense of its argument, its style, and its usefulness. Users would be directed to other avenues for obtaining the full text, most often through a paid site. Google Books is thus more of an index or an inventory than a traditional library, more like a massive card catalog or browsing service than a bookstore.

Because Google Books provided access to texts unless publishers (or authors) specifically and directly opted out, questions about copyright protection were raised. The Google process was considered shadowy; in response, in 2005, an association of university libraries began a rival service, known as the Open Content Alliance, which, as Coyle documents (2006), touted itself as a completely transparent book digitization service that provided digital copies only after legal permission had been secured. However, the service ended in 2008.

Further Insights

Converting books and manuscripts into digital form might seem a most benign and even useful application of digital technologies. Many individual university libraries and special collection libraries focused on a single era or a particular author, for example, have already successfully converted their holdings to digitized versions and have placed the originals in appropriate vaults for long-term preservation. These are relatively localized successes. Mass digitization, however, raises substantive objections beyond the nostalgia V. Kampoor (2017) articulated for the heft and feel of bound books or the longstanding role of public libraries in communities. There are legitimate questions over how inclusive such an archive should be or needs to be, that is, over who decides on the inclusion of certain texts. As Hahn points out (2007), large-scale digitization, without any central directing agency, can create errors and perpetuate deliberate alterations in texts. A Shakespeare play or a Walt Whitman poem, for instance, can exist in multiple forms, shaped and reshaped by careless reproduction, by interfering editors, or by government, religious, or public agency censors. The inclusion or exclusion of any of these variants says something about a culture. Digitization potentially provides an opportunity to shape a culture's memory by selecting or erasing its texts.

More to the point, the digitization process, still tied largely as it is to manual labor, is prone to errors. Although no one disputes that digital platforms can provide access that is far cheaper than investing in bound books or paper reproductions, the process can itself raise problems. Poorly centered pages, blurred images, and pages out of sequence are all byproducts of the scanning process. Although technology will no doubt improve, optical scanners using optical character recognition software do not provide the easiest-to-read reproductions, especially where the physical original is in poor or deteriorating condition. In addition, the process is comparatively slow; completing the mass digitization of books may itself be perpetually open-ended as the number of new texts outstrips the ability of the technology, even if it were entirely automated, to keep up with production. However, Google Books has worked to improve the scanner technology, keeping its technological advances a guarded industry secret. The results over the course of a single decade were noticeable.

Far more complicated problems, however, arise from questions about copyright protection and about how current intellectual property laws apply to the potential for widespread accessibility and reproduction posed by the mass digitization project. Laws lag far behind technology. Copyright protection for both publishers and authors has existed only for a relatively brief time; the first statutes addressing the issue date to Britain in the first decades of the eighteenth century, more than 250 years after the invention of the printing press first created the problem of piracy and property infringement. As with music and movies, media that have faced similar piracy issues, policing book rights will challenge computer technicians to design software that protects the documents and ensures the rights of authors and publishers to appropriate compensation. There are simply no enforceable laws in place. If properly regulated, digitization of books offers publishers and authors quicker and easier access to a wide, indeed global, marketplace; until that protection is in place, however, the process poses significant legal (and ethical) challenges. The impact of digitization on traditional booksellers and even online sales platforms like Amazon and, more important, its long-term impact on the role of publishing houses themselves have only begun to be measured.

Viewpoints

Even critics acknowledge that the mass digitization of books offers a historic opportunity to create a global reading public, a way to encourage nations and their governments to finally address long-standing problems with widespread illiteracy. More to the point, by making such a body of knowledge available, digitization of books will enhance research opportunities, offering readers and scholars access to a far wider body of materials, particularly the large archive of manuscripts produced before modern printing techniques, manuscripts that have long been housed in protective environments with only limited access permitted. The digitization process will help preserve those documents by limiting handling. By providing wide access to centuries of materials, mass digitization will give readers the chance to trace the evolution of ideas (political, social, economic, religious) in a way that will help readers and researchers better understand history and culture. It is, advocates argue, a system with virtually limitless promise. More practically, given the ever-increasing body of analog (or printed) materials, libraries can provide additional services and reallocate their space for patron use. Indeed, shelving archives is an area of tremendous concern to librarians and library administrators; despite elaborate shelving arrangements, libraries simply run out of space and are forced to warehouse or discard lesser-used books.

Advocates predict that the mass digitization of books will ultimately fulfill the incipient promise of libraries and bookstores themselves: a comprehensive, searchable digital index that provides a helpful overview of and convenient access to humanity's entire body of knowledge. Even in its early stages, mass digitization has provided a revolutionary new way to share information and to read the widest possible range of materials desired, all as part of an ever-growing community of readers. Yet, as Adam Hammond argues in the introduction to his 2016 study of the impact of digital technologies on books, the "fate of print is by no means sealed"; in fact, print has proven to be a durable medium, and reports of its imminent demise have turned out to be premature. A study published in the American Economic Journal: Economic Policy (Nagaraj & Reimers, 2023) found that the availability of texts on Google Books actually increased the sale of physical copies of those books.

Bibliography

Coyle, K. (2006). Mass digitization of books. The Journal of Academic Librarianship, 32(6), 641–645. https://doi.org/10.1016/j.acalib.2006.08.002

Ginsburg, J. C. (2016). Berne-forbidden formalities and mass digitization. Boston University Law Review, 96(3), 745–775. Retrieved May 23, 2018 from EBSCO Online Database Academic Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=118226654&site=ehost-live

Hahn, T. B. (2007). Impacts of mass digitization projects on libraries and information policy. Bulletin of the Association for Information Science and Technology, 33(1), 20–24. https://doi.org/10.1002/bult.2006.1720330108

Hammond, A. (2016). Literature in the digital age: An introduction. New York: Cambridge UP.

Koonce, L. (2016). Another page in the Google Books saga: Appeals court blesses mass digitization project as fair use. Intellectual Property & Technology Law Journal, 28(2), 20–23. Retrieved May 23, 2018 from EBSCO Online Database Academic Source Ultimate. http://search.ebscohost.com/login.aspx?direct=true&db=bsu&AN=114565919&site=ehost-live

Nagaraj, A., & Reimers, I. (2023, November). Digitization and the market for physical works: Evidence from the Google Books project. American Economic Journal: Economic Policy, 15(4), 428–458.

Piersanti, S. (2023, March 1). The 10 Awful Truths about Book Publishing. Berrett-Koehler Publishers. ideas.bkconnection.com/10-awful-truths-about-publishing

Samuelson, P. (2014). Mass digitization as fair use. Communications of the ACM, 57(3), 20–22. doi:10.1145/2566965

Thylstrup, N. B. (2018). The politics of mass digitization. Cambridge, MA: MIT Press.

Tondelli, C. R. (2021). Mass digitization and the consumer book market of the future. Loyola Consumer Law Review, 33(2), lawecommons.luc.edu/cgi/viewcontent.cgi?article=2079&context=lclr