Digital Libraries

 

Neil C. Rowe

U.S. Naval Postgraduate School

 

Digital libraries are the digital counterparts of traditional libraries of books and periodicals. They hold digital representations in minimally structured formats for all kinds of archival human-readable information ("documents"). Primarily they contain text, but now increasingly they include multimedia data like images, audio, and video. Usually digital libraries are distinguished from database systems (see Distributed Databases and Distributed File Systems), data archives (see Data Warehousing), and "knowledge bases" for artificial intelligence (see Knowledge) that all hold well-structured data. They are often implemented as services on the Internet, and many have World Wide Web interfaces (see Internet, Network Architecture, and World Wide Web). In fact, the World Wide Web can be considered as one big digital library. Digital libraries are the most important kind of "information retrieval" system [4, 5].

 

Strong economic incentives are driving publishers and other "information providers" to put information in digital libraries [2]. Production and distribution account for most of the cost of a book or periodical. Digital delivery of information ("electronic publishing") is cheaper than traditional delivery since production steps are eliminated (documents remain in digital form throughout the process) and distribution is simpler (readers can come to the provider's Internet site) [1]. Electronic publishing permits much easier copying, correction, and update of a document. It can enable better preservation of information (paper is fragile, easily damaged by fire, flood, and other dangers, and progressively deteriorates with time), although obsolete digital formats can be preservation problems too. Readers often prefer digital libraries to traditional ones because they do not require the physical presence of the user, but can be used from anywhere with computer-network access. This permits freer and faster dissemination of human knowledge. Furthermore, users can exploit helpful software to find what they want in a digital library, and need only pay for services they actually use. This is important because the ever-increasing amount of information available is overwhelming our ability to track it; for instance, the number of scientific journals has been doubling every 15 years. Digital libraries also permit new applications such as data mining (see Data Mining), which seeks hidden patterns in large amounts of information.

 

Meanwhile, computer technology is making digital libraries cheaper to implement, and all sorts of documents have become available in digital form. Memory is the critical technology [2]. The cost of large magnetic-disk memory ("hard drives") is being halved roughly every two years, while the cost of optical-disk memory is also decreasing. As of January 2000, magnetic-disk read-write memory was around $15 US for a gigabyte (1,000,000,000 bytes) of data; optical-disk write-once memory using CD-ROM technology was $2 US for the 650-megabyte (650,000,000-byte) standard disk. The physical size of storage devices is also continuing to decrease. Soon people will be able to download the books, music, and even the video they want from the Internet and carry them around in small hand-held digital devices.

 

A book typically holds one megabyte of information in its text. A million-volume traditional library of books (the size of a typical university library) holds about 1 terabyte (one million million bytes) in its text. Multimedia information can require significantly more per page (see Multimedia Systems and Multimedia Information Systems). While block diagrams and other simple graphics do not need much, a typical color photograph requires a quarter of a megabyte for adequate reproduction even with image compression, so a book with 100 photographs will need about 25 times more storage than one without. Audio and especially video require several orders of magnitude more than that. Nonetheless, the rapidly decreasing cost of computer memory has made possible the routine storage of book contents online, and will soon make possible the routine storage of audio and video. Cheap storage also permits archiving of less polished written materials like newspapers, discussions, memorabilia, and experimental data.
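
The storage and cost figures above can be checked with a little arithmetic. The following Python sketch uses only numbers quoted in this article (one megabyte of text per book, a quarter megabyte per photograph, $15 per gigabyte of magnetic disk in January 2000, cost halving every two years); the function and constant names are illustrative, and the cost projection simply assumes that the halving trend continues.

```python
# Back-of-the-envelope storage estimates using the figures quoted in the text.
MB = 1_000_000
GB = 1_000 * MB
TB = 1_000 * GB

TEXT_BYTES_PER_BOOK = 1 * MB      # text of a typical book
PHOTO_BYTES = MB // 4             # one compressed color photograph
DISK_DOLLARS_PER_GB_2000 = 15.0   # magnetic disk, January 2000
HALVING_PERIOD_YEARS = 2.0        # "halved roughly every two years"


def book_bytes(photos=0):
    """Storage for one book with the given number of photographs."""
    return TEXT_BYTES_PER_BOOK + photos * PHOTO_BYTES


def library_bytes(volumes, photos_per_book=0):
    """Storage for a whole library of similar books."""
    return volumes * book_bytes(photos_per_book)


def dollars_per_gb(years_after_2000):
    """Projected disk price, assuming the halving trend continues."""
    return DISK_DOLLARS_PER_GB_2000 * 0.5 ** (years_after_2000 / HALVING_PERIOD_YEARS)


def disk_cost(n_bytes, years_after_2000=0.0):
    """Dollar cost of magnetic disk needed to hold n_bytes."""
    return n_bytes / GB * dollars_per_gb(years_after_2000)


if __name__ == "__main__":
    university = library_bytes(1_000_000)
    print(university / TB)              # 1.0 terabyte of text
    print(book_bytes(photos=100) / MB)  # 26.0 megabytes for an illustrated book
    print(disk_cost(university))        # 15000.0 dollars at January 2000 prices
    print(dollars_per_gb(4))            # 3.75 dollars per gigabyte four years later
```

On these assumptions the text of an entire university library fits on about $15,000 of disk at January 2000 prices, which is why the routine online storage of book contents has become practical.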

 

Implementation issues

 

The major challenge in implementing digital libraries is in handling their large storage and networking requirements [6]. While memory is getting cheaper, access time is still a problem:

  1. The user must first search to find what they want, and this can take time even with good indexes and hash tables (see below).
  2. Magnetic disks require a few hundredths of a second to get ready to read data, and optical disks in a "jukebox" require several seconds to load a new disk.
  3. Many digital libraries have Web interfaces through a single computer (a server) which can be a bottleneck when many or large requests arrive at the same time (see Client-Server Computing and Server Architecture); malicious users may actually aggravate bottlenecks by flooding a site with requests (see Security and Protection).
  4. Delivery of the information across the Internet can take time: more than two minutes to transmit the text of an average book at the current maximum telephone-line modem speed of 7,000 bytes per second, and about an hour for a book with 100 photographs. Faster digital lines like ISDN, T-carrier, and ATM ones at the library and user sites can help, but may not help much because transmission speed is limited by the slowest part of the connection path.
  5. Text and image format conversions may be required to read a retrieved document. HTML, XML, PDF, and PostScript document formats are currently the most popular, but format popularity has changed quickly in the past, and a variety of important specialized formats must be supported too.

So significant time must be anticipated to retrieve documents from digital libraries. Some of this effort may be avoided if documents important to a user can be anticipated and preloaded on an optical disk or an "electronic book" device.
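
The transmission delay in item 4 can be estimated directly. The Python sketch below is a rough model, not a measurement: it uses the article's own figures (one megabyte of text, a quarter megabyte per photograph, a 7,000 byte-per-second modem), treats the slowest link as the only limit, and ignores protocol overhead, server load, and disk latency, all of which would only lengthen real transfers. The function name and the example path are illustrative.

```python
# Rough transmission-time estimate: the slowest link on the path sets the speed.
MB = 1_000_000
MODEM_BYTES_PER_SEC = 7_000   # telephone-line modem speed quoted in the text
ISDN_BYTES_PER_SEC = 16_000   # 128 kilobits per second, for comparison


def transfer_minutes(n_bytes, link_speeds):
    """Minutes to send n_bytes over a path whose links have the given speeds."""
    bottleneck = min(link_speeds)   # bytes per second of the slowest link
    return n_bytes / bottleneck / 60


if __name__ == "__main__":
    text_only = 1 * MB
    illustrated = 1 * MB + 100 * (MB // 4)   # text plus 100 photographs

    print(round(transfer_minutes(text_only, [MODEM_BYTES_PER_SEC]), 1))    # 2.4
    print(round(transfer_minutes(illustrated, [MODEM_BYTES_PER_SEC]), 1))  # 61.9

    # A fast line at the library does not help if the user still has a modem.
    mixed_path = [ISDN_BYTES_PER_SEC, MODEM_BYTES_PER_SEC]
    print(round(transfer_minutes(illustrated, mixed_path), 1))             # 61.9
```

The last line illustrates the bottleneck point: upgrading only one end of the connection leaves the delivery time unchanged.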

 

A big library is unusable unless users have a good way to find things in it. Much research in information retrieval has explored information-selection interfaces [4, 5]. Links to other documents ("hypertext pointers") provide a way to explicitly connect documents for browsing, but someone must use judgement to make the connections, and the most useful connections can be ones that no one has anticipated. Descriptions of documents ("metadata") and topic outlines of library contents ("subject trees") that the user can examine are used by some systems, but someone must manually describe or classify the documents, which limits the coverage and tends to confine information to the most obvious. So flexible query interfaces are the most popular. A query can be just a list of terms that the user wants to see in a document, as with most Web "search engines". More sophisticated query interfaces allow specification of where the terms appear (e.g. title, author, subject, abstract, caption, or text), incomplete specification of words (like "conferenc*" to cover "conference", "conferencing", and "conferenced"), and boolean expressions describing co-occurrence conditions on query terms (like "moon and (picture or image) and not Apollo"). Some query interfaces take natural-language input in the form of phrases and translate it using artificial-intelligence methods into an unambiguous formal logical specification. Other interfaces use "relevance feedback", reasoning about the characteristics of previous documents that the user liked so as to modify the query to retrieve other good documents.
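
A boolean query with incomplete word specifications can be evaluated quite simply on a small collection. The Python sketch below is one possible illustration, not the method of any particular search engine: queries are built from the hypothetical helper functions term, AND, OR, and NOT rather than parsed from text, each document is reduced to its set of words, and the standard-library fnmatch module handles the "*" wildcard. The sample documents are invented.

```python
import fnmatch
import re


def words(text):
    """Lower-cased set of the words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term(pattern):
    """Query that is true if any word matches the pattern ('*' is a wildcard)."""
    pattern = pattern.lower()
    return lambda doc_words: any(fnmatch.fnmatchcase(w, pattern) for w in doc_words)


def AND(*queries):
    return lambda doc_words: all(q(doc_words) for q in queries)


def OR(*queries):
    return lambda doc_words: any(q(doc_words) for q in queries)


def NOT(query):
    return lambda doc_words: not query(doc_words)


def search(query, documents):
    """Identifiers of the documents that satisfy the query, in collection order."""
    return [doc_id for doc_id, text in documents.items() if query(words(text))]


if __name__ == "__main__":
    documents = {
        "d1": "A picture of the moon over the bay",
        "d2": "Apollo image of the moon",
        "d3": "Conferencing tools for libraries",
        "d4": "Moon phases table",
    }
    # "moon and (picture or image) and not Apollo"
    query = AND(term("moon"), OR(term("picture"), term("image")), NOT(term("Apollo")))
    print(search(query, documents))               # ['d1']
    print(search(term("conferenc*"), documents))  # ['d3']
```

Scanning every document for every query is workable only for tiny collections; a real library would evaluate the same expressions against a prebuilt index.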

 

Retrieval success for an interface is usually summarized with two metrics: "precision", the ratio of correct items retrieved to all those retrieved, and "recall", the ratio of correct items retrieved to all correct items in the library. "Correct" means those items that match the user's needs. Both recall and precision are important, but some applications may need to emphasize one more than the other.
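
Both metrics are simple ratios over sets of document identifiers. A minimal Python sketch follows; the function name and the sample identifiers are illustrative, and returning zero when a set is empty is a convention chosen here to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document identifiers.

    precision = correct items retrieved / all items retrieved
    recall    = correct items retrieved / all correct items in the library
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    retrieved = {1, 2, 3, 4}               # what the interface returned
    relevant = {1, 3, 5, 6, 7, 8, 9, 10}   # what the user actually needed
    print(precision_recall(retrieved, relevant))  # (0.5, 0.25)
```

In this example half of what was retrieved is useful, but only a quarter of the useful material was found. A broader query would typically raise recall at the expense of precision.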

 

Query interfaces require good indexing of the library. Indexing every word occurring in every page of the library may be too large a task (a phenomenon now occurring on the World Wide Web, though some search engines are making a valiant effort). Indexing generally excludes the most common and therefore least helpful words of a language (the "stop words", like "the" and "it"), and sequences of words need not be indexed as a whole if special codes indicate relative locations in documents, but still much indexing effort is required for any large library. The index should be stored using hashing (see Hashing) for fast lookup. It can also be valuable to recognize synonyms and generalizations of words; for instance, the query "ocean transport" should match the text "tanker in the Pacific". This requires an online thesaurus giving synonym and superconcept information for each sense of each possible word. Librarians have developed useful thesauri for manual indexing, and these can be implemented in software, but they traditionally classify entire books only and permit some imprecision in terminology to fit every book into a category. More precise is the WordNet software thesaurus system, which identifies relationships between distinct senses of large numbers of common words of natural languages; a variety of current research is exploring it and its variants to improve retrieval efficiency.
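
The indexing ideas in this paragraph can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production design: Python dictionaries serve as the hash tables, the stop-word list and the two-entry thesaurus are tiny invented stand-ins (a real system would use a full resource such as WordNet and distinguish word senses), and word positions are stored so that adjacent words can be matched without indexing whole phrases.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "it", "a", "an", "of", "in", "and"}  # tiny illustrative list

# Hypothetical toy thesaurus: each query word maps to itself plus its
# synonyms and more specific terms.
THESAURUS = {
    "ocean": {"ocean", "sea", "pacific", "atlantic"},
    "transport": {"transport", "ship", "tanker", "freighter"},
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map word -> document -> list of word positions (stop words skipped)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word not in STOP_WORDS:
                index[word][doc_id].append(position)
    return index


def lookup(index, word):
    """Documents containing the word or any thesaurus variant of it."""
    docs = set()
    for variant in THESAURUS.get(word, {word}):
        docs.update(index.get(variant, {}))
    return docs


def search_all(index, query):
    """Documents matching every non-stop word of the query."""
    terms = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not terms:
        return set()
    result = lookup(index, terms[0])
    for word in terms[1:]:
        result &= lookup(index, word)
    return result


def phrase_docs(index, first, second):
    """Documents in which `second` immediately follows `first`."""
    found = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            found.add(doc_id)
    return found


if __name__ == "__main__":
    documents = {
        "d1": "A tanker in the Pacific",
        "d2": "Ocean currents and the weather",
        "d3": "The digital library of the ocean transport museum",
    }
    index = build_index(documents)
    print(sorted(search_all(index, "ocean transport")))      # ['d1', 'd3']
    print(sorted(phrase_docs(index, "digital", "library")))  # ['d3']
```

Note that "d1" is retrieved for "ocean transport" even though it contains neither word, because the thesaurus relates "Pacific" to "ocean" and "tanker" to "transport".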

 

Additional issues

 

Digital libraries are well suited to distributed processing because their contents are often relatively independent. Most documents will be used by only one user at a time, so coordination between distributed processes will seldom be needed. Hence cached pages will rarely be used by more than one user, and caching can be minimal. But popular or timely documents will be important exceptions.

 

Bookkeeping can record usage patterns and thereby set charges and optimize storage (see Electronic Commerce). Digital libraries permit more precise fees than traditional libraries since usage can be better monitored. For instance, royalties can be calculated based on the number of an author's pages read, or software may enforce upon purchase of an electronic document that only one person at the purchasing site can read it at a time. Different access restrictions for different groups of users (like those from copyright, privacy, or other legal concerns) can be more easily enforced. Nonetheless, many users want no costs and no restrictions at all on their usage of a digital library, following the models of public libraries and broadcast television in the United States. Thus we are increasingly seeing advertising on the pages provided by digital libraries as a way to recoup support costs.
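
The two bookkeeping examples above, royalties by pages read and a one-reader-at-a-time restriction, might look like the following Python sketch. Everything in it is hypothetical: the log format, the per-page rate, the class name, and the reader names are invented for illustration, and a real system would also need authentication, persistent storage, and timeouts for readers who never close a document.

```python
from collections import Counter


def royalties_in_cents(reading_log, author_of, cents_per_page):
    """Total royalties per author from a log of (document, pages read) entries."""
    pages_read = Counter()
    for doc_id, pages in reading_log:
        pages_read[author_of[doc_id]] += pages
    return {author: pages * cents_per_page for author, pages in pages_read.items()}


class SingleReaderLicense:
    """Lets only one reader at a time open a purchased document."""

    def __init__(self):
        self.current_reader = None

    def open(self, reader):
        """Return True if the reader may read the document now."""
        if self.current_reader is None or self.current_reader == reader:
            self.current_reader = reader
            return True
        return False

    def close(self, reader):
        """Release the document if this reader holds it."""
        if self.current_reader == reader:
            self.current_reader = None


if __name__ == "__main__":
    log = [("doc1", 10), ("doc2", 3), ("doc1", 5)]
    authors = {"doc1": "author_a", "doc2": "author_b"}
    print(royalties_in_cents(log, authors, cents_per_page=2))
    # {'author_a': 30, 'author_b': 6}

    license_ = SingleReaderLicense()
    print(license_.open("alice"))  # True
    print(license_.open("bob"))    # False: alice is still reading
    license_.close("alice")
    print(license_.open("bob"))    # True
```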

 

Multimedia in digital libraries is a subject of much current research [3] (see Multimedia Information Systems). Even texts like manuscripts must be stored as images if their appearance matters or no digital format is available. (Printed pages can be "scanned" and converted to text with "optical character recognition" software, but this averages only 95% correctness in recognizing characters today.) Since multimedia data can require much space, and their traversal of networks can take much time, compression and load-sharing techniques are especially important (see Multimedia Systems). A simple way to implement queries on non-text data is to find, index, and match on the captions. However, captions are not always available or clearly identified, and usually convey only a small part of the meaning of a multimedia object. Obtaining further information requires some kind of content analysis of the object (see Image Processing), and this can be time-consuming. Simple features of the data suffice for some needs; for instance, for decoration with images, we can search for a particular range of size, contrast, color, and so on. Video and audio entail difficult indexing problems, and usually require a partitioning algorithm looking for times of abrupt changes to the data, then separate analysis of each segment's content.
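
One simple way to partition video at abrupt changes is to compare a cheap feature of consecutive frames and cut wherever the difference is large. The Python sketch below assumes that each frame has already been reduced to a brightness histogram (a list of pixel counts per brightness range, with the same total for every frame) and that a normalized difference above 0.5 marks a cut. Both the feature and the threshold are illustrative choices, not a recommended setting.

```python
def histogram_difference(h1, h2):
    """Fraction of pixels that changed bins, between 0.0 and 1.0.

    Assumes both histograms count the same number of pixels.
    """
    total = sum(h1) or 1
    return sum(abs(a - b) for a, b in zip(h1, h2)) / (2 * total)


def segment(histograms, threshold=0.5):
    """Split a frame sequence into (start, end) index ranges at abrupt changes.

    Each range covers frames start .. end-1.
    """
    if not histograms:
        return []
    segments = []
    start = 0
    for i in range(1, len(histograms)):
        if histogram_difference(histograms[i - 1], histograms[i]) > threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(histograms)))
    return segments


if __name__ == "__main__":
    # Six frames of 100 pixels each, four brightness bins per frame.
    frames = [
        [90, 10, 0, 0],
        [88, 12, 0, 0],
        [85, 15, 0, 0],
        [5, 5, 40, 50],   # abrupt change: a new scene starts here
        [4, 6, 42, 48],
        [90, 10, 0, 0],   # abrupt change back
    ]
    print(segment(frames))  # [(0, 3), (3, 5), (5, 6)]
```

Each resulting segment can then be analyzed and indexed separately, for example by describing one representative frame. Gradual transitions such as fades would not be caught by a single-threshold test like this one.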

 

Bibliography

 

[1] William Y. Arms, Digital Libraries and Electronic Publishing, Cambridge, MA: MIT Press, 1999.

 

[2] Michael Lesk, Practical Digital Libraries: Books, Bytes, and Bucks, San Francisco, CA: Morgan Kaufmann, 1997.

 

[3] Mark T. Maybury, ed., Intelligent Multimedia Information Retrieval, Cambridge, MA: AAAI Press / MIT Press, 1997.

 

[4] Berthier Ribeiro-Neto and Ricardo Baeza-Yates, Modern Information Retrieval, Reading, MA: Addison-Wesley, 1999.

 

[5] Karen Sparck Jones and Peter Willett, Readings in Information Retrieval, San Francisco, CA: Morgan Kaufmann, 1997.

 

[6] --, Proceedings of the ACM Conference on Digital Libraries, New York: Association for Computing Machinery Press, 1996-.

 

Cross References

 

Client-Server Computing see Digital Libraries.

Data Mining see Digital Libraries.

Data Warehousing see Digital Libraries.

Distributed Databases see Digital Libraries.

File Systems, Distributed see Digital Libraries.

Internet see Digital Libraries.

Knowledge see Digital Libraries.

Multimedia Information Systems see Digital Libraries.

Network Architecture see Digital Libraries.

World Wide Web see Digital Libraries.

 

Dictionary Terms

 

Digital library: A digital archive of human-readable mostly-unstructured information.

 

Document: In digital libraries, a single independent aggregate of information, such as a book, article, image, recording, or video.

 

Electronic publishing: Providing a document in digital form via the Internet instead of printing it and distributing it; the basis of digital libraries.

 

Gigabyte: One thousand million 8-bit bytes of information.

 

Information retrieval: Accessing of mostly-unstructured digital data, usually via a query interface.

 

Megabyte: One million 8-bit bytes of information.

 

Multimedia: Text data mixed with non-text data such as images, audio, or video.

 

Optical disk: Digital storage technology using bit patterns made in a circular plastic disk that are read by a laser, generally a read-only technology. The same technology is used for compact-disk audio recordings.

 

Query: For information-retrieval systems and database systems, a description by a user of some data that they want retrieved. It may be formal or informal depending on the system.

 

Terabyte: One million million 8-bit bytes of information.

 

Thesaurus: Information, usually structured as a tree, describing the terms in an area of human knowledge, their type-subtype relationships, and their synonyms. This is essential for information retrieval from large digital libraries.