UCSD Libraries Inventory of Digital Projects

 

April 18, 2001

 

 

Submitted to:

Digital Library Innovation Team

 

Submitted by the Digital Library Inventory Task Force:

Geri Ingram, Library Systems Department

Christine Stuart, Art & Architecture Library

David Eyer, Library Systems Department

Chris Frymann, Software Engineering Department

 

Contents

Summary of project inventory

Issues for discussion and research

Appendices:

Mission and Goals of the Digital Library Program

Digital Library Inventory Task Force (Charge)

Survey form on the Web

Table of project responses

 

 


Summary of Project Inventory

 

Executive Summary

 

            The purpose of the data collection exercise was to quickly generate information useful in planning, potentially in the current budget-building cycle. Although these preliminary results are not definitive enough to meet this optimistic goal, they did identify several critical areas for further discussion and research. Because of the short timeframe set for the inventory, the Task Force did not attempt the rigorous research necessary to support conclusive, data-based decision-making. Instead, the Task Force considered a broad survey to be appropriate for this initial foray into planning digital library growth. Recognizing that there are no rigid definitions of the “digital library”, and intending to cast the net widely, the Task Force designed the survey to include all “digital” projects, regardless of the organization, provenance, location or use of the digital material. Predictably, the respondents described projects spanning the length and breadth of services and collections, from digital reformatting programs serving the ephemeral needs of document delivery, to full-resolution digitization of images and manuscripts in what is hoped is “archival” quality, suitable for repurposing and collection-building.

In summarizing the survey results, the Task Force made no attempt to categorize, analyze, or critique existing or proposed projects. Though introductory and therefore somewhat unwieldy, the background information nevertheless provides a snapshot of some of the important and innovative work undertaken today among departments and committees throughout the UCSD Libraries. Individual responses are available, but for the sake of brevity were not included here. The Task Force respectfully submits this summary with an eye toward stimulating discussion of the many issues critical to the effective progress of the UCSD Digital Library.

 


Method and “metrics”

A list of data elements of interest to the DLIT Steering Committee was used to construct a Web-based survey form. After conferring with the Steering Committee, the Task Force sent the URL of the Web survey to 24 individuals, inviting them to describe their various digital library projects. (See appendices.)

Because the survey asked questions about “collections”, responses expressed sizes in different ways: some reported numbers of images, others reported titles, and still others endeavored to measure in bytes the size of their digitized repositories. To get a sense of the data space itself, this summary applied estimates wherever file sizes were not given in MB: 8 MB was taken as the typical size of a jpg service image for a 35-mm slide, and 20 MB as the typical size of a full-resolution scan, e.g., a tiff image.
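The estimating rule described above can be sketched as a short calculation. This is only an illustration: the function name, the item counts in the example, and the 1024 MB-per-GB conversion are assumptions and do not come from the survey responses; only the 8 MB and 20 MB per-item figures are taken from this summary.

```python
# Per-item size estimates stated in this summary (in MB).
SERVICE_JPG_MB = 8      # typical jpg service image for a 35-mm slide
FULL_RES_SCAN_MB = 20   # typical full-resolution scan, e.g., a tiff image


def estimate_size_gb(service_images=0, full_res_scans=0, mb_per_gb=1024):
    """Estimate repository size in GB from item counts.

    The function name and the mb_per_gb default are hypothetical choices
    made for this sketch; the summary does not say which conversion was used.
    """
    total_mb = (service_images * SERVICE_JPG_MB
                + full_res_scans * FULL_RES_SCAN_MB)
    return total_mb / mb_per_gb


if __name__ == "__main__":
    # Hypothetical project: 1,000 service images and 500 full-resolution scans.
    print(f"{estimate_size_gb(1000, 500):.1f} GB")
```

A project that reported only item counts could be placed on the same scale as one that reported bytes in this way, which is the purpose the estimates serve in the sections that follow.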

 

Responses

To date there are 31 responses from 13 different Library units, each one concerning a single project. Of the 31 projects, 25 were reported to be “existing” rather than “new”. Eleven projects (9 existing) were reported from one department (Special Collections).

The Task Force received three reports describing the eRes and Avanti course reserve and document delivery projects. These existing projects, as well as the Consortial Borrowing Software product soon to be provided by the CDL, reflect a very significant commitment by the UCSD (and UC) Libraries in support of both instructional and research programs. The sizes of the repositories, however, are not reflected in the data summary, as the scanned files are not currently stored permanently, either offline or online, at UCSD Libraries. Perhaps needless to add, all of these projects do engage in considerable scanning activity, and therefore raise critical issues within the context of resource allocation, budgeting, workload and service priorities.

 

Data space

An estimated 405 GB of digital library data is currently stored at UCSD, but it is difficult to know how much of this is online, and in some cases, where the data actually resides, at any time. For example, the larger repositories are served from the San Diego Supercomputer Center (SDSC) and Academic Computing Services (ACS). 200 GB resides at the SDSC in the form of the Pacific Rim Library (PRL), a component of the Pacific Rim Digital Library Alliance (PRDLA). The PRL repository is predicted to grow to 2-3 terabytes within two years.
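To give a sense of what the PRL prediction implies, the following sketch computes the yearly growth factor needed to go from 200 GB to 2-3 terabytes in two years. It assumes smooth compound growth and 1 terabyte = 1000 GB; both are assumptions made for this illustration, and the function name is hypothetical.

```python
def annual_growth_factor(current_gb, target_gb, years):
    """Yearly multiplier needed to grow from current_gb to target_gb.

    Assumes steady compound growth, which is a simplification made for
    this sketch and is not a claim from the survey responses.
    """
    return (target_gb / current_gb) ** (1 / years)


if __name__ == "__main__":
    # 200 GB today; 2-3 terabytes predicted within two years (1 TB = 1000 GB assumed).
    low = annual_growth_factor(200, 2000, 2)
    high = annual_growth_factor(200, 3000, 2)
    print(f"Implied growth: {low:.2f}x to {high:.2f}x per year")
```

Under these assumptions the repository would need to more than triple in size each year, which helps explain why storage planning was of interest to the Task Force.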

There are approximately 22 GB of data stored for the Digital Audio Reserve Project (DARP) at ACS. Although Scripps Institution of Oceanography (SIO) activity is significant and SIO contributions are a high priority for UCSD’s digital library, their files were not sized at all in this summary; they currently reside on an ACS server.

Library servers also house a fair amount of data. For example, the estimated uncompressed size of the Social Sciences Data Collection (SSDC) is 100 GB, while the Digital Image Reserve (DIR) is predicted to need approximately 200 GB to function reliably for course reserve purposes.

For all 23 existing projects, estimated total sizes ranged from 10 MB to 200 GB. While a few smaller digitized (Special) Collections were reported to be finite and static, most are considered growing repositories of digitally reformatted material.  Several projects estimate a repository size between 25 and 60 GB. 

 

Data types and file formats

Among the data types stored are images, text/numeric data, sound and metadata itself. Image formats include tiff, jpg and gif. Text is stored as html, pdf, SGML and XML. Sound may be wav, au or the proprietary “Liquid Audio” (lqm) format. (Although no respondents reported on map digitization or geospatial information (GIS) projects, the Task Force notes that the Libraries have begun to acquire equipment and resources to support numeric data types for use in GIS and mapping as well.)

 

Lifespan of digital project

            “Lifespan” was variously interpreted as data permanence or project duration. In almost all existing cases the projects were described as permanent; in some cases the data was described as ephemeral. For example, the Avanti project is considered “permanent” (though the service may be superseded), while the data is not stored longer than 17 days. Only in a few cases were the projects themselves described as “pilots” or interim solutions. In the case of PRL, the respondent indicated that the lifespan depends upon financial support: it “could be permanent.”

 

Public audience

Faculty, students, and researchers worldwide are using UCSD materials. Though most projects cited discipline-based audiences (e.g., “the Sciences” or “Arts and Humanities”) several projects listed interdisciplinary and general (academic community) audiences. Special Collections audiences are “universal and specialized” and SIO projects’ audiences are described as “worldwide”, “scientists and public.” In one new project the projected audience consists of “patients, families, UCSD clinicians: physicians, nurses, pharmacists.”

 

Intellectual Property

Five of the projects involve material whose intellectual property attributes have caused us to use IP (Internet Protocol) address and password restriction methods. All of the SIO projects are copyrighted by SIO or UC. All of the Special Collections materials are in the public domain or “rights have been acquired”. (The survey did not attempt to analyze the level or method of rights-transfer relied upon.)

 

Uses

Users in every field are downloading files, reading/analyzing/manipulating/listening to files online, and in most cases, printing from the digitized repositories. The survey did not ask for usage statistics.

 

Platform considerations: display, backup and networking needs

Some of the projects depend upon high-quality delivery: display specs for sound include high-performance sound cards and T1 speeds. PRL data requires the ability to display, search and store CJK. EAD files require special viewers to display the SGML encoded finding aids, and even XML is not supported in standard fashion by today’s browsers. Large, dense, numeric datasets such as those comprising the Social Sciences Data Collection demand high bandwidth and state-of-the-art processing and display stations.

Network bandwidth requirements are anticipated to grow for images, video, numeric and audio files.

Few of the projects require a specific platform for building, delivery or backup of the data itself. Unix is a preferred OS for the management of some projects (SSDC) while Macintosh is the platform of choice for the DARP. Sometimes the issue seems to be the sophistication of the technology; sometimes it is rather an issue of staff availability and expertise with the desired platform. Since the survey did not ask about metadata or the integration of services for public access, no data about interoperability was gathered.

 

Data integrity/security

All existing project reporters indicated that they copy tiff images (and pdf text files) to CDs as a way of preserving a copy offline. All data are considered to “need” archival quality backups and 100% of the project respondents said their services require 7x24 uptime guarantees. That is, the respondents view the services and collections provided as critical to the unit’s core mission.

 

Staffing and Data gathering

            For existing projects the question of staffing resources was answered by enumerating what percentage of FTE provided scanning services for the project. Typically a project performs all scanning or data acquisition in-house (DARP, DIR), though a few projects acquire digital resources from vendors or partners (SSDC, PRL).

Some, especially those projects devoted to document delivery, ILL and/or course reserves, reported a growing need for additional staff to take on the scanning tasks and to develop “in house” scanner expertise. No existing projects reported outsourcing Library materials to non-UC vendors for digital reformatting, although at least one proposed project plans to do so. Librarians, Library Assistants (levels 3, 4, 5), Programmers, Computer Resource Specialists and Students (level 4) were included in the staff classifications of those who support the digital library projects.

SSDC reported needing additional programming support merely to keep pace with “new technologies and data distribution.” The DARP project reported one Recording Engineer, a project coordinator, and staffing support from the Music Public Services Supervisor.

 

Wireless delivery capability

            For most collections, this has not yet surfaced as an issue. In the case of source and target workstations at a (continental) distance from one another (PRL, some SIO), more “traditional” wireless technologies such as microwave and satellite communications are already in play.

 

New projects

 

            Two small imaging projects are planned by Special Collections (which reported 9 existing projects).

            Streaming media is envisioned for materials in the Film and Video collection, converting 16 mm film through analog video to digital (mpeg) files for display and archival storage. Some equipment and student hours would be needed to complete two projects in support of instruction and research. Collections of interest are newsreels from the 1950s and 1960s, and 20 hours of moving images from 1920-1970 in California.

            A patient education database was described as having the potential to serve 10,000 documents over the next 3 years, possibly including licensed content. The respondents noted that this repository is also supported (input, archived) by UCSD Healthcare Patient Education Staff. It is apparently being built on a School of Medicine server, and has enjoyed the services of some SED staff in building the index.

            Pending grant funding, a collaborative project with the San Diego Historical Society is proposed. Some 5,000 photos and approximately 23 oral histories may be digitized, along with serial pages and a 30-year dataset on fish catch statistics. From SIO Archives would come photos, films, books, ships’ logs, expedition reports, letters and diaries (and possibly specimens from the SIO Fish Collection). Outsourcing and staffing funds are requested in the grant.


Observations

 

While the inventory was intended to elucidate hardware and software needs, it was much more successful at uncovering implicit trends, dependencies and “self-organizing” principles at work in our innovative, service-oriented environment. Not enough data was gathered to be useful as a decision-making aid for the current planning cycle. Overall the usefulness of the exercise will be proven by the conversations generated by it.

 

By way of example, as the Task Force began its work, there was interest in planning coherently for increased storage and network capacity. The survey did provide a sense of the size and growth expectations for current collections. But is it true that “bandwidth is infinite, and storage is free”? While not exactly accurate, this urban legend reflects a sort of “anthem” for the digitization trend, and one that bears some discussion.

 

It is true that storage devices and media are decreasing in cost to the point where they are becoming negligible, and bandwidth increases with every technological advance, but keeping up with modern network technologies is not exactly inexpensive. Moreover, there are “hidden” ongoing costs in terms of staff resources and technology advances (and replacement) inherent in every project. Compared to the cost of organizing the data and ensuring permanence and discoverability (not to mention preservation and migration of the media and technologies), storage and bandwidth needs are not, in and of themselves, costly.

 

Metadata and standards

No data was requested about the metadata used to describe the collections digitized in any of the projects. For some of the projects, the respondent noted a desire for a database management program to ensure the consistency and usability of the metadata used to describe the digitized objects, or simply to aid in manipulating the data.

 


Issues for Discussion and Research

 

Ongoing and proposed projects clearly support the Mission and Goals of the Digital Library (see attached for reference). In the description of each of the projects, it is clear that UCSD Libraries is committed to and invested in building digital services and collections to meet faculty and student instructional and research needs. Planning for stability and growth requires much more research into all the questions posed on the survey. This brief, libraries-wide overview of the projects raises some larger questions of strategic direction. Some of these might be:

 

What is the Libraries’ role in instructional support?

 

Under what circumstances do we act as a digital reformatting “service bureau”?

 

Should we provide digitization-on-demand? (“Free?”)

 

Who makes collection development decisions for digital reformatting?

 

What role do we want to play in standards development and promulgation? Do the (prevailing and CDL) standards and best practices guides provide practical decision trees to determine how to treat our various projects and collections?

 

What is our intent with regard to digital archiving and preservation of the materials we reformat?

 

Do we play a role as co-publisher with faculty? What is our intent vis-à-vis the “scholarly communication dilemma”?

 

Are we concerned with capturing the “born digital” knowledge created by our faculty?

 

Do we have a research and development role to play in digital library development? (E.g., developing collections with new “behaviors” to create new knowledge, experimenting with preservation strategies…)

 

Do we foresee reliance upon increased funding or on reallocation of existing resources in order to foster digital library program growth?

 

With regard to the reported existing and proposed projects:

 

Are we doing enough to minimize redundancy and optimize partnerships?

 

How do we manage digitized objects and their metadata more efficiently?

 

Are appropriate resources allocated to the Digital Libraries’ highest priorities?

 

Is there a clear development path for “innovative projects-to-ongoing-programs”?

 

How can we maximize gains in services and collections?

 

How can we provide discovery tools of the highest order and relevant descriptions?

 

How can we ensure scalability and supportability?

 

What, if anything, should we standardize and/or centralize in order to ensure scalability, repurposing and sharing of resources?