Atkinson, Roderick D., and Laurie E. Stackpole. "TORPEDO: Networked Access to Full-Text and Page-Image Representations of Physics Journals and Technical Reports." The Public-Access Computer Systems Review 6, no. 3 (1995).


1.0 Introduction

The Naval Research Laboratory (NRL) Library and the American Physical Society (APS) are experimenting with electronically disseminating journals and reports over NRL campus networks. The project is called TORPEDO (The Optical Retrieval Project: Electronic Documents Online). It involves storing and disseminating two APS journals (Physical Review Letters and Physical Review E) as well as the NRL collection of unclassified, unlimited distribution technical reports. These paper-format journals and reports are scanned at NRL to create CCITT Group IV image files; the image files are converted to ASCII files using OCR; both types of files are associated with bibliographic information; and all of the files are then imported into a client/server-based commercial imaging system.

2.0 Participating Institutions

The NRL Library and the APS have been actively exploring the potential of electronic information dissemination through a variety of projects.

2.1 The Naval Research Laboratory Library

Created in 1923 by Congress for the Department of the Navy on the advice of Thomas Edison, NRL is the Navy's corporate research and development laboratory. The Ruth H. Hooker Research Library and Technical Information Center (NRL Library) addresses the information needs of the NRL research community, which consists of about 3,500 Federal staff and about 1,500 contractors at the Washington, D.C. facility.

NRL occupies a 130-acre campus of 152 buildings located on the Potomac River in Southwest Washington, D.C. Research facilities are also located in Orlando, Florida; Bay St. Louis, Mississippi; and Monterey, California. The Library also serves NRL's parent organization, the Office of Naval Research (ONR), in nearby Arlington, Virginia.

The research efforts of the Laboratory are concentrated in 17 broad areas: acoustics, advanced space sensing, artificial intelligence, astrophysics, biotechnology, chemistry, condensed matter science, information technology, materials research, optical sciences, plasma physics, radar and electronics, radiation technology, remote sensing, space science, space systems, and structural dynamics.

The NRL Library has been at the forefront of the initiative to move toward a totally digital library. The Library began actively scanning its technical reports collection in 1988, and it has been scanning close to 10,000 pages a day since early 1993. In addition, the Library has supported a campus-wide information system, called the InfoNet, since 1992. InfoNet provides NRL/ONR researchers with desktop access to commercial and noncommercial online services on the Internet, the NRL Library's online catalog, NRL resources, CD-ROM databases, and electronic books.

2.2 The American Physical Society

The American Physical Society (APS) is an organization of more than 43,000 physicists worldwide. The APS publishes several major physics research journals: the Physical Review series, Physical Review Letters, and Reviews of Modern Physics. It organizes scientific meetings where new results are reported and discussed. In addition to these primary functions, the Society has many other programs in areas such as education, international affairs, public affairs, and public information.

Since its founding in 1899, the primary purpose of the APS has been to advance the knowledge of physics. Recently the APS became quite active in projects to disseminate its journals electronically. In addition to working with NRL, the APS is involved in several electronic journal dissemination projects, including the development of an archive of the Physical Review at Los Alamos National Laboratory and the dissemination of its flagship publication, Physical Review Letters, through OCLC. As part of its commitment to electronic publishing, the APS will utilize SGML for the production of all of its journal publications.

3.0 Project Goals

By working together to disseminate scientific journals electronically, the NRL Library and the APS hope to determine:

  1. The attitudes of scientists toward electronic information.

  2. How the attitudes of APS members compare with those of nonmembers.

  3. The feasibility of disseminating journals in image format over campus networks and the Internet.

  4. Researcher preferences for electronic format options (e.g., images versus page-definition files).

  5. The desirable features of future electronic journal systems.

  6. How publishers and libraries can most effectively cooperate in making electronic journals available to scientists, and how they can effectively integrate them with other materials.

  7. What kinds of controls can be used to prevent unauthorized users from accessing the system.

4.0 Project Implementation

TORPEDO is being implemented in three phases. The first phase was completed between January and April 1995. The second phase began in May 1995. The third phase will begin in July 1995.

The three phases of the project are:

  1. Local access from end-user workstations in the NRL Library.

  2. Remote access from anywhere on NRL's Washington campus network from any of the supported computing platforms (Microsoft Windows, Macintosh, and X Window System workstations).

  3. Internet access from the campus networks of the other NRL research units and from the Office of Naval Research. Dial-up access will also be provided for researchers who are working at home or traveling.

5.0 TORPEDO Access

End-user access to TORPEDO is provided through the NRL Library's World-Wide Web home page (http://infonext.nrl.navy.mil). The Library's home page provides documentation to assist end-users in learning about the TORPEDO project, provides access to the home pages of the associated participants (APS and Los Alamos National Laboratory), permits the downloading of freely distributable client software and user guides, and serves as the point of access for TORPEDO. The computer workstation requirements for accessing TORPEDO are identical to those for running NCSA Mosaic.

To deliver electronic versions of journals and technical reports to end-users, TORPEDO uses a commercial imaging software package from Excalibur Technologies called EFS. EFS is predominantly client/server based and comes with freely distributable client software for Microsoft Windows and Macintosh workstations. UNIX access to EFS is currently supported through an X Window System interface, and a true UNIX software client is scheduled for release in the next major EFS upgrade.

6.0 APS Journals

APS forwards issues of Physical Review Letters and Physical Review E via overnight mail to the NRL Library as these issues come off the press. Simultaneously, APS sends the bibliographic data associated with the articles to the NRL Library via electronic mail. The NRL Library scans the journals using a Pentium PC running Microsoft Windows and a duplex scanner. The scanner has an autofeeder and is rated at 20 pages per minute. Images are stored on a Novell NetWare 3.12 server in CCITT Group IV TIFF format. The images are converted to ASCII form using optical character recognition (OCR). As part of the process, the images are deskewed and enhanced. The OCR process is done on a Pentium PC using two software packages that run under Microsoft Windows: the Avatar EnMasse! batch image capture and conversion software and its bundled Calera WordScan Plus software.

Throughout this process, a technician feeds the batch scanner, reviews the OCR process to ensure that text columns are correctly identified, and separates sets of journal image files into distinct articles.

Once the files have been scanned and converted to ASCII form, they are automatically moved from a Novell file server to a SUN SPARCstation 20 using a networked 486 PC running Microsoft Windows. The intermediary PC is required because the directory structure used by Avatar to store the images and ASCII text is different from that used by EFS. In addition, this PC provides the EFS database with updated names for the files, adds the files into an appropriate tree structure, verifies the integrity of the images, and associates the files with the bibliographic data used for field searching.
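The renaming and tree-building step can be pictured with a short Python sketch. It is only an illustration: the article does not describe the actual Avatar or EFS directory layouts, so the flat batch file names, the journal/volume/issue/article tree, and every name in the code (ArticleRecord, target_path, plan_moves) are assumptions made for the example. The sketch plans the moves without touching any files, and it leaves out the integrity check.

```python
# Illustrative sketch only: neither layout below is the real Avatar or EFS
# layout, and all class, function, and field names are hypothetical.
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ArticleRecord:
    """Bibliographic data for one article (hypothetical fields)."""
    journal: str
    volume: int
    issue: int
    first_page: int
    title: str


def target_path(record: ArticleRecord, page_index: int, suffix: str) -> PurePosixPath:
    """Build a journal/volume/issue/article path for one page file."""
    return PurePosixPath(
        record.journal,
        f"v{record.volume:03d}",
        f"n{record.issue:02d}",
        f"p{record.first_page:05d}",
        f"{page_index:04d}{suffix}",
    )


def plan_moves(batch_files: list[str], record: ArticleRecord) -> dict[str, PurePosixPath]:
    """Map each flat batch file name to its place in the tree.

    Files that share a stem (for example an image and its OCR text) are
    treated as the same page; pages are numbered in sorted stem order.
    """
    stems = sorted({PurePosixPath(name).stem for name in batch_files})
    page_of = {stem: index for index, stem in enumerate(stems, start=1)}
    moves = {}
    for name in batch_files:
        path = PurePosixPath(name)
        moves[name] = target_path(record, page_of[path.stem], path.suffix.lower())
    return moves
```

For example, given a hypothetical record for volume 74, number 3, starting at page 123, a batch file named 00000001.TXT would be planned for prl/v074/n03/p00123/0001.txt. Keeping the plan separate from the actual file copy makes it easy to review the mapping before anything is moved.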

Image and ASCII files are then imported into the EFS database and stored on 5 1/4" multifunction (read-write) optical disks. The optical disks themselves, 32 in all, are housed inside one Hewlett-Packard jukebox. Each formatted optical disk stores 1 GB of data. Each page-image file is approximately 55 KB and each ASCII file is about 20 KB. Importing the files into EFS is done overnight and requires no operator intervention except to initiate the process.
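As a rough illustration of what these figures imply, the following Python sketch estimates how many pages fit on one disk and in the full jukebox. It assumes decimal units (1 GB as 1,000,000,000 bytes and 1 KB as 1,000 bytes), one image file and one ASCII file per page, and no file system or database overhead. None of these assumptions comes from the project itself, and the constant and function names are illustrative.

```python
# Back-of-the-envelope storage estimate from the figures quoted above.
# Assumptions (not from the article): decimal units, one image file plus
# one ASCII file per page, and no file system or database overhead.

DISK_BYTES = 1_000_000_000      # 1 GB per formatted optical disk
DISKS_IN_JUKEBOX = 32
IMAGE_BYTES_PER_PAGE = 55_000   # about 55 KB per page image
ASCII_BYTES_PER_PAGE = 20_000   # about 20 KB per ASCII file


def pages_per_disk() -> int:
    """Whole pages (image plus ASCII) that fit on one disk."""
    return DISK_BYTES // (IMAGE_BYTES_PER_PAGE + ASCII_BYTES_PER_PAGE)


def pages_per_jukebox() -> int:
    """Whole pages that fit in the jukebox, filling each disk separately."""
    return pages_per_disk() * DISKS_IN_JUKEBOX


if __name__ == "__main__":
    print(pages_per_disk())
    print(pages_per_jukebox())
```

Under these assumptions, one disk holds about 13,300 pages and the 32-disk jukebox about 426,000 pages. The real capacity would differ with actual file sizes and system overhead.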

As the image and ASCII files are imported into the EFS system, they are indexed. The index itself is stored on a 9-GB hard disk. Each morning the SUN server shuts down and reboots so that the new records can be searched by end-users.

The entire process of scanning the paper journals, converting the scanned files to ASCII form, and importing the files into EFS can be performed in one 24-hour period. This means that end-users can have a current issue of a journal electronically available at their workstations one day after it is received by the NRL Library.

7.0 NRL Technical Reports

As part of an ongoing project begun in 1988, the NRL Library has already scanned over 100,000 unclassified technical reports in its collection. These reports are presently stored on 12" WORM optical disks housed in a 50-platter Sony jukebox.

The system used to display these reports, Genesys ImageExtender, is a commercial PC-client-based, IPX/NetBIOS protocol system. Linking the ImageExtender product to an existing catalog produced a system that offers extensive field searching, but is not easily scaled to meet the needs of TORPEDO end-users in a wide-area networked environment. EFS, on the other hand, more closely fits the wide-area network imaging needs of the NRL/ONR community because of its native TCP/IP support, client/server configuration, and multiplatform support. Therefore, those unclassified technical reports that have no distribution restrictions (i.e., unclassified, unlimited documents) are imported into EFS after going through the same processing as the APS journals. These reports are added to the EFS database with their own hierarchy so that they can be searched by end-users either independently or in combination with the journals.

8.0 Searching

End-users can retrieve information from the TORPEDO system using direct, content, and field searching techniques.

8.1 Direct Searching

A direct search is used when the end-user is looking for a specific citation or is simply browsing the collection. End-users move through a hierarchical menu structure for journals and reports to find specific documents of interest. In the case of journals, the tables of contents are presented as the first article in the appropriate volume series to facilitate browsing.

8.2 Content Searching

A content search examines the full text of all documents to find the word or phrase entered by the end-user. The end-user also has the option of limiting content searches to particular journals or reports, volumes or issues, or any combination thereof. Boolean operators may also be used in content searching, although the format used by EFS for Boolean searches is not intuitive.

EFS supports a fuzzy full-text retrieval concept called Adaptive Pattern Recognition Processing (APRP). APRP retrieves documents by recognizing data patterns at a binary level. As a result, the data itself directs the creation of indexes that are highly fault tolerant, which makes it possible to retrieve information accurately from an approximation of query terms or phrases. Because the EFS retrieval software seeks patterns rather than exact words or phrases, users can accurately search "dirty" ASCII (raw OCR-processed text) without the need for ASCII cleanup or rekeying.
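The general idea of matching on patterns rather than exact words can be illustrated with a much simpler, well-known technique: character trigram overlap. The Python sketch below is not APRP, whose internals are not described here; it is a stand-in that shows how a query can still find a word that OCR has corrupted. The function names and the 0.5 threshold are assumptions chosen for the example.

```python
# Illustrative stand-in for fault-tolerant retrieval: character trigram
# overlap scored with the Dice coefficient. This is NOT the APRP algorithm.


def trigrams(text: str) -> set[str]:
    """Return the set of lowercase character trigrams in text, with padding."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(query: str, word: str) -> float:
    """Dice coefficient between the trigram sets of two strings (0.0 to 1.0)."""
    a, b = trigrams(query), trigrams(word)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def fuzzy_find(query: str, ocr_text: str, threshold: float = 0.5) -> list[str]:
    """Return the words in ocr_text whose similarity to query meets the threshold."""
    return [word for word in ocr_text.split() if similarity(query, word) >= threshold]


if __name__ == "__main__":
    dirty = "The superc0nductor sample was co0led"
    print(fuzzy_find("superconductor", dirty))
```

Here the query "superconductor" still matches the OCR error "superc0nductor" because most of their trigrams agree, while unrelated words fall below the threshold. An exact-match search would miss the corrupted word entirely.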

8.3 Field Searching

EFS supports field searching. No more than 256 characters can be entered into a field, and the fields established for all journals and reports must be the same. Bibliographic data for both the APS journals and technical reports is being added to TORPEDO. End-users will soon be able to search documents for specific authors, titles, or years.
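The two constraints described above (a 256-character limit and one shared set of fields) can be sketched as a simple record check in Python. This is a hypothetical illustration, not EFS code: it assumes the limit applies to stored field values, and the field names author, title, and year are taken from the search options mentioned above rather than from any documented EFS schema.

```python
# Hypothetical sketch of the two field constraints; not EFS code.
MAX_FIELD_CHARS = 256

# Assumed shared schema for journals and reports alike.
FIELDS = ("author", "title", "year")


def validate_record(record: dict[str, str]) -> list[str]:
    """Return a list of problems found in one bibliographic record."""
    problems = []
    # Every record must carry the same set of fields.
    for name in FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
    # No field may be outside the schema or exceed the character limit.
    for name, value in record.items():
        if name not in FIELDS:
            problems.append(f"unknown field: {name}")
        elif len(value) > MAX_FIELD_CHARS:
            problems.append(f"field too long: {name} ({len(value)} characters)")
    return problems
```

A record with all three fields and short values yields an empty list, while a record with an overlong author list or a missing year is flagged before import. Long author lists are the most likely values to hit such a limit, so a check of this kind would catch them early.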

9.0 Electronic Journal Publishing

The APS is now in a position to make Physical Review Letters available to the NRL Library in SGML format on a regular basis, thereby eliminating the need to scan and OCR the paper copies. Moreover, Physical Review E is now partially available in SGML and will soon be available entirely in SGML, as will all of Physical Review A through Physical Review D. While the EFS imaging system chosen by the Library for TORPEDO has no native support for SGML (EFS only supports CCITT Group IV TIFF files and ASCII), it does have support for third-party image, word processing, and SGML display applications. In fact, EFS can import files in almost any format and integrate all of them into one full-text and field searchable database. The Library is presently investigating the software viewers of several SGML product vendors for possible integration with TORPEDO.

10.0 Summary

The NRL Library and the APS have made significant strides in making collections of physics journals and technical reports available over networks to a large community of geographically dispersed researchers who utilize a myriad of computing platforms. While these journals and reports currently originate in paper format and are being converted into images only after publication, a fully electronic publication system is coming closer to production. The lessons learned and the end-user feedback coming from the TORPEDO project will have a critical impact on the direction the APS pursues in its electronic publishing efforts as well as the methods ultimately adopted by the NRL Library in its quest to provide its research community with a comprehensive digital library.

About the Authors

Roderick D. Atkinson, Electronic Resources Coordinator, Naval Research Laboratory, Code 5220, Washington, DC 20375-5334. Internet: rod@library.nrl.navy.mil.

Laurie E. Stackpole, Chief Librarian, Naval Research Laboratory, Code 5220, Washington, DC 20375-5334. Internet: lauries@library.nrl.navy.mil.


Article Formats

This article is available in both ASCII and HTML formats.

Network Access

ASCII File

List Server: send the e-mail message GET ATKINSON PRV6N3 F=MAIL to listserv@uhupvm1.uh.edu

Gopher: gopher://info.lib.uh.edu:70/00/articles/e-journals/uhlibrary/pacsreview/v6/n3/atkinson.6n3

HTML File

World-Wide Web: http://info.lib.uh.edu/pr/v6/n3/atki6n3.html

Publication Information

The Public-Access Computer Systems Review is an electronic journal that is distributed on the Internet and on other computer networks. It is published on an irregular basis by the University Libraries, University of Houston. There is no subscription fee.

To subscribe, send the following e-mail message to listserv@uhupvm1.uh.edu: SUBSCRIBE PACS-P First Name Last Name.

To retrieve the cumulative index for the journal, send the following e-mail message to listserv@uhupvm1.uh.edu: GET INDEX PR F=MAIL.

PACS Review back issues (ASCII and HTML files) are available from the University of Houston Libraries' World-Wide Web server: http://info.lib.uh.edu/pacsrev.html.

Back issues (ASCII files only) are also available from the University of Houston Libraries' Gopher server: info.lib.uh.edu, port 70.

Copyright

This article is in the public domain.

The Public-Access Computer Systems Review is Copyright (C) 1995 by the University Libraries, University of Houston. All Rights Reserved.