
Researching In the Digital Domain

ADVANCES IN HISTORICAL RESEARCH


Historical research has traditionally been constrained to institutional
settings such as libraries or archives, operated by universities or
government agencies. The National Archives and Records
Administration (NARA) and libraries like Stanford and University of
California at Berkeley offer obvious examples of the places hosting
large collections of interest to historical researchers. While some of
these libraries/archives have been open to the public, frequently the
books, newspapers and other documents have been accessible only to
university staff or qualified researchers. Moreover, people involved in
teaching have found that research often required travel to remote
locations, frequently restricting their research to the summer months
when classes were not in session. Given these conditions, the general
public has, for the most part, been unaware of these materials, and
would not have been allowed access to them in any case.
Research opportunities have improved greatly for historians and the
public over the past ten years. Historical documents and books have
been digitized in great numbers, so that many of the primary
documents that record our country's history are now available via the
Internet. As a result, historical research is now an activity that is
open to anyone with an Internet connection and an interest in
learning about the past.
Examples of Internet-accessible Historical Resources:
California Digital Newspaper Collection--California, US, World
http://cdnc.ucr.edu/cgi-bin/cdnc
Google/Newspapers--Mostly US, Some World
news.google.com/newspapers
Google/Books--California, US, World
books.google.com
The Internet Archive--California, US, World
www.archive.org
Stanford Daily--Stanford, Palo Alto, California, US, World
http://stanford.dlconsulting.com/cgi-bin/stanford
Library of Congress Newspaper Collection--US
www.loc.gov
JSTOR--Academic Paper Archive
www.jstor.org
The Avalon Project--Documents in Law, History and Diplomacy
http://avalon.law.yale.edu/
These are but a few of the thousands of sites that now offer historical
information on-line. It goes without saying that thousands more sites
will appear in the future.
FROM PAPER TO DIGITAL
Microfilm
Researching historic documents, particularly newspapers, has required
either having access to a collection of actual newspapers or to
microfilm of the collection. While convenient for libraries/archives,
microfilm is anything but convenient for research. The only way to find
anything on a reel of microfilm is to look at every frame. Many people
can barely take an hour or two of squinting at the screen of a
microfilm reader before they complain of eye strain--or even
seasickness--and have to call it a day. Once something of interest is
located on microfilm, the researcher must either make a copy using the
on-board printer or take notes with pen and paper.
While microfilm readers allow users to fast-forward or fast-reverse in
order to facilitate searching, they provide no ability to locate specific
information (text or pictures).
Scanners
A scanner is like a copy machine, but instead of copying paper to
paper, it scans paper or microfilm and converts it to digital format,
i.e., an electronic form that can be stored, read and manipulated by
computer.
Digital images (text and pictures) offer researchers the ability to access
documents remotely over the Internet. But even these electronic
documents require an image-by-image visual review to find specific
information.

Optical Character Recognition (OCR)


Optical Character Recognition takes scanned images one step further.
OCR software is able to read scanned text and convert it to searchable
digital text. The success of this process depends heavily on the quality
of the image and the size/typeface of the characters on the original
documents. Good quality originals produce text with low error rates. To
obtain clean readable text from poor originals, human cleanup is
generally required.
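The notion of an OCR "error rate" can be made concrete with a small
calculation. The sketch below is one common way to measure it, not a
description of any particular OCR product: it counts the single-character
edits needed to turn the OCR output into the corrected text and divides
by the length of the corrected text. The function names and the sample
strings are hypothetical, invented only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Edits needed per character of the corrected text."""
    if not corrected_text:
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)


# Hypothetical sample lines, not taken from any real newspaper scan.
ocr_line = "Tbe steamer arrivcd at San Francisco"
corrected_line = "The steamer arrived at San Francisco"

rate = character_error_rate(ocr_line, corrected_line)
print(f"Character error rate: {rate:.1%}")  # 2 wrong characters out of 36
```

A rate of a few percent sounds small, but it means several damaged
words per paragraph, which is enough to hide names and places from a
text search.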
Advances in Software
Veridian, developed by Digital Library Consulting, New Zealand, is an
easy-to-use software product for managing digitized newspapers. It
displays images of entire newspapers and also offers the ability to:
Select specific newspapers and editions by date
View the table of contents for each edition
Download specific articles (images and text)
Search text by word or name
Add user comments and correct text
These capabilities allow researchers quick access to specific pages,
topics and words, without having to do an image-by-image search as
with microfilm.
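To see why searchable text makes such a difference, consider a minimal
sketch of a word index, the general technique behind most text search.
This is an illustration only and makes no claim about how Veridian is
implemented. The article identifiers and texts are hypothetical, and
one of them contains a simulated OCR error ("Stearner" for "Steamer").

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(articles: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in tokenize(text):
            index[word].add(article_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of articles containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical articles; the third simulates an OCR misreading.
articles = {
    "call-1894-01-18-a1": "The steamer arrived at San Francisco",
    "union-1880-05-02-a4": "Railroad opens new line to Sacramento",
    "call-1894-01-19-a7": "Stearner delayed by fog near San Francisco",
}

index = build_index(articles)
print(search(index, "san francisco"))  # both Call articles
print(search(index, "steamer"))        # only the cleanly scanned article
```

The second search misses an article that a human reader would find at
once, which is exactly why uncorrected OCR errors limit the value of a
digitized collection.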
User Text Correction
As previously mentioned, OCR errors can occur when documents are
scanned. It is not uncommon to find so many OCR errors that the
resulting text is unreadable, making it unusable for research purposes.
One of the features that distinguishes Veridian software from other
systems is its ability to allow users to correct OCR errors. This text
correction feature allows individuals to participate in restoring the
accuracy of newspapers in a searchable digital format, as shown below.

[Figure: Digitized page from the January 18, 1894 San Francisco Call]
The Veridian software breaks scanned images into chunks, which are
presented to the user in the windows shown above. The right window
contains an image of an article. When a scanning error is found, the
text is corrected in the left window and then saved.
Over time, these individual corrections add up, and the text of the
digitized papers will match the originals. Any article can be
downloaded for inclusion in research materials (typically in a word
processor document).
THE CALIFORNIA DIGITAL NEWSPAPER COLLECTION
For people interested in California history--particularly from the
day-by-day, boots-on-the-ground point of view newspapers provide--there
now exists a very extensive collection of California newspapers on-line,
hosted by the Center for Bibliographic Studies and Research at UC
Riverside. This collection contains over 73,000 issues comprising over
600,000 pages and over 6.8 million articles--dating from 1846 until
1922 (intellectual property published after 1922 is generally protected
by US copyright law).
This collection is supported in part by the U.S. Institute of Museum and
Library Services under the provisions of the Library Services and
Technology Act, administered in California by the State Librarian.

Source Newspapers
The California Digital Newspaper Collection contains over forty
historical papers. While many of the papers were short-lived, some
were published for decades, including:
Los Angeles Herald (Los Angeles, 1873-1910)
Weekly Alta California (San Francisco, 1849)
Sacramento Daily Union (Sacramento, 1851-1899)
San Francisco Call (San Francisco, 1890-1913)
Pacific Rural Press (San Francisco, 1871-1922)
Marin Journal (San Rafael, 1861-1920)
Daily Alta California (San Francisco, 1849-1891)
Stanford Daily Archive Now Online
The Stanford Daily (originally the Daily Palo Alto) began publishing in
the fall of 1892. Recently, the group Friends of the Stanford Daily
has been responsible for digitizing this publication, also using
Veridian as its digital management software.
Having this long-running publication on-line offers additional historical
material for researchers--written from a student point of view. People
interested in accessing this valuable archive can use the link above.
Volunteers Needed To Restore Digitized Newspapers
While the digitizing of our historical newspapers by organizations like
UC Riverside (CDNC), and the Library of Congress (LOC), is a huge step
in opening up our collective past for public access, these collections
are far from usable because of the OCR errors incurred during
digitization.
Given the size of these collections, seeing them corrected soon is not
likely. But given that the UC Riverside folks have done so much work,
helping in the correction process would be the next step for people
who want to see California newspaper history available to everyone.
The ultimate answer to this problem is better OCR software, but the
short-term solution is for volunteers to manually correct these errors.
Local historical associations could help by committing to correct, say,
100,000 lines a year, as a group project. Given the magnitude of this
effort, it will take thousands of people many years to make the
corrections necessary to render all of these papers usable for digital
research.
DATA MINING
Data mining--often called knowledge discovery--involves the use of
computer programs to look for, and extract, data from large data sets,
whether that data is held in databases or in free text. Described
somewhat more formally, it is the computational process of discovering
patterns in large data sets, involving methods at the intersection of
artificial intelligence, machine learning, statistics, and database
systems.
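As a very small taste of what this means for digitized newspapers, the
sketch below counts how often a term appears in articles from each year,
the simplest kind of pattern a researcher might look for. It is an
illustrative assumption about how such a count could be done, not a
description of any existing tool, and the sample articles are invented.

```python
import re
from collections import Counter


def term_counts_by_year(articles, term):
    """Count whole-word occurrences of a term, grouped by year.

    articles is an iterable of (year, text) pairs.
    """
    pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
    counts = Counter()
    for year, text in articles:
        counts[year] += len(pattern.findall(text.lower()))
    return dict(sorted(counts.items()))


# Hypothetical (year, text) pairs standing in for digitized articles.
articles = [
    (1869, "The railroad is complete. Railroad fares announced."),
    (1869, "Harvest reports from the valley."),
    (1870, "New railroad depot opens."),
    (1871, "Drought concerns farmers."),
]

trend = term_counts_by_year(articles, "railroad")
print(trend)  # {1869: 2, 1870: 1, 1871: 0}
```

Run over millions of articles rather than four, the same count begins
to show when a subject entered the news and when it faded, which no
reader of microfilm could practically tabulate by hand.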
Data mining is both relatively new and so large in scope that it is
difficult to describe its domain in simple terms. Moreover, there really
are no bounds to data mining, so we can expect to see ever more
amazing results from its application to historical research in
the coming years.
While the domain of data mining is quite extensive, all of these
investigations necessarily begin with digitized data.
Hence, the need to digitize as much of the world's printed material
as possible grows more important every day.
Given the growing power of knowledge discovery and analysis that is
available to us via computer software, it becomes ever more important
for us, as a society, to commit increasing resources to the process of
digitizing and archiving our nation's history and culture, much of
which is currently bound up in newspapers and microfilm.
CONCLUSION
The Internet and digital technologies are radically changing the research
methods of professional historians, as well as offering access to
members of the public interested in history and historical data.
Visiting institutional libraries and archives, which historical research
has traditionally required, is becoming unnecessary, as the Internet
allows world-wide access to just about any resource that has been
digitized. Research that used to take months now takes only weeks, or
sometimes days. History will be opened up to anyone interested in
learning about the past.
Shifting to the digital domain requires a different view of
historical resources, such as books, newspapers, maps,
photographs and so on. The view that these resources were somehow
not to be available to society at large needs to be replaced with the
view that everyone has a right, and perhaps even an obligation, to use
these materials when the need arises.
Public policy needs to direct increased funding to the digital domain,
away from the traditional paper domain. Increased spending is needed
to acquire the rights to copyrighted materials to be added
to our digital archives. In addition, software that enables research is
in short supply, and the software that does exist has many shortcomings.
Grants to software developers to correct problems with current
software, as well as to develop new software, are a more pressing need
than the purchase of more books for local libraries.
Wayne Martin
Independent Researcher
Palo Alto, CA

wmartin46@yahoo.com
www.scribd.com/wmartin46
www.youtube.com/wmartin46
