From Large Volumes of Scanned Lincoln Papers to Virtual Observatories.

Michal Ondrejcek and Peter Bajcsy

2008 NCSA Private Sector Program (PSP) Annual Meeting, May 12-14, NCSA, Illinois (2008)

This poster and live demonstration present an end-to-end system for delivering diverse information to the public and to a broad community of humanists studying Abraham Lincoln's life. The information is drawn from distributed resources and comprises large volumes of scanned writings from Abraham Lincoln's communications, historical maps from Lincoln's period, and the sixteenth President's daily activities as summarized in the Lincoln log. We will demonstrate a prototype system in which (a) scanned writings have been automatically pre-processed (cropped to eliminate color scale bars, then scaled and compressed for quick preview and fast retrieval), (b) scans have been georeferenced and temporally correlated with events in the Lincoln log, (c) historical maps have been georeferenced and re-projected to match the requirements of the Google Maps interface, and (d) a layered Web interface has been developed to provide access to the information along its multiple dimensions (spatial, temporal, document, and relational).
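As an illustration of the re-projection step in (c), the following is a minimal sketch, not the prototype's actual code. It assumes the target is the spherical Mercator projection (EPSG:3857) used by Google Maps, and that map control points are already available as longitude/latitude in degrees. The function name and the sample coordinates (roughly Springfield, Illinois) are hypothetical.

```python
import math

# Radius of the sphere used by spherical (Web) Mercator, in meters.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator x/y in meters.

    Hypothetical helper: assumes the georeferenced map points are given
    in geographic coordinates and lie well away from the poles.
    """
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4.0 + math.radians(lat_deg) / 2.0)
    )
    return x, y


if __name__ == "__main__":
    # Approximate location of Springfield, Illinois (illustrative only).
    x, y = lonlat_to_web_mercator(-89.65, 39.80)
    print(f"x = {x:.1f} m, y = {y:.1f} m")
```

In practice a historical map would be warped as a whole raster by a geospatial library; the point-wise formula above only shows the coordinate transform that such a warp applies.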

The objective of this prototype system is not only to support building a virtual observatory of materials related to the sixteenth President, but also to illustrate the computer science challenges that the prototype addresses: (a) a large volume of image scans (automated processing of 24,000 pages now, growing to 200,000 to 300,000 pages in the future; at about 150 MB per page, the projected total is ~37 TB), (b) "dirty" metadata, (c) incomplete georeferencing information, (d) uncertain temporal information, and (e) the complexity of high-dimensional information when delivered to end users.
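The storage figures in (a) can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and the quoted figure of about 150 MB per page; the function name is hypothetical. Under these assumptions, ~37 TB corresponds to the middle of the projected 200,000 to 300,000 page range, while the current 24,000 pages come to only a few terabytes.

```python
MB_PER_PAGE = 150        # approximate size of one scanned page, from the text
MB_PER_TB = 1_000_000    # assumption: decimal units (1 TB = 10^6 MB)


def total_terabytes(pages, mb_per_page=MB_PER_PAGE):
    """Estimate total storage in TB for a given number of scanned pages."""
    return pages * mb_per_page / MB_PER_TB


if __name__ == "__main__":
    for pages in (24_000, 200_000, 250_000, 300_000):
        print(f"{pages:>7} pages: about {total_terabytes(pages):.1f} TB")
```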