A Methodology for File Relationship Discovery

Presenter: Kenton McHenry, NCSA
Authors: Michal Ondrejcek, Jason Kastner, Rob Kooper and Peter Bajcsy; Collaborator: Kenton McHenry

2009 Microsoft eScience Workshop (e-Science Workshop 2009)
Pittsburgh, PA, October 16–17, 2009, poster presentation

We present a framework for discovering relationships between pairs of 2D engineering drawings, and between 2D engineering drawings and the 3D CAD models obtained from them. The framework consists of modules for automated file-system and content-based metadata extraction, for metadata organization and storage, and for exploratory visual inspection and insertion of discovered relationships between pairs of files. System-level metadata extraction is performed with the Aperture software and yields the file name, MIME type, and disk location. Additional metadata are obtained by identifying file formats with the DROID software, which relies on the PRONOM registry of file formats.
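
The kind of system-level record described above can be illustrated with a minimal sketch. It is not the Aperture or DROID API: it uses only the Python standard library, the function name `system_metadata` and the dictionary keys are hypothetical, and the MIME type is guessed from the file extension, whereas DROID identifies formats from signatures inside the file.

```python
import mimetypes
from pathlib import Path


def system_metadata(path):
    """Return file name, MIME type, and disk location for one file.

    Hypothetical helper for illustration only; the MIME type is guessed
    from the extension, not from the file content.
    """
    p = Path(path)
    mime_type, _encoding = mimetypes.guess_type(p.name)
    return {
        "file_name": p.name,
        # Fall back to a generic type when the extension is unknown.
        "mime_type": mime_type or "application/octet-stream",
        # Absolute directory that contains the file.
        "location": str(p.resolve().parent),
    }
```

Calling `system_metadata` on every file of a collection would give one small record per file, which is the granularity at which the later modules organize and compare metadata.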

Content-based metadata are obtained from the 2D engineering drawings by applying Optical Character Recognition (OCR) and from the 3D CAD models by keyword-based extraction of information. All extracted metadata are represented as RDF triples and stored in the Tupelo semantic content management repository. Exploratory visual inspection and insertion of discovered relationships are enabled by a graphical user interface to all acquired metadata, with added analytical capabilities to support discoveries.
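
To make the triple representation concrete, the following sketch serializes one file's metadata as N-Triples lines, one (subject, predicate, object) statement per metadata field. It does not use the Tupelo API; the function name `to_ntriples`, the `http://example.org/meta#` vocabulary, and the subject URI in the usage note are assumptions chosen for illustration.

```python
def to_ntriples(subject_uri, metadata, vocab="http://example.org/meta#"):
    """Serialize a metadata dictionary as a list of N-Triples lines.

    Hypothetical helper: each key becomes a predicate in the assumed
    vocabulary and each value becomes a plain string literal.
    """
    lines = []
    for key, value in sorted(metadata.items()):
        # Escape the characters that N-Triples string literals require.
        literal = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        lines.append(f'<{subject_uri}> <{vocab}{key}> "{literal}" .')
    return lines
```

For example, `to_ntriples("urn:example:file1", {"mime_type": "image/png"})` yields a single statement linking the file to its MIME type. A discovered relationship between two files could be stored in the same way, with the second file's URI as the object instead of a literal.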

We tested the framework on a collection of electronic records about the Torpedo Weapons Recovery Vessel (TWR 841) archived by the US National Archives and Records Administration (NARA). This test collection poses the problem of unknown relationships among 784 2D image drawings and 22 CAD models.