Discovery of File Relationships

Many contemporary digital files are replicated during dissemination, updated on a regular basis, or partially rewritten. In addition, information from multiple files is often used to create a new file, or information from one file is split across many files. Thus, relationships among digital files become very complex if the information they contain is tracked through design, creation, dissemination and modification. Furthermore, the information about file relationships has commonly been lost by the time digital files are scheduled for archiving. Our objective has been to design and develop an exploratory framework that assists archivists in gathering information about potential links between files and supports visual exploration leading to discoveries of file relationships.

Figure 1. An overall design for discovering relationships among multiple sources of electronic records.

The overall design for discovering relationships uses components that extract bits of information about files (metadata) and, if possible, their content. The file system of the incoming media is inspected first, and the metadata are generated and stored in RDF triple format. Next, content analysis is performed and content-related RDF triples are stored. The metadata storage is then searched and the relationships among the documents are established. Finally, the electronic records are organized and processed for archival purposes. The approach is illustrated in Figure 1.

In more detail, the framework is based on extraction of metadata from a file system using the Aperture Java framework, and on extraction of metadata about file formats using the DROID file format identification scheme against the PRONOM registry and the NCSA 3D file registry Polyglot. The metadata are stored in RDF triple representation using Tupelo, a semantic content management repository. Finally, the relationships are visualized by the interactive RDF viewer File To Learn (File2Learn).
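
As a rough illustration of the first step (file system inspection producing triples), the sketch below describes one file as a list of (subject, predicate, object) tuples. It is written in Python for brevity and does not reproduce the Aperture or Tupelo APIs; the namespace `FS` and the predicate names are hypothetical and chosen only for this example.

```python
import os
from datetime import datetime, timezone

# Hypothetical namespace and predicate names, used only in this sketch.
FS = "http://example.org/fs#"


def file_system_triples(path):
    """Describe one file as (subject, predicate, object) tuples."""
    absolute = os.path.abspath(path)
    info = os.stat(absolute)
    subject = "file://" + absolute.replace(os.sep, "/")
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return [
        (subject, FS + "fileName", os.path.basename(absolute)),
        (subject, FS + "parentFolder", os.path.dirname(absolute)),
        (subject, FS + "byteSize", str(info.st_size)),
        (subject, FS + "lastModified", modified.isoformat()),
    ]
```

Calling `file_system_triples` on every file of an incoming medium would give a flat list of triples that a repository could store. The parent folder predicate is included because folder relationships are one of the sources of connecting links discussed later.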

The focus of this study is on finding relationships between scanned 2D engineering drawings (blueprints that carry information in normalized blocks, such as the Title, References, and Drawing number blocks) and their corresponding 3D CAD models. The challenges in processing sets of technical drawings are: (a) the drawings are scanned at low resolution, (b) the information blocks have different shapes or are only partially normalized, and (c) the lines are skewed and not well defined, and the text is a combination of pre-printed characters and freehand lettering (the preferred font has been uppercase Gothic). Our goal is to discover file relationships based on metadata matching between the set of two-dimensional (2D) engineering drawings and the three-dimensional (3D) CAD models. The design of the proposed system for file relationship discovery includes automatic optical character recognition (OCR) of the information fields within the blocks, using a Java program and AutoHotKeys scripting of the OCR software by ABBYY, with the extracted information mapped by an ontology.

A collection of electronic records that correspond to 2D image drawings and 3D CAD models of NAVY ships has been analyzed. This test collection, which currently includes 784 2D image drawings for the Torpedo Weapon Retriever (TWR 841) ship and about 22 CAD models, presents a problem of unknown relationships among files.

Figure 2. Title block template used in NAVY showing various information fields: (A) vendor, record of preparation, (B) drawing title, (C) preparing activity, (D)(E) parts are not standardized, (F) Code identification number, (G) drawing size, (H) drawing number, (J) scale, (K) specification number and (L) sheet number.

Metadata location in 2D Image Drawings: Our goal was to understand the challenges of extracting the metadata about each drawing from the image content. We focused on the lower right corner of each drawing, which contains a Title block table with multiple entries describing the drawing and a References block relating the particular drawing to a project class and to NAVY NAVSEA MMC (Marinette Marine Corporation) drawing numbers. Examples of information blocks are presented in Figures 2 and 3.

Optical character recognition (OCR) is a method for extracting text information from images. Based on the performance of OCR software, especially on scanned templates with low resolution (< 300 dpi), we selected the OCR software by ABBYY as the best OCR package.

The blocks were located manually by recording the coordinates of their upper left and bottom right corners. The block sizes were matched against the pre-defined templates and the OCR was applied. The content analysis has to be automated since the volume of engineering drawings is too large for manual transcription. The problem has been decomposed into the following sub-tasks:

Figure 3. Examples of Reference (top), Title block (bottom) and MMC number (bottom rim) in a 2D technical drawing file. Text extracted from various sub-areas (mainly the drawing numbers) was used for establishing relationships among the documents.

  1. Develop templates and ontology for all fields in information blocks.
  2. Detect the information block in an image and crop the area.
  3. Identify the type of information block template and crop sub-areas for each information field.
  4. Apply OCR software to each field and associate it with text strings resulting from OCR.
  5. Perform cleansing of the text strings extracted by OCR according to the ontology.
  6. Represent text strings as metadata RDF triples and store them in a shared repository.
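
Sub-tasks 5 and 6 can be sketched as follows. This is a minimal Python illustration, not the project's Java code. The `tdrw:` prefix stands in for the custom ontology, the field names `drawingNumber` and `drawingTitle` are hypothetical, and the table of letter-to-digit substitutions is an assumption about typical OCR confusions rather than a rule taken from the ontology.

```python
import re

# Hypothetical prefix for the custom technical drawing ontology.
TDRW = "tdrw:"

# Assumed OCR confusions in numeric fields (letter read instead of digit).
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)


def clean_drawing_number(raw):
    """Normalize an OCR string for a field expected to hold digits."""
    text = raw.strip().translate(DIGIT_FIXES)
    # Keep only digits and hyphens; everything else is treated as noise.
    return re.sub(r"[^0-9-]", "", text)


def clean_text_field(raw):
    """Collapse runs of whitespace and upper-case a free-text field."""
    return re.sub(r"\s+", " ", raw).strip().upper()


# Hypothetical field names mapped to the cleansing rule for that field.
FIELD_CLEANERS = {
    "drawingNumber": clean_drawing_number,
    "drawingTitle": clean_text_field,
}


def fields_to_triples(file_uri, ocr_fields):
    """Turn {field: raw OCR text} into (subject, predicate, object) tuples."""
    triples = []
    for field, raw in ocr_fields.items():
        cleaner = FIELD_CLEANERS.get(field, clean_text_field)
        value = cleaner(raw)
        if value:  # skip fields where nothing usable survived cleansing
            triples.append((file_uri, TDRW + field, value))
    return triples
```

Choosing the cleansing rule per field reflects the observation that digit fields and free-text fields fail in different ways under OCR, so a single generic cleanup would either damage titles or leave drawing numbers noisy.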

Multiple metadata entries have been generated from the drawings’ blocks. Corresponding ontologies (Technical drawing/Blocks/NAVY) suitable for a proper RDF implementation could not be found in open sources. A custom ontology for drawings (tdrw) has therefore been developed, reusing existing standards as much as possible, such as the RDF vocabulary (rdf), the Dublin Core Metadata Element Set (dc) and the Friend of a Friend project (foaf). Figure 4 shows a snapshot of the metadata description from the title block in an RDF/XML representation using the tdrw ontology.

Figure 4. Part of the RDF/XML metadata description using technical drawing (tdrw) ontology.

The current implementation of the OCR framework uses the ISDA ImageToLearn (Im2Learn) library written in Java for cropping, ABBYY software (a desktop application) for OCR, an RDF parser library, and NCSA Tupelo as a metadata repository. The automation is achieved by custom Java code that crops an information block area using the Im2Learn library and creates sub-areas with title block fields, calls an AutoHotKeys script to launch the ABBYY OCR software, and creates the RDF triple metadata representation and saves the metadata into a repository. We use the ABBYY FineReader 9.0 OCR package for optical character recognition. FineReader is a single-computer, standalone version of the ABBYY OCR engine.

A total of 784 2D drawings in TIFF file format have been identified. They include 170 title blocks, 700 continuation title blocks, 150 reference blocks, a dozen revision and list of material blocks, and about 200 additional areas with the MMC drawing numbers. The most difficult parts for OCR are the fields with handwritten information, such as names and dates, and the fields with measuring units. The most important fields, those with drawing numbers and drawing references (digits), are resolved much better.

In addition to the drawing files, 22 3D CAD models of the same system have been analyzed by extracting the metadata portion of the file specification. All RDF triples are stored in a metadata repository through Tupelo, a metadata management system based on semantic web technologies. The advantage of this solution is that Tupelo bridges various applications (desktop, web-based, etc.) on one side and the underlying storage tools and technologies (for example, MySQL and PostGIS databases) on the other side. The File Relationship Discovery framework accesses the triples again through Tupelo.

STEP metadata specification | 3D CAD file metadata in STEP file format | Title block metadata of a 2D TIFF file format
FILE_DESCRIPTION (/* description */ (''), /* implementation_level */ '2;1'); | FILE_DESCRIPTION ((''), /* implementation_level */ '2;1'); |
FILE_NAME (/* name */ '', | FILE_NAME ('D:\\NARA\\Archieve_data_samples\\BHD_FR12\\U2110_BHD12_2007_05_09.stp', | FILE_NAME ('120 TORPEDO WEAPONS RETRIEVER, MAIN DECK',
/* time_stamp */ '', | '2007-05 10T13:45:37', | '04-10-86',
/* author */ (''), | ('rakowpj'), | ('LDOBSON'),
/* organization */ (''), | (''), | ('NAVAL SEA COMMAND'),
/* preprocessor_version */ ' ', | 'Autodesk Inventor 11', | '',
/* originating_system */ '', | 'IDA-STEP', | '',
/* authorization */ ''); | ''); | '');
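
To show how the FILE_NAME attributes in the table could be pulled out of a STEP header, here is a simplified Python sketch. It assumes the attribute order shown in the left column of the table, handles only quoted strings and flat lists of strings, and is not a full ISO 10303-21 parser. The file name `model.stp` in the sample is hypothetical; the other sample values are taken from the table.

```python
import re

# Attribute order of the FILE_NAME header entity, as in the table above.
FILE_NAME_FIELDS = (
    "name", "time_stamp", "author", "organization",
    "preprocessor_version", "originating_system", "authorization",
)

STRING = r"'((?:[^']|'')*)'"              # quoted string, '' is an escaped quote
ARGUMENT = r"\(([^()]*)\)|" + STRING      # flat list of strings, or one string


def parse_file_name(header_text):
    """Return {attribute: value} for FILE_NAME, or None if it is absent."""
    # Drop /* ... */ comments before looking for the entity.
    text = re.sub(r"/\*.*?\*/", "", header_text, flags=re.DOTALL)
    # Simplification: assumes no string value contains the sequence ');'.
    match = re.search(r"FILE_NAME\s*\((.*?)\)\s*;", text, re.DOTALL)
    if match is None:
        return None
    values = []
    for arg in re.finditer(ARGUMENT, match.group(1)):
        if arg.group(1) is not None:      # parenthesized list of strings
            items = re.findall(STRING, arg.group(1))
            values.append([item.replace("''", "'") for item in items])
        else:                             # single quoted string
            values.append(arg.group(2).replace("''", "'"))
    return dict(zip(FILE_NAME_FIELDS, values))


# Sample header; 'model.stp' is a hypothetical file name.
SAMPLE_HEADER = """
FILE_NAME (/* name */ 'model.stp',
/* time_stamp */ '',
/* author */ ('rakowpj'),
/* organization */ (''),
/* preprocessor_version */ 'Autodesk Inventor 11',
/* originating_system */ 'IDA-STEP',
/* authorization */ '');
"""

if __name__ == "__main__":
    for field, value in parse_file_name(SAMPLE_HEADER).items():
        print(field, "=", value)
```

Each attribute could then be emitted as one triple per value, so that the STEP `author` list and the title block author end up comparable in the repository.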

The following challenges in discovering relationships have been encountered so far:

  • There is a lack of evidence (insufficient metadata information) extracted by the currently used metadata extraction software. The table above shows an example of insufficient overlapping information for metadata gathered from 3D CAD files in STEP file format and from 2D engineering drawings in TIFF file format. The left column represents the STEP metadata specification; the center and right columns show the same information extracted from the STEP 3D file and the 2D title block of the same object.
  • There are multiple predicates referring to the same semantic meaning in the ontologies used by individual metadata extraction programs (e.g., creator and author). A new editing functionality that allows the user to create new mapping RDF triples has been added to the File2Learn tool (described below).
  • There is a lack of understanding about how to find and how to rank overlapping information. For visual inspection, different colors are used for RDF triples without matched predicates, with the same predicates, and with the same predicates and values.
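
The second challenge, predicates with the same meaning under different names, can be illustrated with a small Python sketch in which mapping triples are used to rewrite predicates to one canonical name before comparison. How File2Learn stores its mapping triples is not described here. The use of `owl:equivalentProperty` as the mapping predicate and the `step:` prefix are assumptions made for this example.

```python
# Assumed predicate for mapping triples; the tool's actual choice is not stated.
EQUIVALENT = "owl:equivalentProperty"


def build_canonical_map(mapping_triples):
    """Group equivalent predicates and pick one representative per group."""
    parent = {}

    def find(pred):
        parent.setdefault(pred, pred)
        while parent[pred] != pred:
            parent[pred] = parent[parent[pred]]  # shorten the chain as we go
            pred = parent[pred]
        return pred

    for left, relation, right in mapping_triples:
        if relation != EQUIVALENT:
            continue
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Deterministic choice: the alphabetically first name wins.
            keep, drop = sorted((root_left, root_right))
            parent[drop] = keep
    return {pred: find(pred) for pred in list(parent)}


def normalize(triples, canonical):
    """Rewrite each predicate to its canonical name, if one is known."""
    return [(s, canonical.get(p, p), o) for s, p, o in triples]


if __name__ == "__main__":
    mappings = [("dc:creator", EQUIVALENT, "step:author")]
    canonical = build_canonical_map(mappings)
    print(normalize([("model.stp", "step:author", "rakowpj")], canonical))
```

Grouping equivalent predicates transitively means a user only needs to add one mapping triple per new synonym, rather than one per pair of predicates.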

In order to test a solution to the third problem, overlapping information has been inserted manually. Specifically, the author value in the metadata about the STEP file has been changed to the author tag of the original 2D drawing (LDOBSON).

File To Learn (File2Learn)

Figure 5. File to Learn, a visualization and editing tool, in graphic mode. Connecting lines represent metadata shared by the two Title blocks of engineering drawings.

Visual inspection and insertion of discovered relationships have been enabled by developing a graphical user interface, File2Learn, to all acquired metadata and by adding analytical capabilities to support discoveries. A user can select two files; the search engine then finds all connecting links, and the overlapping information is presented to the end user together with a small preview of each file (we added preview support for STEP and TIFF files). A user can determine whether a 3D model was created from a 2D engineering drawing and then create a symbolic link between the two files for future queries. Such relationships can be discovered by examining different pieces of metadata, such as the creator of an engineering drawing and the creator of a 3D model, or the file locations of the two files.

Figure 6. File to Learn, a tabular presentation of the overlapping information.

Figure 5 shows a pair of files in 2D file format with five connecting links based on metadata extracted from a file system (the files belong to folders that are topologically related, e.g., 2D files share a parent folder). Previews of file contents in separate panes have been added to support visual confirmation of file relationships. Figure 6 presents the same information in a tabular view with color-coded entries. The colors correspond to file descriptors that are equal (blue: the same predicate and value), are not equal (red: the same predicate but a different value) and are different (green: a predicate with a value occurs in only one file as a descriptor).
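
The color coding described for the tabular view can be expressed as a short comparison routine. The Python sketch below is an illustration of the three categories and is not File2Learn's implementation. The predicate names are hypothetical, and treating a predicate as "equal" when any one of its values is shared is an assumption for multi-valued predicates.

```python
from collections import defaultdict


def values_by_predicate(triples):
    """Map each predicate to the set of values it takes for one file."""
    grouped = defaultdict(set)
    for _subject, predicate, value in triples:
        grouped[predicate].add(value)
    return grouped


def classify_descriptors(triples_a, triples_b):
    """Return {predicate: color} for the descriptors of two files.

    blue  = same predicate and at least one shared value
    red   = same predicate, no shared value
    green = predicate occurs for only one of the two files
    """
    a = values_by_predicate(triples_a)
    b = values_by_predicate(triples_b)
    colors = {}
    for predicate in sorted(set(a) | set(b)):
        if predicate in a and predicate in b:
            colors[predicate] = "blue" if a[predicate] & b[predicate] else "red"
        else:
            colors[predicate] = "green"
    return colors


if __name__ == "__main__":
    # Values follow the table above, with the STEP author set to LDOBSON
    # as in the manual test; the predicate names are hypothetical.
    drawing = [("2d", "author", "LDOBSON"),
               ("2d", "organization", "NAVAL SEA COMMAND")]
    model = [("3d", "author", "LDOBSON"),
             ("3d", "organization", ""),
             ("3d", "originating_system", "IDA-STEP")]
    print(classify_descriptors(drawing, model))
```

Counting the blue entries for a pair of files would give one simple way to rank candidate relationships, although the text leaves the ranking question open.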

People, Publications, Presentations

Team members

  • Peter Bajcsy
    Research group ISDA, National Center for Supercomputing Applications, University of Illinois
  • Jason Kastner
    ISDA, NCSA, University of Illinois
  • Michal Ondrejcek
    ISDA, NCSA, University of Illinois
  • Rob Kooper
    ISDA, NCSA, University of Illinois

Publications and presentations

  • Michal Ondrejcek, Jason Kastner, Rob Kooper and Peter Bajcsy "Information Extraction from Scanned Engineering Drawings.", Technical Report NCSA-ISDA09-001, December 31, 2009. [pdf 1.3MB]
  • M. Ondrejcek, J. Kastner, R. Kooper and P. Bajcsy, "A Methodology for File Relationship Discovery.", 5th International IEEE eScience conference (IEEE e-Science 2009), Oxford, UK, December 9 - 11, 2009 [abstract][pdf 630kB]
  • M. Ondrejcek, J. Kastner, R. Kooper, P. Bajcsy, and K. McHenry, "A Methodology for File Relationship Discovery.", 2009 Microsoft eScience Workshop (e-Science Workshop 2009), Pittsburgh, PA, October 16 – 17, 2009, poster presentation [abstract][pdf 50kB]
  • Michal Ondrejcek, Jason Kastner, Rob Kooper, Kenton McHenry and Peter Bajcsy, "Discovery of relationships between engineering drawings and 3D CAD models for business intelligence gathering.", 2009 NCSA Private Sector Program (PSP) Annual Meeting, May 13-15, 2009, Champaign-Urbana, Illinois, poster presentation [agenda], [abstract]
For additional publications related to this research, please, visit the ERA research web site at