The volume of contemporary documents and the
				number of embedded object types are steadily growing. There is a lack of
				understanding of how to compare documents containing heterogeneous
				digital objects and of which hardware and software configurations would
				be cost-efficient for handling document-related operations such as
				appraisals. The project addresses the problems of image/text/vector
				graphics extraction from PDF documents, computation of
				image/text/vector graphics characteristics, fusion of similarity
				metrics computed for all PDF components, grouping of documents,
				integrity verification, and sampling. Additionally, we perform
				counting operations using cluster computing and the Map and Reduce
				programming paradigm.
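
The counting operations mentioned above can be illustrated with a minimal, single-machine sketch of the Map and Reduce pattern. It is not the project's cluster implementation: the input layout (one list of embedded object types per document) and the function names are assumptions made for illustration only.

```python
from collections import Counter
from functools import reduce


def map_document(object_types):
    """Map step: count the embedded object types found in one document."""
    return Counter(object_types)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_object_types(documents):
    """Apply the map step to every document, then fold the partial counts."""
    partial_counts = map(map_document, documents)
    return reduce(reduce_counts, partial_counts, Counter())


# Hypothetical input: object types extracted from three PDF documents.
documents = [
    ["image", "text", "text"],
    ["vector", "image"],
    ["text"],
]
totals = count_object_types(documents)
```

On a cluster, the map step would run independently per document on different machines and only the partial counts would be merged, which is why the reduce function is kept associative.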
				 
				PDF content as shown in
				the developed framework, Document To Learn (Doc2learn). Here image
				primitives are described as groups of colors.
				
The figure shows multiple views of the PDF content as needed
				for appraisal. While a PDF reader presents the document
				layout page by page, other views might be needed for
				understanding the characteristics of PDF components.
				
				
				
				 
				IOGraph: An illustration
				of how the IOGraph tool is used for finding the shortest conversion
				sequence from an original 3D file format (stp) to a target 3D file
				format (ply).
				 
				
				Motivated by the large number of file formats
				within the 3D file domain, NCSA Polyglot was created to provide an
				extensible, scalable, and quantifiable means of converting between
				formats. The system is extensible in that new conversion software
				is easy to incorporate, scalable in that the workload can be
				distributed among parallel machines, and quantifiable in that it
				has a built-in framework for measuring information loss across
				conversions.
				The IOGraph tool (left) provides information
				about the shortest conversion path from a source file format to a
				target format. Around 140 file formats for 3D data and 17 software
				packages with import/export and 3D display functionalities are
				supported.
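
The idea of a shortest conversion path can be sketched as a breadth-first search over a directed graph whose nodes are file formats and whose edges are conversions offered by a software package. This is an illustrative sketch only, not the IOGraph implementation: the tool names (`tool_a`, `tool_b`, `tool_c`), the listed conversions, and the choice of "fewest conversion steps" as the path cost are all assumptions.

```python
from collections import deque


def shortest_conversion_path(conversions, source, target):
    """Return the fewest-step conversion sequence from source to target.

    conversions is an iterable of (software, input_format, output_format).
    The result is a list of (input_format, software, output_format) steps,
    an empty list if source equals target, or None if no path exists.
    """
    graph = {}
    for software, src, dst in conversions:
        graph.setdefault(src, []).append((dst, software))

    previous = {source: None}  # format -> (preceding format, software used)
    queue = deque([source])
    while queue:
        fmt = queue.popleft()
        if fmt == target:
            break
        for next_fmt, software in graph.get(fmt, []):
            if next_fmt not in previous:
                previous[next_fmt] = (fmt, software)
                queue.append(next_fmt)

    if target not in previous:
        return None

    # Walk back from the target to the source to recover the steps.
    steps = []
    fmt = target
    while previous[fmt] is not None:
        prev_fmt, software = previous[fmt]
        steps.append((prev_fmt, software, fmt))
        fmt = prev_fmt
    steps.reverse()
    return steps


# Hypothetical conversion capabilities, for illustration only.
conversions = [
    ("tool_a", "stp", "obj"),
    ("tool_a", "stp", "stl"),
    ("tool_b", "obj", "ply"),
    ("tool_c", "stl", "wrl"),
    ("tool_c", "wrl", "ply"),
]
path = shortest_conversion_path(conversions, "stp", "ply")
```

If each conversion carried a measured information loss, the same graph could be searched with a weighted shortest-path algorithm instead of counting steps.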
				 
				
				
				
				
				
				
				 
				The front end of the 'Add->Script' pane of the Conversion
				Software Registry web interface.
				 
				The Conversion Software Registry (CSR) has been designed for collecting information
				about software packages that are capable of file format conversions. The CSR is part of a broader
				initiative to understand and solve the file format conversion problem.
				The work is motivated by a community need for finding file format conversions that are inaccessible
				via current search engines and by the specific need to support systems that can actually perform conversions,
				such as the new generation of the
				scalable, multi-OS NCSA Polyglot conversion service framework. In addition, the CSR complements the existing file format registries and introduces software quality information obtained
				by content-based comparisons of files before and after conversions. The contribution of this work is the CSR data model
				design, which includes conversions keyed by file format extension, as well as software scripts, software quality measures,
				and test-file-specific information for evaluating software quality.
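
The elements of the data model listed above (extension-based conversions, scripts, quality measures, and test files) could be related roughly as in the following sketch. It is not the actual CSR schema: every class, field, and value here is a hypothetical name chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QualityMeasure:
    """One quality score obtained by comparing a test file before and after conversion."""
    test_file: str
    measure_name: str
    value: float


@dataclass
class Conversion:
    """A conversion identified by input and output file format extensions."""
    input_extension: str
    output_extension: str
    quality: list = field(default_factory=list)  # list of QualityMeasure


@dataclass
class Software:
    """A software package with its supported conversions and wrapper scripts."""
    name: str
    conversions: list = field(default_factory=list)  # list of Conversion
    scripts: list = field(default_factory=list)      # script file names


def find_software(registry, input_extension, output_extension):
    """Return names of registered software offering the requested conversion."""
    return [
        software.name
        for software in registry
        if any(
            c.input_extension == input_extension
            and c.output_extension == output_extension
            for c in software.conversions
        )
    ]


# Hypothetical registry content, for illustration only.
registry = [
    Software(
        name="tool_a",
        conversions=[
            Conversion("stp", "obj", [QualityMeasure("part.stp", "surface_area_ratio", 0.98)]),
        ],
        scripts=["tool_a_convert.script"],
    ),
    Software(name="tool_b", conversions=[Conversion("obj", "ply")]),
]
matches = find_software(registry, "stp", "obj")
```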
				 
				
				
				
				
				
				 
				File to Learn (File2Learn),
				a visualization and editing tool, in graphic mode. Connecting lines
				represent metadata shared by the title blocks of two engineering
				drawings.
				 
				A framework has been developed for discovering file relationships
				between pairs of 2D engineering drawings, and between 2D engineering
				drawings and the 3D CAD models obtained from them.
				The framework consists of modules for automated file
				system and content-based metadata extraction, for metadata
				organization and storage, and for exploratory visual inspection and
				insertion of discovered relationships between pairs of files. The
				system-level metadata extraction is accomplished by using Aperture
				software and yields information about file name, MIME type, and
				disk location. Additional metadata are obtained by performing file
				format identification with DROID software, based on the PRONOM
				repository of file formats. Metadata from content-based analyses
				come from 2D engineering drawings by applying Optical Character
				Recognition (OCR), and from 3D CAD models by keyword-based
				extraction of information. All extracted metadata are represented as
				RDF triples and stored using the Tupelo semantic content management
				repository. Exploratory visual inspection and insertion of
				discovered relationships have been enabled by developing a graphical
				user interface, File2Learn, to all acquired metadata and by adding
				analytical capabilities to support discoveries.
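
As a rough illustration of how metadata stored as triples can suggest relationships between files, the sketch below groups files that share the same predicate and value. It uses plain Python tuples rather than Tupelo or any RDF library, and the file names, predicates, and values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (subject, predicate, object) triples, for illustration only.
triples = [
    ("file:drawing_a.tif", "meta:mimeType", "image/tiff"),
    ("file:drawing_a.tif", "meta:partNumber", "P-100"),
    ("file:drawing_b.tif", "meta:partNumber", "P-100"),
    ("file:model_c.stp", "meta:partNumber", "P-100"),
    ("file:model_c.stp", "meta:fileName", "model_c.stp"),
]


def shared_metadata(triples):
    """Return (file_a, file_b, predicate, value) for each pair sharing a value."""
    subjects_by_value = defaultdict(set)
    for subject, predicate, value in triples:
        subjects_by_value[(predicate, value)].add(subject)

    links = []
    for (predicate, value), subjects in subjects_by_value.items():
        # Every pair of files with the same predicate/value is a candidate link.
        for file_a, file_b in combinations(sorted(subjects), 2):
            links.append((file_a, file_b, predicate, value))
    return sorted(links)


links = shared_metadata(triples)
```

Candidate links found this way would still need the visual inspection step described above before being inserted as confirmed relationships.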