The volume of contemporary documents and the
number of embedded object types are steadily growing. There is a lack of
understanding of how to compare documents containing heterogeneous
digital objects and what hardware and software configurations would
be cost-efficient for handling document-related operations such as
appraisals. The project addresses the problems of image/text/vector
graphics extraction from PDF documents, computation of
image/text/vector graphics characteristics, fusion of similarity
metrics computed for all PDF components, grouping of documents,
integrity verification, and sampling. Additionally, we perform
counting operations using cluster computing and the Map and Reduce
programming paradigm.
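The counting operations mentioned above can be illustrated with a minimal, single-machine sketch of the Map and Reduce pattern. It is not the project's cluster implementation: the color labels, the function names `map_document` and `merge_counts`, and the sample data are all assumptions made for illustration. On a cluster, each map call would typically run on a separate worker.

```python
from collections import Counter
from functools import reduce


def map_document(items):
    """Map step: count the items found in a single document."""
    return Counter(items)


def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical input: one list of color labels per document.
documents = [
    ["red", "blue", "red"],
    ["blue", "green"],
    ["red"],
]

# Each document is mapped independently, then the partial counts are reduced.
partial_counts = map(map_document, documents)
totals = reduce(merge_counts, partial_counts, Counter())
```

Because the reduce step is associative, partial counts can be merged in any order, which is what allows the work to be spread over many machines.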
PDF content is shown in
the developed framework, Document To Learn (Doc2learn). Here image
primitives are described as groups of colors.
The figure shows multiple views of the PDF content as needed
for appraisal. While the PDF reader would present the view of the
document layout per page, other views might be needed for
understanding the characteristics of PDF components.
IOGraph: An illustration
of how the IOGraph tool is used for finding the shortest conversion
sequence from an original 3D file format (stp) to a target 3D file
format (ply).
Motivated by the large number of file formats
within the 3D file domain, NCSA Polyglot was created to provide an
extensible, scalable, and quantifiable means of converting between
formats. The system is extensible in that new conversion software is easy to
incorporate, scalable in that it can distribute the workload among
parallel machines, and quantifiable in that it has
a built-in framework for measuring information loss across
conversions.
The IOGraph tool (left) provides information
about the shortest conversion path from a source file format to a
target format. Around 140 file formats for 3D data and 17 software
packages with import/export and 3D display functionalities are
supported.
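Finding a shortest conversion path can be sketched as a breadth-first search over a graph whose nodes are file formats and whose edges are conversions offered by some software. This is only an illustration under assumptions: the source does not state which algorithm IOGraph uses, "shortest" is taken here to mean fewest conversion steps, and the intermediate formats and tool names in the usage example are hypothetical.

```python
from collections import deque


def shortest_conversion_path(conversions, source, target):
    """Return the fewest-step list of (from_format, to_format, software)
    conversions leading from source to target, or None if unreachable."""
    # Build an adjacency list: format -> [(next_format, software), ...]
    graph = {}
    for src, dst, software in conversions:
        graph.setdefault(src, []).append((dst, software))

    came_from = {source: None}  # format -> (previous_format, software)
    queue = deque([source])
    while queue:
        fmt = queue.popleft()
        if fmt == target:
            break
        for nxt, software in graph.get(fmt, []):
            if nxt not in came_from:
                came_from[nxt] = (fmt, software)
                queue.append(nxt)

    if target not in came_from:
        return None

    # Walk back from the target to reconstruct the path.
    steps = []
    node = target
    while came_from[node] is not None:
        prev, software = came_from[node]
        steps.append((prev, node, software))
        node = prev
    steps.reverse()
    return steps


# Hypothetical conversion edges; only stp and ply come from the text above.
conversions = [
    ("stp", "igs", "tool_a"),
    ("igs", "obj", "tool_b"),
    ("stp", "obj", "tool_b"),
    ("obj", "ply", "tool_c"),
]
path = shortest_conversion_path(conversions, "stp", "ply")
```

A search weighted by measured information loss, rather than by the number of steps, would be a natural variation, but it would need a different algorithm such as Dijkstra's.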
The front end of the 'Add->Script' pane of the Conversion
Software Registry web interface.
The Conversion Software Registry (CSR) has been designed for collecting information
about software capable of file format conversions. The CSR is part of a broader
initiative to understand and solve the file format conversion problem.
The work is motivated by a community need to find file format conversions that are inaccessible
via current search engines and by the specific need to support systems that can actually perform conversions,
such as the new generation of the
scalable, multi-OS NCSA Polyglot conversion service framework. In addition, the CSR adds value by complementing the existing file format registries and by introducing software quality information obtained
by content-based comparisons of files before and after conversion. The contribution of this work is the CSR data model
design, which includes file format extension-based conversions as well as software scripts, software quality measures,
and test-file-specific information for evaluating software quality.
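The elements of the data model listed above can be pictured with a small sketch. It is not the actual CSR schema: the class names, field names, and sample entries are hypothetical and only mirror the four kinds of information the text names (extension-based conversions, scripts, quality measures, and test files).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QualityMeasurement:
    """One quality value obtained on one test file (hypothetical fields)."""
    test_file: str
    measure: str
    value: float


@dataclass
class Conversion:
    """A conversion between two file extensions offered by one software."""
    software: str
    input_extension: str
    output_extension: str
    script: Optional[str] = None
    quality: List[QualityMeasurement] = field(default_factory=list)


def find_conversions(registry, input_extension, output_extension):
    """Return all registered conversions between the two extensions."""
    return [
        c for c in registry
        if c.input_extension == input_extension
        and c.output_extension == output_extension
    ]


# Hypothetical registry entries for illustration only.
registry = [
    Conversion("tool_a", "stp", "obj", script="tool_a_stp_to_obj.script"),
    Conversion("tool_b", "obj", "ply",
               quality=[QualityMeasurement("cube.obj", "similarity", 0.97)]),
]
matches = find_conversions(registry, "stp", "obj")
```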
File to Learn (File2Learn),
a visualization and editing tool, in graphic mode. Connecting lines
represent metadata shared by the title blocks of two engineering
drawings.
A framework for file relationship discovery
between pairs of 2D engineering drawings, and between 2D engineering
drawings and the 3D CAD models obtained from them, has been
developed. The framework consists of modules for automated file
system and content-based metadata extraction, for metadata
organization and storage, and for exploratory visual inspection and
insertion of discovered relationships between pairs of files. The
system-level metadata extraction is accomplished with the Aperture
software and yields the file name, MIME type, and
disk location. Additional metadata are obtained by performing file
format identification with the DROID software, which is based on the PRONOM
repository of file formats. Metadata from content-based analyses
come from 2D engineering drawings by applying Optical Character
Recognition (OCR), and from 3D CAD models by keyword-based
extraction of information. All extracted metadata are represented as
RDF triples and stored in the Tupelo semantic content management
repository. Exploratory visual inspection and insertion of
discovered relationships are enabled by a graphical
user interface, File2Learn, that gives access to all acquired metadata and adds
analytical capabilities to support discoveries.
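The idea of representing metadata as triples and looking for relationships between two files can be sketched as follows. Plain Python tuples stand in for RDF triples here; this is not the Tupelo API, and the file names, predicate names, and values are hypothetical.

```python
def shared_metadata(triples, file_a, file_b, ignore=()):
    """Return the (predicate, object) pairs that both files have in common."""
    def properties(subject):
        return {
            (p, o) for s, p, o in triples
            if s == subject and p not in ignore
        }
    return properties(file_a) & properties(file_b)


# Hypothetical (subject, predicate, object) triples for two drawings.
triples = [
    ("drawing_1.tif", "mimeType", "image/tiff"),
    ("drawing_1.tif", "titleBlock:partNumber", "A-100"),
    ("drawing_1.tif", "titleBlock:drawnBy", "J. Smith"),
    ("drawing_2.tif", "mimeType", "image/tiff"),
    ("drawing_2.tif", "titleBlock:partNumber", "A-100"),
    ("drawing_2.tif", "titleBlock:drawnBy", "R. Jones"),
]

# Ignore system-level metadata so that only content-based matches remain.
common = shared_metadata(
    triples, "drawing_1.tif", "drawing_2.tif", ignore=("mimeType",)
)
```

Each pair in the result corresponds to one candidate relationship that a reviewer could confirm or reject during visual inspection.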