Mining large size complex PDF documents for industrial knowledge management and preservation.

Presenter: Rob Kooper, NCSA
Authors: Rob Kooper, William McFadden, Jason Kastner, Michal Ondrejcek, Kenton McHenry and Peter Bajcsy

2009 NCSA Private Sector Program (PSP) Annual Meeting, May 13-15, NCSA, Illinois (2009)

This poster addresses two problems: comprehensive document comparison, and the computational scalability of document mining using cluster computing and the Map and Reduce programming paradigm. Contemporary documents, such as those in the Adobe Portable Document Format (PDF), contain many types of digital objects that have to be extracted, represented and summarized for further appraisal analyses and data mining. While the volume of contemporary documents and the number of embedded object types have been growing steadily, there is a lack of understanding of (a) how to compare documents containing heterogeneous digital objects, and (b) what hardware and software configurations would be cost-efficient for document processing operations such as document appraisals.

We have designed a computer-assisted framework for content-based appraisal of contemporary documents. The novelty of our work lies in designing a methodology and a mathematical framework for comprehensive document comparisons that cover the text, image and vector graphics components of documents, and in prototyping the appraisal framework with integrity verification. We present example results of grouping, ranking and integrity verification to illustrate accuracy improvements and automation of appraisal. We have also identified statistical summarization of documents as suitable for parallel execution on computer clusters using the Map and Reduce programming paradigm. This work reports how computation speed depends on the number of cluster nodes, the number of map and reduce operations per node, the partitioning of documents, and the hardware specifications of the clusters.
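
As an illustration of why statistical summarization fits the Map and Reduce paradigm, the following minimal Python sketch computes term-frequency summaries per document and compares two summaries. It is not the framework described in this abstract: it covers only the text component (image and vector graphics objects are omitted), it runs sequentially rather than on a cluster, and the function names, the whitespace tokenization and the choice of cosine similarity are all assumptions made for the example.

```python
from collections import Counter
from functools import reduce
from math import sqrt


def map_partition(doc_id, text):
    """Map step (hypothetical): summarize one partition of a document's text
    as a term-frequency Counter, keyed by the document it belongs to."""
    return doc_id, Counter(text.lower().split())


def reduce_counts(left, right):
    """Reduce step (hypothetical): merge two partial term-frequency summaries."""
    return left + right


def summarize(partitions):
    """Run the map step on every (doc_id, text) partition, group the results
    by document, and reduce each group to one summary per document.
    On a cluster the map calls would run in parallel; here they run in a loop."""
    grouped = {}
    for doc_id, text in partitions:
        key, counts = map_partition(doc_id, text)
        grouped.setdefault(key, []).append(counts)
    return {key: reduce(reduce_counts, parts, Counter())
            for key, parts in grouped.items()}


def cosine_similarity(a, b):
    """Compare two term-frequency summaries (assumed similarity measure)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Document "a" is split into two partitions; document "b" is one partition.
    partitions = [
        ("a", "pump valve"),
        ("a", "valve seal"),
        ("b", "pump valve valve seal"),
    ]
    summaries = summarize(partitions)
    print(summaries["a"])
    print(cosine_similarity(summaries["a"], summaries["b"]))
```

Because the reduce step is an associative and commutative merge of counts, the partial summaries can be combined in any order, which is what allows the work to be split across nodes and across several map and reduce operations per node.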