Towards a Universal, Quantifiable, and Scalable File Format Converter.

Presenter: Rob Kooper, NCSA
Authors: K. McHenry, R. Kooper and P. Bajcsy

5th International IEEE eScience conference (IEEE e-Science 2009)
Oxford, UK, December 9 - 11, 2009
oral presentation

This paper addresses the problem of designing a universal file format converter. File format conversion is a necessary part of data dissemination and curation. Complete and robust converters, however, are hard to find and build because of the abundance of file formats, the fact that many formats are closed, and the complexity of individual format specifications. On the other hand, many software applications exist that can perform some degree of data conversion among a subset of the available formats.

To take advantage of this, we introduce a data structure called an I/O-Graph that stores the input and output formats supported by these applications. Based on a concept of imposed code reuse, we use this graph to develop a service, NCSA Polyglot, which is capable of performing the larger union of conversions supported by the underlying software. The Polyglot system is designed to be easily extensible, scalable with the number of conversion requests, and inclusive of all available third-party software. Given a data set of files from a particular domain, we are able to assign weights to the edges of the I/O-Graph indicating the amount of information retained during a conversion. These edge weights then allow the system to choose conversion paths with the least information loss.
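
The idea of an I/O-Graph with loss-weighted edges can be illustrated with a minimal sketch. This is not the Polyglot implementation: the class name IOGraph, its methods, the application names AppA and AppB, and the loss values are all hypothetical and chosen only for illustration. The sketch also assumes that each edge carries a non-negative loss value and that loss adds up along a chain of conversions, which the abstract does not state. Under those assumptions, the least-loss conversion path can be found with Dijkstra's algorithm:

```python
import heapq
from collections import defaultdict


class IOGraph:
    """Hypothetical I/O-Graph: vertices are file formats, edges are conversions."""

    def __init__(self):
        # source format -> list of (target format, application, loss)
        self.edges = defaultdict(list)

    def add_application(self, app, inputs, outputs, loss=None):
        """Add one edge for every input -> output conversion `app` supports."""
        loss = loss or {}
        for src in inputs:
            for dst in outputs:
                if src != dst:
                    # Assumption: unmeasured conversions get a default loss of 1.0.
                    self.edges[src].append((dst, app, loss.get((src, dst), 1.0)))

    def best_path(self, src, dst):
        """Dijkstra search for the conversion chain with the least total loss."""
        queue = [(0.0, src, [])]  # (accumulated loss, current format, steps so far)
        done = set()
        while queue:
            total, fmt, steps = heapq.heappop(queue)
            if fmt == dst:
                return total, steps
            if fmt in done:
                continue
            done.add(fmt)
            for nxt, app, edge_loss in self.edges[fmt]:
                if nxt not in done:
                    step = (app, fmt, nxt)
                    heapq.heappush(queue, (total + edge_loss, nxt, steps + [step]))
        return None  # no chain of applications connects src to dst


# Illustrative data only: two made-up applications with made-up loss values.
graph = IOGraph()
graph.add_application("AppA", ["obj"], ["stl", "ply"],
                      {("obj", "stl"): 0.4, ("obj", "ply"): 0.1})
graph.add_application("AppB", ["ply"], ["stl"], {("ply", "stl"): 0.1})

total, steps = graph.best_path("obj", "stl")
for app, src, dst in steps:
    print(f"{app}: {src} -> {dst}")
```

In this toy graph the two-step route through ply is chosen over the direct obj to stl conversion because its summed loss is lower. If the weights were instead modelled as retained fractions that multiply along a path, the same search would apply after replacing each weight with its negative logarithm.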