The papers are a large collection of Abraham Lincoln's incoming and outgoing correspondence. To preserve color information, each paper document is scanned together with a color scale bar. The bars are not needed in the online electronic documents and can be cropped out.
The average image size is ~150 MB. There are currently about 39,000 scanned images (5.9 TB); the expected total is 200,000 to 300,000 images (34 TB).
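As a rough sanity check on the storage figures above, the arithmetic can be sketched as follows. This is only an illustration: the helper name `collection_size_tb` is hypothetical, and decimal units (1 TB = 1,000,000 MB) are an assumption, since the text does not say which convention is used.

```python
# Back-of-the-envelope storage estimate from the average image size.
# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def collection_size_tb(num_images, avg_image_mb=150):
    """Return the estimated collection size in TB (hypothetical helper)."""
    return num_images * avg_image_mb / MB_PER_TB


current_tb = collection_size_tb(39_000)   # images scanned so far
low_tb = collection_size_tb(200_000)      # lower end of the expected count
high_tb = collection_size_tb(300_000)     # upper end of the expected count

print(f"current: {current_tb:.2f} TB")
print(f"expected: {low_tb:.1f} to {high_tb:.1f} TB")
```

Under these assumptions the current collection comes to about 5.85 TB, which matches the quoted 5.9 TB, and the expected collection comes to between 30 and 45 TB, a range that contains the quoted 34 TB.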
The collection is characterized by large variability in paper and ink colors across image scans and in the density of writing (text-to-background ratio). Finally, the position of the color scale bar differs among subsets of the documents.
The algorithm is divided into 3 main stages: Training, Classify and Crop.