Automatic classification and cropping of very large collections of scanned paper documents.
The papers represent a large collection of the incoming and outgoing correspondence of Abraham Lincoln. The ultimate goal of scanning them is to provide
an online research and reference work that offers the color images, transcriptions, and editorial matter for these historical documents,
including those from Lincoln's extensive legal practice.
To preserve color information, the paper documents are scanned together with a color scale bar, e.g., the Kodak Q13 Color Separation Guide.
While the color scale bars are important for reproduction and for preserving information about true color, brightness and contrast, they may have to be
removed from the on-line electronic documents. The average image size is about 150 MB. Currently, there are about 23,000 scanned document images (3.45 TB),
but the expected total is 200,000 to 300,000 images (up to 45 TB).
The collection is characterized by a large variability in paper and ink colors across image scans and in the density of writing (text to background
ratio). Finally, the position of the color scale bar differs among subsets of the documents.
Automated Extraction of Areas of Interest from Scanned Documents of Abraham Lincoln Writings.
In our work, we design algorithms for automatic classification of images containing the documents with or without additional background patterns.
Then, the images with background are automatically analyzed to identify the crop region containing only the documents.
Finally, the algorithms will be applied to a large volume of images that will consist of 10^3 to 10^5 pages, with each page equal
to about 150 MB.
The cropped images will be available on-line via the Illinois Historic Preservation Agency
and the Abraham Lincoln Presidential Library and Museum in the future.
We define an "ideal document image" by the following attributes: the paper region is rectangular, the paper color is constant,
the ink color of the handwriting is constant, the handwriting is distributed "uniformly" over the paper, the paper covers most of the image area, including
its center, and the background color differs from the paper color.
There may be a color scale bar placed at the top, bottom, left or right side of the image, which does not overlap with the paper region.
The algorithm is divided into three main stages: training,
classification and cropping.
Hue and Intensity (Value) components of the histogram.
Before the training and classification stages start, every image is sliced into MxN tiles, and each tile is converted to HSV color space and analyzed
to generate its Hue distribution histogram. All the following stages operate on the tile histograms rather than on the image pixels, which reduces
the amount of data to process.
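The tiling step can be illustrated with a minimal sketch. It assumes the image is already in memory as rows of 8-bit (r, g, b) tuples; the function name `tile_hue_histograms` and the number of histogram bins are hypothetical choices for illustration, not details taken from the project.

```python
import colorsys

def tile_hue_histograms(image, m, n, bins=16):
    """Split an RGB image into m x n tiles and return a normalized
    Hue histogram for each tile.

    image: list of rows, each row a list of (r, g, b) tuples in 0..255.
    The bin count is an assumed parameter, not taken from the source.
    """
    height, width = len(image), len(image[0])
    grid = []
    for i in range(m):
        r0, r1 = i * height // m, (i + 1) * height // m
        row_hists = []
        for j in range(n):
            c0, c1 = j * width // n, (j + 1) * width // n
            hist = [0] * bins
            for y in range(r0, r1):
                for x in range(c0, c1):
                    r, g, b = image[y][x]
                    # colorsys returns hue in [0, 1); saturation and value unused here
                    h, _s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
                    hist[min(int(h * bins), bins - 1)] += 1
            total = sum(hist) or 1  # guard against empty tiles
            row_hists.append([count / total for count in hist])
        grid.append(row_hists)
    return grid
```

Each tile is reduced from thousands of pixels to a short vector of bin frequencies, which is where the reduction in data volume comes from.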
The Hue, Saturation, Value (HSV) histogram is preferred because of its intuitive color description. In our case the paper background color ranges from white
through faded yellow and pink to brown. Similarly, the ink color ranges over black, brown, blue and red. In this description Hue is the most important component,
which leads to computational efficiency and to more accurate results.
The training stage calculates the parameters needed for classification and cropping. A residual
(correlation) factor R is computed from the normalized histograms
of two tiles, a center tile and an off-center tile, in order to decide whether the off-center tile belongs to the ink and paper region or should be left out. Factor R
ranges over the color distribution difference interval [0,2], with 0 for an equal color distribution in both tiles and 2 for the maximum difference in
the Hue space.
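The exact formula for R is not given here. One measure consistent with the stated range is the sum of absolute bin differences between two normalized histograms, which is 0 for identical distributions and 2 for distributions with no overlap. The sketch below uses that measure as an assumption; `residual_factor` is a hypothetical name.

```python
def residual_factor(hist_a, hist_b):
    """Sum of absolute differences between two normalized histograms.

    Assumed form of R: 0 when the distributions are equal,
    2 when they share no occupied bin.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


# Example: a center tile compared with a similar and a dissimilar tile.
center = [0.5, 0.5, 0.0, 0.0]
similar = [0.5, 0.25, 0.25, 0.0]
different = [0.0, 0.0, 0.5, 0.5]
r_similar = residual_factor(center, similar)      # 0.5
r_different = residual_factor(center, different)  # 2.0
```

With this choice, the midpoint R = 1 corresponds to two tiles sharing exactly half of their histogram mass.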
The training stage can take a subset of the images (e.g. images without the color bar) and estimate the interval of R. Another option is to skip
the training stage and set R to a value known a priori.
The classification stage determines the subset of images that need to be cropped, again based on the values of the factor R.
The cropping stage is applied to the images classified as "not just paper and ink" (mainly images with the color bar). Its objective
is to estimate an ideal cropping area and to remove the background and color scale palettes. The cropping area boundary is simply a rectangle aligned with the
image borders.
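As an illustration of an axis-aligned cropping rectangle, the sketch below takes a grid of per-tile paper flags and returns the pixel bounding box of the tiles marked as paper. Using the bounding box of the paper tiles is an assumption about how the rectangle is estimated, and `crop_rectangle` is a hypothetical name.

```python
def crop_rectangle(paper_mask, height, width):
    """Return (top, bottom, left, right) pixel bounds of the smallest
    border-aligned rectangle covering all tiles flagged as paper.

    paper_mask: m x n grid of booleans, True where the tile is paper.
    height, width: image size in pixels.
    Returns None if no tile is flagged.
    """
    m, n = len(paper_mask), len(paper_mask[0])
    rows = [i for i in range(m) if any(paper_mask[i])]
    cols = [j for j in range(n) if any(paper_mask[i][j] for i in range(m))]
    if not rows or not cols:
        return None
    top = rows[0] * height // m
    bottom = (rows[-1] + 1) * height // m
    left = cols[0] * width // n
    right = (cols[-1] + 1) * width // n
    return top, bottom, left, right


# Example: a color bar occupying the rightmost column of a 4 x 4 tile grid.
mask = [[True, True, True, False] for _ in range(4)]
bounds = crop_rectangle(mask, height=400, width=800)  # (0, 400, 0, 600)
```

The resolution of the crop boundary is limited by the tile size, so a finer tile grid gives a tighter rectangle at the cost of more histograms.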
Above: The document is converted to tiles and the HSV histogram of each tile is computed. Using the normalized histograms
of two tiles, a center tile and an off-center tile, the residual factor R is obtained for each pair. We postulate that the center region of the image always belongs to the paper
content. Only the tiles with "similar" histograms are detected as paper tiles; the others are treated as color bars. The former are represented by the blue R points,
the latter by the red points. The left image shows the tiles and their R bar representation.
We have implemented four versions of the classification algorithm, with or without the training stage and with different classification criteria
for the factor R:
- The training input is a set of images without the color bar. Classification crops a tile if R is larger than the maximum value
Rmax of all tiles of the training images.
- No training. Factor Rmax is estimated for each image separately as the maximum value over the 4 tiles that share a vertex at
the center of the image. Classification is R > Rmax.
- No training. Rmax is set to the middle point of the difference color distribution interval; Rmax = 1 and hence the
Classification is R > 1.
- No training. Rmax = 1. Classification is R > 1. Additionally the image itself is classified "must be cropped" if
at least 12.5% of its tiles are "croppable" (R > 1).
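The fourth criterion can be sketched as follows, given the R values of all tiles of one image. The function names are hypothetical; the threshold Rmax = 1 and the 12.5% fraction come from the list above.

```python
def croppable_tiles(r_values, r_max=1.0):
    """Flag each tile whose residual factor exceeds r_max."""
    return [r > r_max for r in r_values]


def must_be_cropped(r_values, r_max=1.0, min_fraction=0.125):
    """Classify an image as "must be cropped" when at least
    min_fraction of its tiles are croppable (R > r_max)."""
    flags = croppable_tiles(r_values, r_max)
    return sum(flags) / len(flags) >= min_fraction


# Example: 16 tiles, two of which differ strongly from the center.
r_values = [0.2] * 14 + [1.5, 1.8]
needs_crop = must_be_cropped(r_values)  # 2 / 16 = 12.5%, so True
```

The image-level fraction keeps a few isolated outlier tiles, such as a dense ink blot, from sending an otherwise clean image to the cropping stage.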
The four versions were tested on a subset of 348 images. The training set for the first algorithm consisted of 174 files without the color bar.
The most accurate algorithm (the first one) gives almost 100% accuracy. Its main disadvantage is that the training process is analogous to the classification
process, so the computation does the same work twice (crop center, tile image, generate histograms, compare histograms, ...).
The algorithm without training (the fourth) gives 96.6% accuracy. Optimization of the process is still in progress.