About

Figure 1: Sketch of the GeoLearn functionality: Ingestion and Pre-processing (top), Integration (middle), Data driven modeling (bottom).

Geospatial Image To Learn (GeoLearn)

The motivation for developing Geospatial Image To Learn (GeoLearn) comes from hydroclimatology and terrestrial hydrology. Research and development in this area addresses scientific questions about the causes and consequences of hydrologic variables through phenomenology, modeling, and synthesis.

GeoLearn has been prototyped as a novel simulation and exploratory environment for prediction modeling from remote sensing imagery and large geospatial raster and vector data. The GeoLearn framework can read data sets from local and remote sites; extract features such as slope from elevation; mosaic tiles; perform quality assurance of remotely sensed images; integrate images; spatially select pixels by masking with boundaries, geo-points, maps with categorical variables, thresholded maps with continuous variables, or painted regions using primitives; extract pixels over a mask; perform data-driven modeling using machine learning techniques; provide interpretation of models in terms of variable relevance; and visualize a variety of input, output, and intermediate data.

Geospatial Image To Learn can be viewed as an encapsulated workflow for:

  • loading multiple raster files (images) from local or remote locations,
  • pre-processing the files based on quality control requirements and performing feature extractions from them if necessary,
  • integrating and mosaicking all raster data sets to form a stack with consistent spatial and temporal resolution as well as geographic projection,
  • loading other files (boundaries, points or images) to create a mask for pixel selection purposes,
  • integrating the existing stack of raster images with other masking information,
  • selecting boundaries or image regions of interest and extracting variables from the stack of images,
  • performing data-driven modeling of selected input and output variables,
  • analyzing the data-driven model to assign a relevance coefficient to input variables, and
  • mapping the data-driven model at a pixel level to the spatial domain.

All aforementioned steps are supported by visualizations (gray-scale or pseudo-color) of input, intermediate, and output data sets, as well as the data models. An overview of the functionality is provided in Figure 1.

GeoLearn is a Java-based, open source desktop application because the primary users of the GeoLearn exploratory framework are scientists without access to high performance computing resources. This design allows users to run the code on any platform and to modify and extend the framework. However, the open source approach is effective only for those components that were developed by the authors or were leveraged as open source codes. The tradeoffs between resources and functionality led to the use of third party software.

The majority of the code is based on our own Image to Learn (Im2Learn) software, which performs out-of-core image representation, data integration, image manipulation, and visualization, with additional calls to:

  • HDF5 library developed by NCSA,
  • ArcGIS Engine developed by ESRI (data integration and application of quality control (QC) masks),
  • MODIS MRT (data integration),
  • OPeNDAP (remote file access), and
  • Data To Knowledge (D2K, Version 4.1.2) developed by NCSA to perform decision tree modeling.

Figure 2 shows the GeoLearn workflow interface, which guides a user through five major steps: Load Raster, Create Mask, Attribute Selection, Modeling, and Visualization.
Multiple file formats (HDF, netCDF, GeoTIFF, DEM, SRTM) can be loaded. Additionally, the files can be retrieved not only from a desktop disk or a networked disk but also from a remote site using the OPeNDAP protocol (e.g., a NASA DAAC). The software also performs feature extraction (from elevation to slope, aspect, flow direction, and cumulative flow) as well as quality assurance and quality control (QA/QC).
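
The elevation-to-slope feature extraction can be illustrated with a minimal sketch. It assumes square cells and uses central finite differences; the function name `slope_degrees`, the `cell_size` parameter, and the sample elevations are hypothetical, and GeoLearn itself may compute slope differently.

```python
import math

def slope_degrees(dem, cell_size):
    """Return slope angles (degrees) for the interior cells of dem.

    dem is a list of equal-length rows of elevations. Border cells are set
    to None because a central difference needs neighbors on both sides.
    """
    rows, cols = len(dem), len(dem[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Elevation change per unit distance in the x and y directions.
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2.0 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2.0 * cell_size)
            out[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return out

# Hypothetical 3x3 elevation tile (meters) rising 10 m per 10 m cell eastward.
dem = [[0.0, 10.0, 20.0],
       [0.0, 10.0, 20.0],
       [0.0, 10.0, 20.0]]
slope = slope_degrees(dem, cell_size=10.0)
```

In this example the center cell rises one meter per meter of horizontal distance, so its slope is 45 degrees. Aspect could be derived from the same two gradients.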


Figure 2: GeoLearn workflow interface. The ellipse highlights the total memory footprint (around 30 MB) for a large number of ingested files using the out-of-core data representation approach. Large input files are represented as multiple tiles (data chunks) that are paged in and out of a desktop computer's RAM. Thus, even files that do not fit into RAM can still be processed on a desktop computer by keeping only a small portion of the data in RAM at any time.
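
The paging of tiles in and out of RAM can be sketched as a small least-recently-used cache. This is an illustration only: the class `TileCache`, the `fake_loader` stand-in for disk reads, and the eviction policy are assumptions and are not taken from the Im2Learn code.

```python
from collections import OrderedDict

class TileCache:
    """Keep at most max_tiles tiles in memory, evicting the least recently used."""

    def __init__(self, load_tile, max_tiles):
        self._load_tile = load_tile      # callable: (row, col) -> tile data
        self._max_tiles = max_tiles
        self._tiles = OrderedDict()      # (row, col) -> tile, oldest first
        self.loads = 0                   # number of reads from backing storage

    def get(self, row, col):
        key = (row, col)
        if key in self._tiles:
            self._tiles.move_to_end(key)        # mark as most recently used
            return self._tiles[key]
        tile = self._load_tile(row, col)        # page the tile in
        self.loads += 1
        self._tiles[key] = tile
        if len(self._tiles) > self._max_tiles:
            self._tiles.popitem(last=False)     # page the oldest tile out
        return tile

    def resident(self):
        """Return the keys of the tiles currently held in memory."""
        return list(self._tiles)

# Stand-in for reading a tile from disk: a 2x2 block filled with row*10 + col.
def fake_loader(row, col):
    return [[row * 10 + col] * 2 for _ in range(2)]

cache = TileCache(fake_loader, max_tiles=2)
cache.get(0, 0)
cache.get(0, 1)
cache.get(0, 0)      # served from memory, no new load
cache.get(1, 0)      # evicts (0, 1), the least recently used tile
```

With `max_tiles` fixed, the memory footprint stays bounded regardless of how large the underlying raster is; only the number of loads grows.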

GeoLearn enables several tradeoff studies related to data integration in terms of the projection and spatial resolution parameters of the integrated data sets. Different temporal integration schemes have been implemented. Similarly, spatial integration uses different interpolation schemes. Examples of two types of spatial data integration are shown in Figure 3.


Figure 3. Examples of mosaicking integration (left) and coordinate system integration (right).
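
Mosaicking integration can be sketched as placing tiles into one larger grid. The sketch below assumes that all tiles already share the same projection and cell size, so no interpolation is involved; the function `mosaic`, the cell offsets, and the `nodata` fill value are hypothetical.

```python
def mosaic(tiles, nodata=None):
    """Merge tiles into one grid.

    tiles is a list of (row_offset, col_offset, grid) entries, with offsets
    given in cells from the top-left corner of the mosaic. Where tiles
    overlap, later tiles overwrite earlier ones. Cells covered by no tile
    keep the nodata value.
    """
    height = max(r0 + len(grid) for r0, _, grid in tiles)
    width = max(c0 + len(grid[0]) for _, c0, grid in tiles)
    out = [[nodata] * width for _ in range(height)]
    for r0, c0, grid in tiles:
        for r, row in enumerate(grid):
            for c, value in enumerate(row):
                out[r0 + r][c0 + c] = value
    return out

# Two hypothetical 2x2 tiles side by side and a 1x2 strip below the west tile.
west = [[1, 2], [3, 4]]
east = [[5, 6], [7, 8]]
south = [[9, 9]]
merged = mosaic([(0, 0, west), (0, 2, east), (2, 0, south)], nodata=-1)
```

Here `merged` is a 3x4 grid whose lower-right corner is filled with the `nodata` value because no tile covers it. Coordinate system integration would additionally require reprojection and resampling, which this sketch does not cover.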

GeoLearn provides five masking methods for pixel subset selection: boundary-based (Shapefile), point-based (Table), categorical map-based (Categorical), continuous map-based (Threshold), and user-defined (User defined), as well as any Boolean combination of already created masks. Using the masking functionality, one can explore and compare region-based models or point-based models with or without constraints (e.g., a land cover label or an elevation range).


Figure 4. Five masking methods for pixel subset selection and extraction.
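
A Boolean combination of masks followed by pixel extraction can be sketched as follows. The helper names, the tiny 2x2 rasters, and the land cover labels are all hypothetical; only the threshold and categorical mask types are shown.

```python
def threshold_mask(grid, low, high):
    """Select pixels of a continuous map whose value lies in [low, high]."""
    return [[low <= v <= high for v in row] for row in grid]

def categorical_mask(grid, labels):
    """Select pixels of a categorical map whose label is in labels."""
    return [[v in labels for v in row] for row in grid]

def mask_and(a, b):
    return [[x and y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mask_or(a, b):
    return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mask_not(a):
    return [[not x for x in row] for row in a]

def extract(grid, mask):
    """Return the values of grid at the pixels selected by mask, row by row."""
    return [v for row, mrow in zip(grid, mask) for v, m in zip(row, mrow) if m]

# Hypothetical co-registered 2x2 rasters.
elevation = [[100, 250], [400, 900]]
land_cover = [["forest", "crop"], ["forest", "water"]]
ndvi = [[0.8, 0.5], [0.7, 0.1]]

mid_elevation = threshold_mask(elevation, 200, 500)
forest = categorical_mask(land_cover, {"forest"})
selected = extract(ndvi, mask_and(forest, mid_elevation))
```

Only one pixel is both forest and within the 200 to 500 elevation range, so `selected` holds a single NDVI value. Boundary-based and point-based masks would need rasterization of vector data first, which is outside this sketch.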

It is often not known how earth observations are related and how those relationships vary over space and time. Similarly, in terms of modeling, no single machine learning technique is superior in all cases. GeoLearn enables users to compare data-driven modeling results obtained by exploring multiple possible relationships among variables and by investigating regression tree, support vector machine, and k-nearest neighbor machine learning algorithms.
In addition, we designed a methodology and an algorithm for ranking input variables based on their relevance for predicting output variables. Figure 5 illustrates not only the process of data-driven modeling and visualization but also the assistance in interpreting the relevance as a function of space (Figure 5, bottom right). In this case, vegetation greenness (NDVI) is predicted from the leaf area index (LAI), the fraction of photosynthetically active radiation (FPAR) absorbed by the plant canopy, and snow cover. As expected, LAI is ranked with the highest relevance at the majority of pixels, and snow cover is never ranked with the highest relevance. This type of analysis was used to validate the correctness of relevance assignment based on our prior knowledge.
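
The idea of ranking input variables by relevance can be illustrated with a generic stand-in technique; the GeoLearn relevance algorithm itself is not described here. The sketch uses k-nearest neighbor regression and scores each variable by how much the leave-one-out prediction error grows when that variable is left out. All function names and the six samples are synthetic assumptions, with NDVI constructed to depend on LAI only.

```python
def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training rows closest to query."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return sum(train_y[i] for i in order[:k]) / k

def loo_error(x, y, k):
    """Mean absolute leave-one-out prediction error."""
    total = 0.0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        total += abs(knn_predict(rest_x, rest_y, x[i], k) - y[i])
    return total / len(x)

def relevance(x, y, names, k):
    """Score each variable by the error increase when it is left out."""
    base = loo_error(x, y, k)
    scores = {}
    for j, name in enumerate(names):
        reduced = [row[:j] + row[j + 1:] for row in x]
        scores[name] = loo_error(reduced, y, k) - base
    return scores

# Synthetic samples: columns are LAI, FPAR, snow cover; target is NDVI.
names = ["LAI", "FPAR", "snow cover"]
x = [[1.0, 0.30, 0.0], [2.0, 0.10, 1.0], [3.0, 0.45, 0.0],
     [4.0, 0.20, 1.0], [5.0, 0.55, 0.0], [6.0, 0.05, 1.0]]
y = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

scores = relevance(x, y, names, k=1)
most_relevant = max(scores, key=scores.get)
```

With this synthetic data LAI receives the largest score, while removing FPAR or snow cover leaves the error essentially unchanged. In practice the variables would need to be normalized before computing distances, and the same scoring could be repeated per pixel neighborhood to obtain relevance as a function of space.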


Figure 5a. Illustration of regression tree modeling. Example of the GeoLearn environment.


Figure 5b. Input variable relevance analysis and visualization of modeling error (left) and relevance results (right) in spatial domain.

Projects using GeoLearn software:

Learning from Geospatial Data and Images
Algal Biomass Prediction


People, Publications, Presentations

Geospatial Image To Learn was created as a joint collaboration between the Civil and Environmental Engineering Department (CEE) at the University of Illinois at Urbana-Champaign (UIUC) and the National Center for Supercomputing Applications (NCSA) at UIUC.

Team members

  • Praveen Kumar
    Civil and Environmental Engineering, UIUC
  • Peter Bajcsy
    Research group ISDA, National Center for Supercomputing Applications, UIUC
  • Chulyun Kim
    ISDA, National Center for Supercomputing Applications, UIUC
  • Qi Li
    ISDA, Computer Science Department, UIUC
  • Vikas Mehra
    Civil and Environmental Engineering (CEE), UIUC
  • Rob Kooper
    ISDA, National Center for Supercomputing Applications, UIUC
  • Wei-Wen Feng
    ISDA, National Center for Supercomputing Applications, UIUC
  • Yakov Keselman
    National Center for Supercomputing Applications, UIUC
  • Pratyush Sinha
    Civil and Environmental Engineering, UIUC
  • Additional help with the software requirement specifications and code release came from Amanda White (CEE), Ben Ruddel (CEE), and Tim Nee (NCSA).

Acknowledgments

Funding support was provided by the National Aeronautics and Space Administration (NASA), the National Archives and Records Administration (NARA), and the National Science Foundation (NSF).
The NASA project was led by the principal investigators Praveen Kumar and Peter Bajcsy. Praveen Kumar served as the principal investigator for the NSF project. The NARA project was led by the principal investigator Peter Bajcsy.

Publications and Presentations

  • Peter Bajcsy and Praveen Kumar, "Cyber-Infrastructure for Critical Zone Explorations in Intensively Managed Landscapes."
    Critical Zone Observatory workshop, Penn State, September 16-18, 2007 presentation on Cyber-Infrastructure.
  • P. Bajcsy, C-Y. Kim, Q. Li, R. Kooper, V. Mehra, R. Robertson and P. Kumar "GeoLearn: An Exploratory Framework for Extracting Information and Knowledge from Remote Sensing Imagery", 32nd International Symposium on Remote Sensing of Environment Sustainable Development Through Global Earth Observations, June 25-29, 2007, San Jose, Costa Rica
  • R.D. Robertson, V. Mehra, P. Kumar, P. Bajcsy and D. Tcheng, "Understanding Monthly Land Surface Relationships at the Continental Scale Using Remotely Sensed Data", American Geophysical Union (AGU), San Francisco, December 11-15, 2006, Abstract B31A-1060
  • Peter Bajcsy, Wei-Wen Feng, and Praveen Kumar, "Relevance Assignment and Fusion of Multiple Learning Methods Applied to Remote Sensing Image Analysis" 2nd IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT 2006), Pasadena, CA, July 17-21 2006.
  • P. Bajcsy, "Data Processing and Analysis.", Section IV (pp. 258-378) in Hydroinformatics: Data Integrative Approaches in Computation, Analysis, and Modeling, eds. P. Kumar, J. Alameda, P. Bajcsy, M. Folk and M. Markus, vol. 2, p534, CRC Press LLC 2006.
  • P. Kumar, P. Bajcsy, D. Tcheng, D. Clutter, V. Mehra, W-W Feng, P. Sinha and A. White, "Using D2K Data Mining Platform for Understanding the Dynamic Evolution of Land-Surface Variables.", 2005 Earth-Sun System Technology Conference, University of Maryland, MD, June 28-30, 2005.
  • P. Kumar, P. Bajcsy, D. Tcheng, D. Clutter, V. Mehra, W-W Feng, P. Sinha and A. White, "Using D2K Data Mining Platform for Understanding the Dynamic Evolution of Land-Surface Variables.", Proceedings of the 2005 Earth-Sun System Technology Conference, University of Maryland, MD, June 28-30, 2005.
  • P. Kumar, P. Bajcsy, D. Tcheng, D. Clutter, V. Mehra, W-W. Feng, P. Sinha and A. White, "Data Driven Discovery", Status Report Hydrologic Information System, CUAHSI Universities allied for water research, Editor: David Maidment, Version 1, September 15, 2005, pp.188-203.
