Algorithms and Software for Knowledge Acquisition from Semantically Heterogeneous, Distributed, Autonomous Data Sources

Vasant Honavar, Department of Computer Science, Iowa State University, Ames, IA

Monday, Oct 4, 2004, 4:10 PM
3
19 Snedecor

The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. However, the massive size, semantic heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles to acquiring useful knowledge from such data. In this talk, I will introduce some of the algorithmic and statistical inference problems that arise in such a setting. I will describe algorithms for learning classifiers from distributed data that offer rigorous performance guarantees (relative to their centralized or batch counterparts). I will describe how this approach can be extended to work with semantically heterogeneous data sources, in some important cases that arise in practice, by making explicit the ontologies (attributes and relationships between attributes) associated with the data sources. This allows user- or context-dependent exploration of semantically heterogeneous data sources. Some of the proposed algorithms have been implemented as part of INDUS, an open-source software package for collaborative discovery from semantically heterogeneous, distributed, autonomous data sources. The talk will touch upon some statistical problems that arise in this setting and conclude with an outline of some directions for further research in this area.


Much of this work has been carried out in collaboration with members of the ISU Artificial Intelligence Research Laboratory and has been funded in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM 0066387).