|
This talk will lay out the connections among four classes of data-driven
problems. Many of these connections are not well understood, and the talk
will focus on what we as statistical scientists cannot yet do, rather than
on what we have done.
The problems to be treated include several in which NISS is currently
engaged:
- Data confidentiality (DC)---the need to protect data
subjects and attribute values, yet disseminate useful information.
- Data integration (DI)---combining data across multiple
databases that were not designed with DI in mind.
- Data mining (DM)---the discovery of patterns, information
and knowledge in what are almost always large, complex (and, often,
unstructured) data sets.
- Data quality (DQ)---the kinds of errors, anomalies and
other DQ problems that occur in real databases.
Interactions among these four problems pose important research challenges
for statisticians. For example, poor DQ protects confidentiality, while DI
(in the form of record linkage) is a means of breaking
confidentiality. Similarly, techniques to protect DC and poor DQ
both strongly affect the ability of DM to identify anomalous data.
The challenges will be discussed at multiple levels: abstractions, theory
and methodology, and (scalable) software tools.
|