Seminar on Computational Learning and Adaptation


  Toward a Computational Theory of Data Acquisition

David Stork
Ricoh California Research Center
Menlo Park, CA
http://www.rii.ricoh.com/~stork

The construction of intelligent software, such as pattern classifiers, relies both on models (e.g., hidden Markov models, neural networks, decision trees, ...) and on data to train those models. While in practice data acquisition is a significant portion of the total design effort, surprisingly little research effort has been applied to the problem of acquiring and "truthing" data under realistic conditions, for instance when data is inherently noisy and human labellers have varying reliabilities or expertise.

There is both theoretical and experimental evidence that the performance of many classifier systems is limited by lack of data. I'll review justification for turning attention away from minor refinements to learning algorithms and toward the problem of acquiring and truthing very large data sets. Theoretical justification comes from an analysis of bias-variance, prior swamping, and estimation error; experimental justification comes from studies in optical character recognition and speech recognition.

These problems in data acquisition are confronted in the Open Mind Initiative, where non-expert netizens contribute data over the internet. I'll discuss algorithms for estimating contributor reliabilities, "self-policing" (in which independently chosen contributors verify the contributions of others), and outlier rejection. I'll show the optimal data acquisition policy under a simple but realistic model of contributor behavior.
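As a rough illustration of what estimating contributor reliabilities can look like, the sketch below scores each contributor by how often their label agrees with the majority label for the same item. This is a generic agreement-with-consensus baseline, not the algorithm presented in the talk; the function names, the `votes` data layout, and the example contributors are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_labels(votes):
    """Return the most common label per item.

    votes: dict mapping item id -> dict mapping contributor id -> label.
    Ties are broken by whichever tied label was seen first (an arbitrary,
    assumed choice for this sketch).
    """
    consensus = {}
    for item, by_contributor in votes.items():
        counts = Counter(by_contributor.values())
        consensus[item] = counts.most_common(1)[0][0]
    return consensus


def contributor_reliability(votes):
    """Estimate each contributor's reliability as the fraction of their
    labels that match the majority label for the same item."""
    consensus = majority_labels(votes)
    agree = defaultdict(int)
    total = defaultdict(int)
    for item, by_contributor in votes.items():
        for contributor, label in by_contributor.items():
            total[contributor] += 1
            agree[contributor] += int(label == consensus[item])
    return {c: agree[c] / total[c] for c in total}


if __name__ == "__main__":
    # Hypothetical labels from three contributors on two items.
    example_votes = {
        "item1": {"ann": "A", "bob": "A", "cy": "B"},
        "item2": {"ann": "B", "bob": "B", "cy": "B"},
    }
    print(contributor_reliability(example_votes))
```

A contributor whose estimated reliability falls well below the others could then be flagged for review, which is one simple way to approach outlier rejection. A more careful treatment would weight the consensus by the reliabilities themselves and iterate.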

We are moving into an era in which the performance of many pattern classifiers is limited by lack of sufficiently large data sets. While it differs from cost-based learning, traditional data mining, and learning with queries, the emerging computational theory of data acquisition is ripe for important theoretical advances.

(This talk reports on joint work with Chuck Lam.)


Date: Thurs., June 14

Time: 4:15-5:30PM

Place: Ventura 17

