Seminar on Computational Learning and Adaptation


Sources of Success for Information Extraction Methods

Joseph Smarr
Symbolic Systems Program
Stanford University
jsmarr@stanford.edu

This talk examines Boosted Wrapper Induction (BWI) as an exemplar of recent rule-based information extraction (IE) techniques. We report on results from a variety of tasks (including extraction from several natural text document collections) to provide a systematic analysis of how each of BWI's algorithmic components, particularly boosting, contributes to its performance over comparable methods. We show that the benefit of boosting comes from its ability to re-weight examples in order to learn specific rules (resulting in high precision) combined with its ability to continue to learn rules after all positive examples have been covered (resulting in high recall). We also propose a new quantitative measure for the regularity of an extraction task, and show that it is a strong indicator of IE performance. Finally, we investigate the impact of exploiting grammatical and semantic information for IE in natural text domains, and we show that even limited grammatical information can improve both the regularity and performance of natural text extraction tasks.



Date: Thursday, October 25

Time: 4:15-5:30PM

Place: Cordura 100

