The CLIP Colloquium Series presents...


Multi-document Database Extraction and Fusion

Gideon Mann (U. Mass)
Wednesday, December 6, 2006, 11:00am, AVW 2120

Slides

The MUC evaluations (cf. Grishman and Sundheim 1995) popularized the problem of extracting information templates, or miniature databases, from the information contained in a single document. This problem was appealing both because it filled a real user need for information and because it could be seen as a route towards language understanding (Riloff 1999). In the past few years, there has been increasing focus on information extraction from larger, multi-document corpora, but most recent research (e.g. Brin 1998) has looked at the extraction of isolated facts rather than the databases that were investigated in MUC.

This talk extends fact extraction methods to multi-document corpora and multi-fact databases, where redundancy in the corpora and interdependencies in the desired data can be exploited. One of the benefits of large corpora is that they enable training of fact extraction systems (either classification or sequence models) by example. Additionally, large corpora enable information fusion, where raw information is combined in order to improve precision. Fusion can be applied not only to isolated facts, but also across facts. Co-occurrence with a discovered fact can be used as a feature for information extraction. Alternatively, logical database constraints can be applied in a probabilistic fashion to re-rank fused facts. Experiments will be presented which apply these techniques to biographic fact extraction and to corporate succession information, where temporal ordering is a key problem. This work shows the viability of fact extraction not only as an end in itself but also as a useful stepping stone for other tasks.

About the Speaker

Gideon Mann attended Brown University as an undergraduate and graduated with an Sc.B. (with honors) in 1999. While there, he worked in the nascent BLLIP and submitted an undergraduate thesis on metaphor detection. After graduation, he went to Johns Hopkins University and was awarded a Master's degree (2004) and a Ph.D. (2006). While at Johns Hopkins, he worked on a wide variety of statistical natural language processing problems, including statistical approaches to question answering, cross-document coreference, and translation lexicon induction, before finally focusing on minimally supervised fact extraction and fusion. He currently works as a post-doctoral fellow at the University of Massachusetts/Amherst, researching semi-supervised machine learning approaches with a focus on information extraction and fusion from large document collections.


This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.