OGI School of Science & Engineering
 

20000 NW Walker Rd
Beaverton, OR 97006
503-748-1121

When and Where
April 11, 2005
9:00 a.m. Classroom 130, Bronson Creek Building


Are you calling me a lawyer? Mining for lexical information

Among the major obstacles to automatic processing of text is the difficulty of obtaining reliable information about the meaning and behavior of words. Machines struggle with utterances involving new usages or new words, such as "Could you IM me those numbers?" or "On reflection the numbers seem to be a load of scrat", yet human beings have little trouble. Terascale linguistics seeks to use data gleaned from large text collections to explain this human capability and to make it available to machines. The problem is especially acute in technical or rapidly changing fields, because existing dictionaries are unlikely to suffice. Therefore, it seems appropriate to explore techniques that allow lexical information to be learnt from data.

One of the standard techniques for this is to prepare, then exploit, large collections of labeled text. An example is the Penn Treebank, which contains roughly 50,000 sentences. Detailed linguistic labeling of corpora of this size, even with good tool support, is a substantial multi-year effort. Corpora created more recently are bigger still: up to gigaword size samples of English, Arabic and Mandarin are now available. The web, of course, is larger again, and exhaustive hand-labeling is completely infeasible at this scale. Therefore, my research focus is to develop techniques that can learn interesting and useful information from unlabeled or very lightly labeled corpora.

Specifically, my talk will describe the methods and results of a strand of work (joint with Mirella Lapata and Sabine Schulte im Walde) that aims to embrace and extend an existing classification of verbs. The inputs to this process include both corpus data and the linguistically motivated prior information that so-called "diathesis alternations" are likely to be relevant and important to the classification. I will argue that the prior information is crucial: even the availability of really big training corpora will not necessarily allow useful resources to be induced from corpus data alone.

Chris Brew is currently a faculty member in Computational Linguistics at Ohio State. Prior to that he worked at Edinburgh and at the European Patent Office. He works in statistical natural language processing, particularly unsupervised and semi-supervised methods. The local event host is Dr. Brian Roark, Assistant Professor in the Computer Science & Electrical Engineering Department.

