Online corpora and search tools
This page collects links to corpora available on the web and to other pages about corpora. I have also gathered short descriptions of some very useful tools for linguistic searches.
Useful tools
- First of all, there is always Google. Search the web and find the most amazing new word forms =).
- Emily Bender's corpora tool page is simply great =)!
- The Sara server allows you to log in to, e.g., the BNC and use more sophisticated search options, such as:
  - regular expressions
  - part-of-speech (POS) tags
  - distances to the right or the left
  - a form generator (e.g. search for all forms of the verb "go")
  - combinations of the above options
- Another really nice tool for long, complicated searches on the internet is KWiCFinder. It allows you to:
  - run automated searches on the internet that can run overnight
  - use extended search options that are not available on Google
  - define restrictions, such as "look at no more than 10 documents from any one domain", so that your results are filtered automatically
  - define contexts in which you are NOT interested, so that irrelevant hits are filtered out
- Be sure to check out the Linguist List's software page. It links to, among other things, concordancing software.
- The IMS Corpus Workbench page contains information about this corpus tool, which was developed at the University of Stuttgart. It also contains online demos and a couple of corpora (e.g. the UPenn Treebank, the Verbmobil dialogues, and links to corpora in Czech, Swedish, and Bosnian).
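To make some of the search options mentioned above a bit more concrete, here is a minimal Python sketch of two of them: combining a regular expression, a POS tag, and a distance to the right, and capping the number of documents per domain. This is not the actual query syntax or implementation of Sara or KWiCFinder. The toy sentence, the tag names, the example URLs, and the functions `find_near` and `cap_per_domain` are all made up for illustration, and the forms of "go" are simply spelled out by hand rather than generated.

```python
import re
from urllib.parse import urlparse

# Toy POS-tagged text as (word, tag) pairs; the tag names are illustrative only.
TAGGED = [
    ("She", "PRON"), ("went", "VERB"), ("home", "NOUN"), ("and", "CONJ"),
    ("he", "PRON"), ("goes", "VERB"), ("to", "PREP"), ("work", "NOUN"),
]

# Hand-written stand-in for a form generator: the forms of the verb "go".
GO_FORMS = r"go|goes|going|went|gone"


def find_near(tokens, word_pattern, pos, other_pos, max_dist):
    """Return (index, word, neighbour) hits.

    A hit is a token whose word matches `word_pattern` and whose tag is `pos`,
    followed within `max_dist` positions to the right by a token tagged
    `other_pos`.
    """
    rx = re.compile(word_pattern, re.IGNORECASE)
    hits = []
    for i, (word, tag) in enumerate(tokens):
        if tag != pos or not rx.fullmatch(word):
            continue
        # Only the window to the right is inspected in this sketch.
        for other_word, other_tag in tokens[i + 1 : i + 1 + max_dist]:
            if other_tag == other_pos:
                hits.append((i, word, other_word))
                break
    return hits


def cap_per_domain(urls, limit):
    """Keep at most `limit` URLs per host name, preserving the input order."""
    seen = {}
    kept = []
    for url in urls:
        host = urlparse(url).netloc
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= limit:
            kept.append(url)
    return kept


# Forms of "go" used as a verb with a noun at most two words to the right.
print(find_near(TAGGED, GO_FORMS, "VERB", "NOUN", 2))

# Hypothetical result list, restricted to two documents per domain.
print(cap_per_domain(
    ["http://a.example/1", "http://a.example/2",
     "http://b.example/1", "http://a.example/3"],
    2,
))
```

Searching to the left as well would only require a second window, `tokens[max(0, i - max_dist) : i]`, handled in the same way.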
Some commonly used corpora
English corpora
German corpora
- COSMAS I Online, maintained by the IDS in Mannheim, contains 1.7 billion text words, parts of which are POS-tagged. The corpus is subdivided into:
  - spoken language archives
  - written language archives
  - dialect corpora
  - sociolinguistically classified corpora
  - corpora from different time periods and/or different genres
Pages with further links to corpora