HPSG 2013

Linguistic research with large annotated web corpora

Felix Bildhauer and Roland Schäfer (Freie Universität Berlin)

Monday, 26.08.2013, FU Berlin, Habelschwerdter Allee 45, Seminarzentrum, Room: L 115

Start: 9:30
Break: 11:00 - 11:30
Lunch & coffee break: 13:00 - 14:30
End: 16:00

Course material:

The World Wide Web is most likely the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. For example, we have created linguistically annotated giga-token (GT) web corpora for various languages (Dutch 2.5 GT, English 3.9 GT, French 4.3 GT, German 9.1 GT, Spanish 1.6 GT, Swedish 2.3 GT) and are still in the process of creating new corpora (Danish, Japanese, Portuguese, etc.), as well as improving the existing ones.
However, anyone who needs to do serious work with web corpora should be aware of their characteristics (and limitations), which depend to a considerable extent on a number of decisions taken while building them. The first aim of this tutorial is to illustrate the various steps that lead from data collection on the web to the final, linguistically annotated corpus, highlighting the stages where crucial decisions have to be made and how these may be reflected in the corpus.
The second part of this tutorial is a hands-on introduction to the use of the Open Corpus Workbench (a piece of software well suited to storing and querying very large corpora), with special attention to its integration with the R statistics environment. We use our own web corpora for the demonstration.
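To give a rough idea of what annotated corpus data of this kind look like, the following minimal sketch parses a tiny sample in the one-token-per-line "verticalized" text format that the Corpus Workbench reads as input, and counts lemma frequencies for a given part-of-speech class. The sample sentences, the column layout (word form, POS tag, lemma) and the function names are assumptions made purely for illustration; they are not taken from our corpora, and the tutorial itself uses the Corpus Workbench and R rather than Python.

```python
from collections import Counter

# Hypothetical sample in verticalized format: one token per line,
# tab-separated columns (assumed here: word form, POS tag, lemma),
# with XML-like tags marking structures such as sentences.
SAMPLE_VRT = """<s>
The\tDT\tthe
corpora\tNNS\tcorpus
are\tVBP\tbe
large\tJJ\tlarge
.\t.\t.
</s>
<s>
We\tPRP\twe
query\tVBP\tquery
the\tDT\tthe
corpus\tNN\tcorpus
.\t.\t.
</s>
"""


def parse_vrt(text):
    """Yield (word, pos, lemma) tuples, skipping structural tags and blank lines.

    Simplification: any line starting with '<' is treated as a structural tag.
    """
    for line in text.splitlines():
        if not line or line.startswith("<"):
            continue
        word, pos, lemma = line.split("\t")
        yield word, pos, lemma


def lemma_frequencies(tokens, pos_prefix):
    """Count lemmas of all tokens whose POS tag starts with pos_prefix."""
    return Counter(
        lemma for _, pos, lemma in tokens if pos.startswith(pos_prefix)
    )


tokens = list(parse_vrt(SAMPLE_VRT))
noun_counts = lemma_frequencies(tokens, "NN")
print(noun_counts.most_common())
```

On a real giga-token corpus, counting of this kind would be delegated to the indexed corpus and its query engine rather than done by scanning text files; the sketch only illustrates the token-level annotation that such queries operate on.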