Linguistic research with large annotated web corpora
Felix Bildhauer and Roland Schäfer (Freie Universität Berlin)
Monday, 26.08.2013, FU Berlin, Habelschwerdter Allee 45, Seminarzentrum, Room: L 115
Start: 9:30
Break: 11:00 - 11:30
Lunch & Coffee break: 13:00 - 14:30
End: 16:00
Course material:
- Slides (1-up screen version) (session 1)
- Slides (4-up printer version) (session 1)
- Worksheet (session 2 + 3)
- Scripts (session 2 + 3)
The World Wide Web most likely constitutes the largest existing source of texts written in a great
variety of languages. A feasible and sound way of exploiting this data for linguistic research is to
compile a static corpus for a given language. For example, we have
created linguistically annotated giga-token
web corpora for various languages (Dutch 2.5 GT, English 3.9 GT, French 4.3 GT, German 9.1 GT,
Spanish 1.6 GT, Swedish 2.3 GT) and are still in the process of creating new corpora (Danish,
Japanese, Portuguese, etc.), as well as improving the existing ones.
However, anyone who wants to
do serious work with web corpora should be aware of the characteristics (and limitations) of such
corpora, which depend to a considerable extent on a number of decisions taken during their
construction. The first aim of this tutorial is to illustrate the various steps that lead from data
collection on the web to the final, linguistically annotated corpus, highlighting the stages where
crucial decisions have to be made and showing how these decisions may be reflected in the corpus.
The second
part of this tutorial is a hands-on introduction to the use of the Open Corpus Workbench (a piece of
software well suited to storing and querying very large corpora), with special attention to its
integration with the R statistics environment. We use our own web corpora for the demonstration.
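By way of illustration, queries in the Open Corpus Workbench are formulated in the CQP query language, which matches sequences of tokens by their annotated attributes. The following is a sketch only: the attribute names (pos, lemma) and tag values depend on the annotation scheme actually used in a given corpus.

```
; find an attributive adjective directly followed by the lemma "Korpus"
; (attribute names and the STTS tag "ADJA" are assumptions about the tagset)
[pos = "ADJA"] [lemma = "Korpus"];
```

Result sets from such queries can then be exported and analyzed further, e.g. in R, which is the kind of workflow demonstrated in the hands-on sessions.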