LOT Winterschool Course: Linguistic Research using CLARIN

Monday, January 12, 2015 - 09:00 to Friday, January 16, 2015 - 11:00

LOT Winterschool 2015: CLARIN course description

Course title: Linguistic Research using CLARIN

Teacher: Jan Odijk and guest teachers

Lectures take place each day (Jan 12 - Jan 16) from 9:00 to 11:00.


Address: Trans 10, 3512 JK Utrecht

Email address: j.odijk@uu.nl

Teacher's website: http://www.uu.nl/staff/JEJMOdijk/0

Course info

Level: Introductory

Course description: CLARIN is a research infrastructure for humanities researchers who work or want to work with digital language data. This course introduces the CLARIN infrastructure and some of its components and services that are relevant to linguists.

Why is it important to you?

CLARIN enables you

  • to easily find data, tools and services that you can use in your research;
  • to enrich your own corpus automatically with sophisticated linguistic annotations such as part-of-speech tags and full syntactic structures, to search and browse this enriched corpus, and to analyze it using these annotations;
  • to store data and tools resulting from your research, ensuring their long-term preservation and their availability to you and other researchers.

The course covers specific services and applications in the CLARIN infrastructure that contribute to these goals, with a focus on contributions made by the Netherlands CLARIN-NL project. Each session will consist of a lecture, and some will include a hands-on session in which you learn to work with CLARIN tools. Most of the tools illustrated operate on Dutch-language data.

The first and last days will consist of updated versions of the corresponding lectures in the CLARIN for Linguists course given in summer 2014. The other days will introduce data and tools from the CLARIN infrastructure that were not covered in summer 2014.

Note (2014-11-07): the day-to-day programme has changed: the contents of day 2 and day 3 have been swapped, so the MIMORE guest lecture by Sjef Barbiers is now on Wednesday (as in the programme below).

Note (2015-01-07): the day-to-day programme has changed again: the contents of day 2 have moved to day 4, and day 2 is now devoted to Search & Analysis using OpenSONAR.

The day-to-day programme will include:

Day 1: Introduction – Jan Odijk, Utrecht University

  • Part 1: Introduction to CLARIN and its context; searching for data with CLARIN (Virtual Language Observatory, Metadata Search); overview of the whole course.
  • Part 2: I will use a concrete linguistic research question to illustrate how CLARIN and the data and tools it offers can be used to improve the empirical basis for linguistic research.

Course Material: Introduction (PPTX, PDF); Searching for Data (PPTX, PDF); Illustration 1 (PPTX, PDF)

Day 2: Search and Analysis with OpenSONAR – Jan Odijk, Utrecht University

I will illustrate how the OpenSONAR interface to the SONAR Dutch text corpus (500 million tokens) enables you to search for examples containing specific words or combinations of words, and to restrict the search by their morpho-syntactic properties.

Presentation: PPTX, PDF; Scenario: PDF

Day 3: MIMORE (Microcomparative Morphosyntax Research Tool) – Sjef Barbiers, Meertens Institute / Utrecht University.

The MIMORE tool enables researchers to investigate morphosyntactic variation in the Dutch dialects by searching three related databases with a common on-line search engine. The search results can be visualized on geographic maps and exported for statistical analysis. The three databases involved are DynaSAND (the dynamic syntactic atlas of the Dutch dialects), DiDDD (Diversity in Dutch DP Design) and GTRP (Goeman, Taeldeman, van Reenen Project).

Course Material: MIMORE educational module

Day 4: Enrich your Own Corpora – Jan Odijk, Utrecht University

I will illustrate how you can enrich your own (Dutch) data with various kinds of linguistic annotation: spelling corrections, part-of-speech codes, morpho-syntactic features, full syntactic structure, co-reference relations, and more. I will also show how you can use the output of these tools in search engines so that you can search, browse and analyze your enriched data. You will be able to experiment with the tools yourself in the hands-on part of this lecture.

Presentation: PPTX, PDF

Day 5: CLARIN-compatibility and Wrap-up – Jan Odijk, Utrecht University

How can you make your data or tools CLARIN-compatible, and why would you do that in the first place? How can you store your data and tools in the CLARIN infrastructure? The role of CLARIN centres and the types of CLARIN centres in the Netherlands. Concluding overview.

Storing data in CLARIN: PPTX, PDF

Concluding Overview: PPTX, PDF

Reading list

Course readings

Lecture 1:

  • Odijk, J. (2014), ‘The CLARIN infrastructure in the Netherlands: What is it and how can you use it?’, unpublished ms., Utrecht University [pdf]

Lecture 2:

  • Oostdijk, N., Reynaert, M., Hoste, V. and Schuurman, I. (2013), ‘The Construction of a 500 Million Word Reference Corpus of Contemporary Written Dutch’, in P. Spyns and J. Odijk (eds.), Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme, Springer Verlag. [pdf] (Open Access)

Lecture 3:

Lecture 4:

  • TTNWW: Handleiding TST tools voor het Nederlands als Web services in een Workflow, Meertens Institute, 2013 [pdf]

Lecture 5:

Further readings:

Hinrichs, E. & S. Krauwer (2014), ‘The CLARIN Research Infrastructure: Resources and Tools for eHumanities Scholars’, LREC 2014 Proceedings [pdf]

Odijk, J. (2014), ‘CLARIN-NL: Major Results’, LREC 2014 Proceedings [pdf]

Odijk, J. (2014), ‘The CLARIN infrastructure in the Netherlands: Design and Construction’, unpublished ms., Utrecht University [pdf]

Uytvanck, D. van, Stehouwer, H. and Lampen, L. (2012), ‘Semantic metadata mapping in practice: the Virtual Language Observatory’. In Calzolari, N., Choukri, K., Declerck, T., Dogan, M.U., Maegaard, B., Mariani, J., Odijk, J. and Piperidis, S. (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA), pp. 1029-1034. [pdf]


UvA Amsterdam
P.C. Hoofthuis
Spuistraat 134
1012 VB  Amsterdam