Background
Today many corpora of news text are available in digital format, and access to corpora of historical newspapers has increased considerably. However, few of these meet present-day standards of annotation. In the slipstream of a PhD project on changes in the subjectivity of newspaper language (Vis, 2011), this project seeks to improve that situation by curating a corpus of news texts from 1950/1951 and 2002 from five major national Dutch newspapers. The texts have been annotated for part of speech, lemma, subjectivity and direct quotation.
The corpus also creates a unique opportunity to test the quality of OCR post-correction tools.
Aim
The VU-DNC project has four main aims:
- to make available to the community of researchers in the humanities a unique diachronic corpus (2 million words) of articles from five major Dutch newspapers, from 1950/1951 and 2002,
- to extend the linguistic annotation of discourse with encoding for lexico-grammatical features of subjectivity and quotations,
- to create a gold-standard benchmark for testing and training OCR post-correction tools, by aligning the uncorrected and corrected versions of the digitized printed newspaper articles from 1950/1951,
- to improve the development of metadata within CLARIN by mapping the data categories for the part of speech and lemma coding to the data category registry, and extending the ISOcat categories for the historical spelling variation, subjectivity and quotations.
Method
The diachronic corpus has been brought in line with the current standards and formats used in the STEVIN Nederlandstalig Referentiecorpus (SoNaR, under development), which has been adapted to the more general FoLiA format (documented by Van Gompel, 2012). These standards and formats have been extended with new layers of annotation. As a result, the corpus adheres to the current CLARIN infrastructure.
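To give an impression of how token-level annotation in a FoLiA-style document can be read, the following is a minimal sketch using only the Python standard library. The sample fragment is a hand-written, simplified illustration and is not taken from the corpus: the identifiers, the part-of-speech classes and the helper name `read_tokens` are assumptions, and the subjectivity and quotation layers are not shown.

```python
import xml.etree.ElementTree as ET

# FoLiA XML namespace, mapped to the prefix "f" for the path queries below.
FOLIA_NS = {"f": "http://ilk.uvt.nl/folia"}

# Hand-written, simplified fragment for illustration only (not corpus data).
# The POS classes "N" and "WW" are placeholders, not the project's tagset.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1">
        <t>kranten</t>
        <pos class="N"/>
        <lemma class="krant"/>
      </w>
      <w xml:id="example.s.1.w.2">
        <t>schrijven</t>
        <pos class="WW"/>
        <lemma class="schrijven"/>
      </w>
    </s>
  </text>
</FoLiA>"""


def read_tokens(folia_xml):
    """Return (word, pos class, lemma class) for every word element."""
    root = ET.fromstring(folia_xml)
    tokens = []
    for w in root.iterfind(".//f:w", FOLIA_NS):
        text = w.findtext("f:t", default="", namespaces=FOLIA_NS)
        pos = w.find("f:pos", FOLIA_NS)
        lemma = w.find("f:lemma", FOLIA_NS)
        tokens.append((
            text,
            pos.get("class") if pos is not None else None,
            lemma.get("class") if lemma is not None else None,
        ))
    return tokens


for token in read_tokens(SAMPLE):
    print(token)
```

For real corpus files, a dedicated FoLiA library would probably be a better choice than hand-written path queries, since it can also handle the additional annotation layers.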
A benchmark for OCR post-correction has been built on a collection of texts in the older Dutch De Vries-Te Winkel spelling. The pre- and post-correction versions of a larger part of the corpus have been aligned semi-automatically at word level by student assistants, using existing tools and algorithms developed at ILK.
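The idea of word-level alignment between an uncorrected OCR version and its corrected counterpart can be sketched as follows. This is not the ILK tooling used in the project; it is a minimal illustration based on `difflib.SequenceMatcher` from the Python standard library, and the function name `align_words` and the example sentence are invented for the purpose.

```python
from difflib import SequenceMatcher


def align_words(ocr_text, gold_text):
    """Align two versions of a text at word level.

    Returns a list of (ocr, gold) pairs. Identical words are paired one to
    one; each differing stretch is paired as a whole, so that split or
    merged words stay together.
    """
    ocr = ocr_text.split()
    gold = gold_text.split()
    matcher = SequenceMatcher(a=ocr, b=gold, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(ocr[i1:i2], gold[j1:j2]))
        else:
            # replace, delete or insert: keep the whole span as one pair
            pairs.append((" ".join(ocr[i1:i2]), " ".join(gold[j1:j2])))
    return pairs


# Invented example: "rnensch" is a misread "mensch", "te vreden" a split word.
for ocr_span, gold_span in align_words("De rnensch is te vreden",
                                       "De mensch is tevreden"):
    print(f"{ocr_span!r} -> {gold_span!r}")
# 'De' -> 'De'
# 'rnensch' -> 'mensch'
# 'is' -> 'is'
# 'te vreden' -> 'tevreden'
```

Pairs in which the two sides differ are the candidate OCR errors. In a semi-automatic workflow such as the one described above, these would be the cases passed on for manual checking.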
The resulting annotated corpus and postcorrection benchmark are available through INL's webservice [***].
References
Gompel, M. van (2012). FoLiA: Format for Linguistic Annotation. ILK Technical Report (ILK 12-03).
Vis, K. (2011). Subjectivity in news discourse. A corpus linguistic analysis of informalization. PhD Dissertation, VU Amsterdam.