|Toine Pieters||Stephen Snelders||Daan Odijk||Fons Laan|
|Universiteit Utrecht||Universiteit van Amsterdam||Universiteit van Amsterdam|
This project aims to use and populate the basic CLARIN infrastructure to enable advanced forms of text mining in large historical datasets of newspapers and journals. The challenge is to convert a specific text mining technology, so-called 'sentiment mining', into an accessible, CLARIN-compliant web application that addresses the research questions of the intended user group of historians and policy researchers. The demonstrator will build on the sentiment mining tools developed in the STEVIN DuOMAn project. The interdisciplinary project team (historians, linguists, computer scientists) will tailor existing tools to the specific needs of digital humanities research, with a special focus on opinions and perceptions regarding the use and abuse of drugs between 1900 and 1945. The development of this demonstrator prototype will also be used to draw up a list of requirements the CLARIN infrastructure should meet and desiderata it should preferably offer.
From the historian’s point of view, WAHSP aims to learn more about ‘public sentiments’ regarding drugs between 1900 and 1945. The ambition of the project is to look further than the opinions and actions of individual policy makers and law enforcers. WAHSP therefore focuses on public opinion in a broad sense, as it was formulated in the different news media. Special attention goes to the sentiments that can be extracted from these discourses, in the Netherlands but also in its overseas colonies. In this way, WAHSP aims to make visible ‘hidden’ debates around the use and abuse of drugs, which are in part to be found in articles or commentaries that do not center on drugs at all: the kinds of sentiments that are so common that they are not under discussion.
The search engine is set up to search the digital newspaper collection of the KB, the National Library of the Netherlands in The Hague. This digital archive will contain around nine million digitized pages by the end of the digitization process in December 2012, comprising newspapers from the period between 1619 and 1995. Historical newspapers are the obvious source for extracting opinions from earlier times. After all, news media have been an arena for shaping and reshaping the perception of drugs in a broad spectrum of discourses within various contexts. Using news media as the main type of primary source also enables us to take into account the diversity and fluidity of public discourses around the use and abuse of drugs.
The collaboration between the various scholars resulted in a semi-automatic and interactive open-source application that extracts relevant data from a mass of seemingly irrelevant material. The resulting user-friendly application does not replace the intuition and insights of the scholars, but supports them. The WAHSP tool is, after all, a query-based search engine, whose efficacy depends on the expertise of the scholar in coming up with valuable keywords. What sets this tool apart from ‘normal’ search engines is the functionality for analyzing the results. A quantitative analysis of the sources used can give insight into the social context in which the occurrence of words should be situated. A word cloud shows the context within which certain keywords were used. The WAHSP tool is able to yield word clouds from single articles, but also from the results of queries, in other words from possibly hundreds of articles together. It is within the word cloud and within articles that sentiments can be highlighted.
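To make the idea of a word cloud over query results concrete, the following is a minimal sketch in Python, not the WAHSP implementation. It assumes articles are plain strings, uses a tiny hypothetical stop-word list, and the function names `tokenize`, `search`, and `word_cloud_terms` are invented for this illustration. It selects the articles that contain a keyword and then counts the remaining words across all hits together, which is the kind of frequency table a word cloud would visualize.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny Dutch stop-word list (an assumption for this sketch).
STOPWORDS = {"de", "het", "een", "en", "van", "in", "is", "op", "te", "dat"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def search(articles, keyword):
    """Return the articles that contain the keyword as a whole word."""
    kw = keyword.lower()
    return [article for article in articles if kw in tokenize(article)]


def word_cloud_terms(articles, keyword, top_n=5):
    """Count context words across all given articles, skipping the keyword and stop words."""
    kw = keyword.lower()
    counts = Counter()
    for article in articles:
        for token in tokenize(article):
            if token != kw and token not in STOPWORDS:
                counts[token] += 1
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Invented sample sentences, not taken from the KB newspaper archive.
    sample = [
        "De opium handel in Batavia",
        "Opium en morfine in de apotheek",
        "Het weer in Utrecht",
    ]
    hits = search(sample, "opium")
    print(len(hits), "articles match")
    print(word_cloud_terms(hits, "opium"))
```

A real system would replace the substring-free word match with an indexed search and weight terms more carefully than by raw counts; the sketch only shows how a single query can feed one aggregated context view over many articles.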
One of the deliverables of a CLARIN project is a webserver that demonstrates the tools and/or the curated data. Through this webserver, every humanities scholar with "CLARIN-permission" must be able to use the tool and/or data over the Internet from his or her own workplace: