BasiLex: an 11.5 million words corpus of Dutch texts written for children
Source: Computational Linguistics in the Netherlands Journal, 4 (2014), pp. 191-208
Article / Letter to editor
SW OZ BSI OLO
Communicatie- en informatiewetenschappen
Computational Linguistics in the Netherlands Journal
Subject: ADNEXT (Adaptive Information Extraction over Time); Language & Speech Technology; Language in Society; Learning and Plasticity; Nederlab
This article discusses BasiLex, a recently finalized corpus of 13.5 million tokens (11.5 million Dutch words) of written language offered to children of elementary school age. The corpus has been automatically analyzed at the levels of part-of-speech tagging and lemmatization, and a limited number of polysemous words have been disambiguated, partly automatically. A lemma-based lexicon has also been derived. The aim of the present article is threefold. First, to describe BasiLex and how it was built, and to discuss its validity. Second, to compare the BasiLex lexicon with two other lexicons with respect to their most frequent words: the Schrooten and Vermeer (1994) lexicon, a small and now outdated Dutch corpus of language addressed to children, and a lexicon derived from SoNaR, a corpus of adult written language (Oostdijk et al. 2013). Third, to discuss some potential educational applications of BasiLex.