BasiLex: an 11.5 million words corpus of Dutch texts written for children
Publication year
2014
Number of pages
18 p.
Source
Computational Linguistics in the Netherlands Journal, 4, (2014), pp. 191-208
Publication type
Article / Letter to editor
Organization
SW OZ BSI OLO
Communicatie- en informatiewetenschappen
Journal title
Computational Linguistics in the Netherlands Journal
Volume
vol. 4
Languages used
English (eng)
Page start
p. 191
Page end
p. 208
Subject
ADNEXT (Adaptive Information Extraction over Time); Language & Speech Technology; Language in Society; Learning and Plasticity; Nederlab
Abstract
This article discusses BasiLex, a recently finalized corpus of 13.5 million tokens (11.5 million Dutch words) of written language offered to children of elementary school age. The corpus has been automatically analyzed at the levels of part-of-speech tagging and lemmatization, and a limited number of polysemous words have been partly automatically disambiguated. In addition, a lemma-based lexicon has been derived. The aim of the present article is threefold: first, to describe BasiLex and how it was built, and to discuss its validity; second, to compare the BasiLex lexicon with two other lexicons with respect to their most frequent words: the Schrooten and Vermeer (1994) lexicon, based on a small and now outdated Dutch corpus of language addressed to children, and a lexicon derived from SoNaR, a corpus of adult written language (Oostdijk et al. 2013); third, to discuss some potential educational applications of BasiLex.
This item appears in the following Collection(s)
- Academic publications [238441]
- Electronic publications [122537]
- Faculty of Arts [29387]
- Faculty of Social Sciences [29483]
- Open Access publications [97529]