Efficient n-gram, Skipgram and Flexgram Modelling with Colibri Core
Source
Journal of Open Research Software, 4, 1, (2016), pp. 1-10, article e30
Publication type
Article / Letter to editor
Organization
Taalwetenschap
Communicatie- en informatiewetenschappen
Journal title
Journal of Open Research Software
Volume
vol. 4
Issue
iss. 1
Languages used
English (eng)
Page start
p. 1
Page end
p. 10
Subject
Aligned constructions in machine translation; Language & Speech Technology; Language in Society; Nederlab
Abstract
Counting n-grams lies at the core of any frequentist corpus analysis and is often considered a trivial matter. Going beyond consecutive n-grams to patterns such as skipgrams and flexgrams increases the demand for efficient solutions. The need to operate on big corpus data does so even more. Lossless compression and non-trivial algorithms are needed to lower the memory demands, yet retain good speed. Colibri Core is software for the efficient computation and querying of n-grams, skipgrams and flexgrams from corpus data. The resulting pattern models can be analysed and compared in various ways. The software offers a programming library for C++ and Python, as well as command-line tools.
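To make the pattern types named in the abstract concrete, the following is a minimal plain-Python sketch of n-gram and skipgram counting. It does not use Colibri Core and does not reflect its API, its compression, or its algorithms. The function names, the `GAP` placeholder, and the restriction to a single gap of fixed size per pattern are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical placeholder marking one skipped token (illustration only).
GAP = "{*}"


def count_ngrams(tokens, n):
    """Count all consecutive n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_skipgrams(tokens, n):
    """Count skipgrams of total length n that contain one gap of fixed size.

    The first and last token of each window are always kept; a contiguous
    run of inner tokens is replaced by GAP markers.
    """
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        for start in range(1, n - 1):          # first gapped position
            for end in range(start + 1, n):    # one past the last gapped position
                pattern = window[:start] + [GAP] * (end - start) + window[end:]
                counts[tuple(pattern)] += 1
    return counts


tokens = "to be or not to be".split()
bigrams = count_ngrams(tokens, 2)
skipgrams = count_skipgrams(tokens, 3)

print(bigrams[("to", "be")])
print(skipgrams[("to", GAP, "or")])
```

A naive approach like this keeps every token as a string in memory, which is the kind of cost the abstract says lossless compression and non-trivial algorithms are meant to reduce. Flexgrams, whose gaps vary in size, are not covered by this sketch.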