Key words: automatic speech recognition systems; consonant recognition; native/non-native listener/model comparisons
Speakers and VCV material

Twelve female and 12 male native English talkers aged 18 to 49 contributed to the corpus. Speakers produced each of the 24 consonants (/b/, /d/, /g/, /p/, /t/, /k/, /s/, /ʃ/, /f/, /v/, /θ/, /ð/, /tʃ/, /z/, /ʒ/, /h/, /dʒ/, /m/, /n/, /ŋ/, /w/, /ɹ/, /y/, /l/) in nine vowel contexts consisting of all possible combinations of the three vowels /i:/ (as in “beat”), /u:/ (as in “boot”), and /ae/ (as in “bat”). Each VCV was produced with both front and back stress (e.g. ‘aba vs. ab’a), giving a total of 24 (speakers) × 24 (consonants) × 2 (stress types) × 9 (vowel contexts) = 10368 tokens. Pilot listening tests were carried out to identify tokens that were unusable because of poor pronunciation or other problems. See the Technical details page for further details of the collection and post-processing procedure.

Training, development and test sets

Training material comes from 8 male and 8 female speakers, while tokens from the remaining 8 speakers are used in the independent test set. A development set will be released shortly. After removing unusable tokens identified during post-processing, the training set consists of 6664 clean tokens.

Seven test sets, corresponding to one quiet and six noise conditions, are available. Each test set contains 16 instances of each of the 24 consonants, for a total of 384 tokens. Listeners will identify consonants in each of the test conditions. At a minimum, each contribution to the special session should report results on some or all of the test sets. Scoring software will be released in February 2008.

Noise

The table shows the 7 conditions:

Test set   Noise type                       SNR (dB)
1          clean                            n/a
2          competing talker                 -6
3          8-talker babble                  -2
4          speech-shaped noise              -6
5          factory noise                    0
6          modulated speech-shaped noise    -6
7          3-speaker babble                 -3

These noise types provide a challenging and varied range of conditions.
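The figures quoted above (corpus size, number of vowel contexts, test-set size and the seven conditions) can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the variable names and the tuple layout of `TEST_CONDITIONS` are illustrative choices, not part of any released challenge software.

```python
from itertools import product

# Design of the recorded corpus, as stated in the text.
N_SPEAKERS = 24
N_CONSONANTS = 24
N_STRESS_TYPES = 2
VOWELS = ["i:", "u:", "ae"]

# Every ordered (first vowel, second vowel) pair is one VCV context.
vowel_contexts = list(product(VOWELS, repeat=2))

total_tokens = N_SPEAKERS * N_CONSONANTS * N_STRESS_TYPES * len(vowel_contexts)

# Each test set holds 16 instances of each consonant.
tokens_per_test_set = 16 * N_CONSONANTS

# (test set, noise type, SNR in dB); None marks the clean condition.
TEST_CONDITIONS = [
    (1, "clean", None),
    (2, "competing talker", -6),
    (3, "8-talker babble", -2),
    (4, "speech-shaped noise", -6),
    (5, "factory noise", 0),
    (6, "modulated speech-shaped noise", -6),
    (7, "3-speaker babble", -3),
]

noisy_conditions = [c for c in TEST_CONDITIONS if c[2] is not None]
```

Keeping the conditions in one list makes it straightforward to loop over test sets when reporting results per condition.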
Signal-to-noise ratios were determined in pilot tests with listeners, with the goal of producing similar identification scores (approximately 65-70%) in each noise condition. VCV tokens are additively embedded in noise samples of duration 1.2 s. The SNR is computed per token over the section where the speech and noise overlap. The onset time of each VCV takes one of 8 values ranging from 0 to 400 ms.

In addition, the test materials are available as “stereo” sound files, which are identical to the test sets except that the noise and the VCV tokens are in separate channels. We have made the test material available in this form to support computational models of human consonant perception that may wish to assume, for example, ideal noise processing, and to allow idealised engineering systems to be evaluated (e.g. to determine performance ceilings). Contributors should clearly state which of their results are based on the single-channel test sets and which on the dual-channel test sets.
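The token-wise SNR definition and the separate-channel layout described above can be illustrated with a short sketch. It makes several assumptions that the text does not confirm: the noise (not the speech) is rescaled to reach the target SNR, the sample rate of 16 kHz is hypothetical, the signals are synthetic stand-ins for real recordings, and the function names `mix_at_snr` and `measured_snr_db` are invented for illustration. The eight onset values are not listed in the text, so the onset is left as a plain argument.

```python
import math
import random


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db, onset):
    """Embed `speech` in `noise` starting at sample index `onset`.

    The noise is rescaled (an assumption) so that the SNR measured over the
    region where speech and noise overlap equals `snr_db`.
    Returns (mixture, speech_channel, noise_channel), all of len(noise).
    """
    end = onset + len(speech)
    if onset < 0 or end > len(noise):
        raise ValueError("speech token must fit inside the noise sample")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise[onset:end])
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and overlapping noise must be non-silent")
    # Solve 10*log10(p_speech / (gain**2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    noise_ch = [gain * x for x in noise]
    # Speech channel is silent outside the token, as in a two-channel file.
    speech_ch = [0.0] * len(noise)
    speech_ch[onset:end] = speech
    mixture = [s + n for s, n in zip(speech_ch, noise_ch)]
    return mixture, speech_ch, noise_ch


def measured_snr_db(speech_ch, noise_ch, onset, length):
    """SNR in dB over the overlap region only."""
    seg = slice(onset, onset + length)
    return 10 * math.log10(mean_power(speech_ch[seg]) / mean_power(noise_ch[seg]))


# Synthetic demonstration (all values below are illustrative).
rate = 16000                                   # hypothetical sample rate in Hz
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(round(1.2 * rate))]   # 1.2 s of noise
speech = [math.sin(2 * math.pi * 220 * t / rate) for t in range(rate // 2)]
onset = 200 * rate // 1000                     # onset at 200 ms, within 0-400 ms
mixture, speech_ch, noise_ch = mix_at_snr(speech, noise, -6.0, onset)
```

Returning the two channels alongside the mixture mirrors the single-channel and dual-channel forms of the test material: a system with access to `speech_ch` and `noise_ch` is operating in the idealised setting, and results obtained that way should be reported separately from those obtained on `mixture`.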