Human-inspired modulation frequency features for noise-robust ASR
Source: Speech Communication, 84, (2016), pp. 66-82
Article / Letter to editor
CLST - Centre for Language and Speech Technology
Communicatie- en informatiewetenschappen
Subject: Language & Speech Technology; Language in Society; Learning pronunciation variants for words in a foreign language: Towards an ecologically valid theory based on experimental research and computational modeling; Nederlab; Speech Comprehension
This paper investigates a computational model that combines a frontend based on an auditory model with an exemplar-based sparse coding procedure for estimating the posterior probabilities of sub-word units when processing noisified speech. Envelope modulation spectrogram (EMS) features are extracted using an auditory model which decomposes the envelopes of the outputs of a bank of gammatone filters into one lowpass and multiple bandpass components. Through a systematic analysis of the configuration of the modulation filterbank, we investigate how and why different configurations affect the posterior probabilities of sub-word units by measuring the recognition accuracy on a semantics-free speech recognition task. Our main finding is that representing speech signal dynamics by means of multiple bandpass filters typically improves recognition accuracy. This effect is particularly noticeable in very noisy conditions. In addition, we find that, for maximum noise robustness, the bandpass filters should focus on low modulation frequencies. This reinforces our intuition that noise robustness can be increased by exploiting redundancy in those frequency channels which have a long enough integration time not to suffer from envelope modulations that are solely due to noise. The ASR system we design based on these findings behaves more similarly to human recognition of noisified digit strings than conventional ASR systems do. Thanks to the relation between the modulation filterbank and procedures for computing dynamic acoustic features in conventional ASR systems, the finding can be used for improving the frontends in those systems.
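The decomposition of a subband envelope into one lowpass and several bandpass modulation components, as described in the abstract, can be illustrated with a minimal sketch. It is not the auditory model used in the paper: the gammatone stage is omitted, ideal FFT-domain masks stand in for the actual modulation filters, and the function name `modulation_decompose`, the 1 Hz lowpass cutoff, and the band edges are all assumptions chosen for illustration.

```python
import numpy as np


def modulation_decompose(envelope, fs, lowpass_cutoff=1.0,
                         band_edges=((1.0, 4.0), (4.0, 8.0), (8.0, 16.0))):
    """Split one subband envelope into a lowpass and several bandpass parts.

    Hypothetical helper: ideal (brick-wall) masks in the FFT domain replace
    the real modulation filters, and all cutoff values are assumed.
    Returns an array of shape (1 + len(band_edges), len(envelope)).
    """
    envelope = np.asarray(envelope, dtype=float)
    n = len(envelope)
    spectrum = np.fft.rfft(envelope)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)  # modulation frequencies in Hz

    # Lowpass component: everything below the cutoff, including DC.
    components = [np.fft.irfft(spectrum * (freqs < lowpass_cutoff), n=n)]

    # Bandpass components: one per (low, high) modulation band.
    for low, high in band_edges:
        mask = (freqs >= low) & (freqs < high)
        components.append(np.fft.irfft(spectrum * mask, n=n))

    return np.stack(components)


# Synthetic envelope sampled at 100 Hz: a constant level plus slow (3 Hz)
# and faster (12 Hz) modulations.
fs = 100
t = np.arange(200) / fs
envelope = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * t) + 0.25 * np.cos(2 * np.pi * 12 * t)
components = modulation_decompose(envelope, fs)
```

In this toy setup each row of `components` is one modulation channel for a single subband. Stacking such rows across all gammatone channels would give a feature layout in the spirit of the EMS features, under the assumptions stated above.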