On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling

Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions in which longer-length acoustic models result in considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. When initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.


INTRODUCTION
Conventional large-vocabulary continuous speech recognisers use context-dependent phone models, such as triphones, to model speech. Apart from their capability of modelling (some) contextual effects, the main advantage of triphones is that the fixed number of phonemes in a given language guarantees their robust training when reasonable amounts of training data are available and when state tying methods are used to deal with infrequent triphones. When using triphones, one must assume that speech can be represented as a sequence of discrete phonemes (beads on a string) that can only be substituted, inserted, or deleted to account for pronunciation variation [1]. Given this assumption, it should be possible to account for pronunciation variation at the level of the phonetic transcriptions in the recognition lexicon. Modelling pronunciation variation by adding transcription variants in the lexicon has, however, met with limited success, in part because of the resulting increase in lexical confusability [2]. Furthermore, while triphones are able to capture short-span contextual effects such as phoneme substitution and reduction [3], there are complexities in speech that triphones cannot capture. Coarticulation effects typically have a time span that exceeds that of the left and right neighbouring phones. The corresponding long-span spectral and temporal dependencies are not easy to capture with the limited window of triphones [4]. This is the case even if the feature vectors implicitly encode some degree of long-span coarticulation effects thanks to the addition of, for example, deltas and delta-deltas, or the use of augmented features and LDA.
In an interesting study with simulated data, McAllaster and Gillick [5] showed that recognition accuracy decreases dramatically if the sequence of HMM models that is used to generate speech frames is derived from accurate phonetic transcriptions of Switchboard utterances, rather than from sequences of phonetic symbols in a sentence-independent multipronunciation lexicon. At the surface level, this implies that the recognition accuracy drops substantially if the state sequence licensed by the lexicon is not identical to the state sequence that corresponds to the best possible segmental approximation of the actual pronunciation. At a deeper level, this suggests that triphones fail to capture at least some relevant effects of long-span coarticulation. Ultimately, then, we must conclude that a representation of speech in terms of a sequence of discrete symbols is not fully adequate.
To alleviate the problems of the "beads on a string" representation of speech, several authors propose using longer-length acoustic models [4, 6-12]. These word or subword models are expected to capture the relevant detail, possibly at the cost of phonetic interpretation and segmentation. Syllable models are probably the most commonly suggested longer-length models [4, 6-12]. Support for their use comes from studies of human speech production and perception [13, 14], and the relative stability of syllables as a speech unit. The stability of syllables is illustrated by Greenberg's finding [15] that the syllable deletion rate of spontaneous speech is as low as 1%, as compared with the 12% deletion rate of phones. The most important challenge of using longer-length acoustic models in large-vocabulary continuous speech recognition is the inevitable sparseness of training data in the model training. As the speech units become longer, the number of infrequent units with insufficient acoustic data for reliable model parameter estimation increases. If the units are words, the number of infrequent units may be unbounded. Many languages (for instance, English and Dutch) also have several thousands of syllables, some of which will have very low frequency counts in a reasonably sized training corpus. Furthermore, as the speech units comprise more phones, increasingly complex types of articulatory variation must be accounted for.
The solutions suggested for the data sparsity problem are two-fold. First, longer-length models with a sufficient amount of training data are used in combination with context-dependent phone models [4, 8-12]. In other words, the recogniser backs off to context-dependent phone models when a given longer-length speech unit does not occur frequently enough for reliable model parameter estimation. Second, to ensure that a much smaller amount of training data is sufficient, the longer-length models are cleverly initialised [8-10]. Sethy and Narayanan [8], for instance, suggest initialising the longer-length models with the parameters of the triphones underlying the canonical transcription of the longer-length speech units (see Figure 1). Subsequent Baum-Welch reestimation is expected to incorporate the spectral and temporal dependencies of speech into the initialised models by adjusting the means and covariances of the Gaussian components of the mixtures associated with the HMM states of the longer-length models.
Figure 1: Syllable model for the syllable /sh ix n/, built from the triphones #-sh+ix, sh-ix+n, and ix-n+#. The model states are initialised with the triphones underlying the canonical syllable transcription [8]. The phones before the minus sign and after the plus sign in the triphone notation denote the left and right context in which the context-dependent phones have been trained. The hashes denote the boundaries of the context-independent syllable model.
EURASIP Journal on Audio, Speech, and Music Processing
Several research groups have published promising, but somewhat contradictory, results with longer-length acoustic models [4, 8-12]. Sethy and Narayanan [8] used the above-described mixed-model recognition scheme, combining context-independent word and syllable models with triphones. They reported a 62% relative reduction in word error rate (WER) on TIMIT [16], a database of carefully read and annotated American English. We adopted their method for our research, repeating the recognition experiments on TIMIT and, in addition, carrying out similar experiments on a corpus of Dutch read speech equipped with a coarser annotation. As was the case with other studies [4, 9, 10], the improvements we gained [11, 12] on both corpora were more modest than those that Sethy and Narayanan obtained. Part of the discrepancy between Sethy and Narayanan's impressive improvements and the much more equivocal results of others [4, 9-12] may be due to the surprisingly high baseline WER (26%) Sethy and Narayanan report. We did, however, also find much larger improvements on TIMIT than on the Dutch corpus. The goal of the current study is to shed light on the reasons for the varying results obtained on different corpora. By doing so, we show what is necessary for the successful modelling of pronunciation variation with longer-length acoustic models.
To achieve the goal of this paper, we carry out and compare speech recognition experiments with a mixed-model recogniser and a conventional triphone recogniser. We do this for both TIMIT and the Dutch read speech corpus, carefully minimising the differences between the two corpora and analysing the remaining (intrinsic) differences. Most importantly, we compare results obtained using two sets of triphone models: one trained with manual (or manually verified) transcriptions and the other with canonical transcriptions. By doing so, we investigate the claim that properly initialised and retrained longer-length acoustic models capture a significant amount of pronunciation variation.
Both TIMIT and the Dutch corpus are read speech corpora. As a consequence, they are not representative of all the problems that are typical of spontaneous conversational speech (hesitations, restarts, repetitions, etc.). However, the kinds of fundamental issues related to articulation that this paper addresses are present in all speech styles.

TIMIT
The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus [16] is a database comprising a total of 6300 read sentences: ten sentences read by each of 630 speakers, who represent eight major dialects of American English. Seventy percent of the speakers are male and 30% are female.
Two of the sentences for each speaker are identical, and are intended to delineate the dialectal variability of the speakers. We excluded these two sentences from model training and evaluation. Five of the sentences for each speaker originate from a set of 450 phonetically compact sentences, so that seven different speakers speak each of the 450 sentences. The remaining three sentences for each speaker are unique to that speaker.
The TIMIT data are subdivided into a training set, and two test sets that the TIMIT documentation refers to as the complete test set and the core test set. No sentence or speaker ...
Annika Hämäläinen et al.
We intended to build longer-length models for words and syllables for which a sufficient amount of training data was available. To understand the relation between words and syllables, we analysed the syllabic structure of the words in the corpus. The statistics in the second column of Table 1 show that the large majority of all word tokens were monosyllabic. For these words, there was no difference between word and syllable models. In fact, no multisyllabic words occurred often enough in the training data to warrant the training of multisyllabic word models. Hence, the distinction between word and syllable models becomes redundant, and we will hereafter refer to the longer-length models as syllable models.
According to Greenberg [15], pronunciation variation affects syllable codas and, although to a lesser extent, nuclei more than syllable onsets. To estimate the proportion of syllable tokens that were potentially sensitive to large deviations from their canonical representation, we examined the structure of the syllables in the TIMIT database (see the second column of Table 2). If one considers all consonants after the vowel as coda phonemes, 53.7% of the syllable tokens had coda consonants, and were therefore potentially subject to a considerable amount of pronunciation variation.
TIMIT is manually labelled and includes manually verified phone and word segmentations. For consistency with the experiments on the corpus of Dutch read speech (see Section 2.2), we reduced the original set of phonetic labels to a set of 35 phone labels, as shown in Table 3. To determine the best possible phone mapping, we considered the frequency counts and durations of the original phones, as well as their acoustic similarity with each other. Most importantly, we merged closures with the following bursts and mapped closures appearing on their own to the corresponding bursts. Using the revised set of phone labels, the average number of pronunciation variants per syllable was 2.4. The corresponding numbers of phone substitutions, deletions, and insertions in syllables were 18040, 7617, and 1596, respectively.
Table 3: TIMIT phone mappings. The remaining phonetic labels of the original set were not changed.

CGN
The Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) [17] is a database of contemporary standard Dutch spoken by adults in The Netherlands and Belgium. It contains nearly 9 million words (800 hours of speech), of which approximately two thirds originate from The Netherlands and one third from Belgium. All of the data are transcribed orthographically, lemmatised (i.e., grouped into categories of related word forms identified by a headword), and enriched with part-of-speech information, whereas more advanced transcriptions and annotations are available for a core set of the corpus.
Table 4: CGN phone mapping (columns: original label, new label). The remaining phonetic labels of the original set were not changed.
For this study, we used read speech from the core set; these data originate from the Dutch library for the blind. To make the CGN data more comparable with the carefully spoken TIMIT data, we excluded sentences with tagged particularities, such as incomprehensible words, nonspeech sounds, foreign words, incomplete words, and slips of the tongue, from our experiments. The exclusions left us with 5401 sentences uttered by 125 speakers, of which 44% were male and 56% were female. TIMIT contains some repeated sentences; it therefore has higher frequency counts of individual words and syllables, as well as more homogeneous word contexts. Thus, we carried out the subdivision of the CGN data into the training set and the two test sets in a controlled way aimed at maximising the similarity between the training set and the test set on the one hand, and the training set and the development test set on the other hand. First, we created 1000 possible data set divisions by randomly assigning 75% of the sentences spoken by each speaker to the training set and 12.5% to each of the test sets. Second, for each of the three data sets, we calculated the probabilities of word unigrams, bigrams, and trigrams appearing 30 times or more in the set of 5401 sentences. Finally, we computed Kullback-Leibler distances (KLD) [18] between the training set and the two test sets using the above unigram, bigram, and trigram probability distributions. We made each KLD symmetric by calculating it in both directions and taking the average, so that KLD(p1, p2) = KLD(p2, p1). The overall KLD-based measure used in evaluating the similarity between the data sets was a weighted sum of the KLDs for the unigram probabilities, the bigram probabilities, and the trigram probabilities. As the final data set division, we chose the division with the lowest overall KLD-based measure. The final optimised training set comprised 125 speakers and 4027 sentences, whereas the final test sets contained 125 speakers and 687 sentences each. The third column of Table 1 shows how much data was covered by words with different numbers of syllables. As Table 1 illustrates, the word structure of CGN was highly similar to that of TIMIT. The third column of Table 2 illustrates the proportions of the different types of syllable tokens in CGN. CGN had slightly more CV and CVC syllables than TIMIT, but fewer V syllables.
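The KLD-based selection of the data set division can be sketched as follows. This is a minimal illustration, not the authors' implementation: the probability floor for unseen n-grams, the equal weights, and all function names are assumptions, since the text specifies neither the smoothing nor the weights of the weighted sum.

```python
import math
import random
from collections import Counter

def ngram_counts(sentences, n):
    """Count word n-grams over a list of tokenised sentences."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def frequent_ngrams(sentences, n, min_count=30):
    """N-grams occurring at least min_count times in the full sentence pool."""
    return [g for g, c in ngram_counts(sentences, n).items() if c >= min_count]

def distribution(sentences, ngrams, floor=1e-6):
    """Probabilities of the given n-grams in one data set.
    The floor for unseen n-grams is an assumption, not taken from the paper."""
    if not ngrams:
        return {}
    counts = ngram_counts(sentences, len(ngrams[0]))
    raw = {g: counts[g] + floor for g in ngrams}
    total = sum(raw.values())
    return {g: v / total for g, v in raw.items()}

def kld(p, q):
    """Directed Kullback-Leibler distance between two distributions."""
    return sum(p[g] * math.log(p[g] / q[g]) for g in p)

def symmetric_kld(p, q):
    """Average of the two directed distances."""
    return 0.5 * (kld(p, q) + kld(q, p))

def random_division(by_speaker, rng):
    """Per speaker: about 12.5% of sentences to each test set, the rest to training."""
    train, dev, test = [], [], []
    for sents in by_speaker.values():
        sents = list(sents)
        rng.shuffle(sents)
        n_test = len(sents) // 8
        dev += sents[:n_test]
        test += sents[n_test:2 * n_test]
        train += sents[2 * n_test:]
    return train, dev, test

def division_score(train, dev, test, ngram_sets, weights):
    """Weighted sum of symmetric KLDs between the training set and each test set."""
    score = 0.0
    for ngrams, weight in zip(ngram_sets, weights):
        p_train = distribution(train, ngrams)
        for other in (dev, test):
            score += weight * symmetric_kld(p_train, distribution(other, ngrams))
    return score

def best_division(by_speaker, n_trials=1000, min_count=30,
                  weights=(1.0, 1.0, 1.0), seed=0):
    """Draw n_trials random divisions and keep the one with the lowest score."""
    rng = random.Random(seed)
    pool = [s for sents in by_speaker.values() for s in sents]
    ngram_sets = [frequent_ngrams(pool, n, min_count) for n in (1, 2, 3)]
    candidates = (random_division(by_speaker, rng) for _ in range(n_trials))
    return min(candidates, key=lambda d: division_score(*d, ngram_sets, weights))
```

Here `by_speaker` is a hypothetical mapping from a speaker identifier to that speaker's tokenised sentences. Keeping the n-gram inventory fixed across the three sets makes the distributions directly comparable.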
The CGN data comprised manually verified (broad) phonetic and word labels, as well as manually verified word-level segmentations. Only 35 of the original 46 phonetic labels occurred frequently enough for the robust training of triphones. The remaining phones were mapped to the 35 phones, as shown in Table 4. After reducing the number of phonetic labels, the average number of pronunciation variants per syllable was 1.8. The corresponding numbers of phone substitutions, deletions, and insertions in syllables were 16358, 6755, and 2875, respectively. Compared with TIMIT, the average number of pronunciation variants, as well as the number of substitutions and deletions, was lower. These numerical differences reflect the differences between the transcription protocols of the two corpora. The TIMIT transcriptions were made from scratch, whereas the CGN transcription protocol was based on the verification of a canonical phonemic transcription. In fact, the CGN transcribers changed the canonical transcription if, and only if, the speaker had realised a clearly different pronunciation variant. As a consequence, the CGN transcribers were probably more biased towards the canonical forms than the TIMIT transcribers; hence, the difference between the manual transcriptions and the canonical representations in CGN is smaller than that in TIMIT.

Differences between TIMIT and CGN
Regardless of our efforts to minimise the differences between TIMIT and CGN, there are some intrinsic differences between them. First and foremost, the two corpora represent two distinct (albeit Germanic) languages. Second, TIMIT contains carefully spoken examples of manually designed or selected sentences, whereas CGN comprises sections of books that the speakers read aloud and, in the case of fiction, sometimes also acted out. Due to the differing characters of the two corpora, and regardless of the optimised data set division of the CGN material, TIMIT contains higher frequency counts of individual words and syllables, and more homogeneous word contexts. Because of this, we chose the CGN training and development data sets to be larger than those for TIMIT. A larger training set guaranteed a similar number of syllables with sufficient training data for training syllable models, and a larger development test set ensured that the corresponding syllables occurred frequently enough for determining the minimum number of training tokens for the models. An additional intrinsic difference between the corpora is that TIMIT comprises five times as many speakers as CGN. Due to the relatively small number of CGN speakers, we included speech from all of the speakers in all of the data sets, whereas the TIMIT speakers do not overlap between the different data sets. All in all, each corpus has some characteristics that make the recognition task easier, and others that make it more difficult, as compared with the other corpus. However, we are confident that the effect of these characteristics does not interfere with our interpretation of the results.

Feature extraction
Feature extraction was carried out with a frame shift of 10 milliseconds using a 25-millisecond Hamming window. First-order preemphasis was applied to the signal using a coefficient of 0.97. Twelve Mel-frequency cepstral coefficients and log-energy, together with their first- and second-order time derivatives, were calculated for a total of 39 features. Channel normalisation was applied using cepstral mean normalisation over individual sentences for TIMIT and complete recordings (with a mean duration of 3.5 minutes) for CGN. Feature extraction was performed using HTK [19].
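The front-end parameters above can be made concrete with a small sketch. It is pure Python and only illustrates preemphasis, windowing, cepstral mean normalisation, and the feature count; it is not HTK's implementation, and the function names are hypothetical. The Mel filterbank and cepstral transform are omitted.

```python
import math

def preemphasise(signal, coeff=0.97):
    """First-order preemphasis: y[n] = x[n] - coeff * x[n - 1]."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Hamming window of the given length in samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(signal, sample_rate, win_ms=25, shift_ms=10):
    """Cut the signal into Hamming-windowed frames (25 ms window, 10 ms shift)."""
    win = int(sample_rate * win_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = hamming(win)
    return [[s * w for s, w in zip(signal[start:start + win], window)]
            for start in range(0, len(signal) - win + 1, shift)]

def cepstral_mean_normalise(features):
    """Subtract the per-dimension mean computed over one segment
    (a sentence for TIMIT, a complete recording for CGN)."""
    dims = len(features[0])
    means = [sum(f[d] for f in features) / len(features) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in features]

def feature_dimension(n_cepstra=12, with_energy=True, derivative_orders=2):
    """Static features plus one copy per derivative order."""
    static = n_cepstra + (1 if with_energy else 0)
    return static * (1 + derivative_orders)
```

With the default arguments, `feature_dimension()` gives (12 + 1) x 3 = 39, matching the feature count stated in the text.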

Lexica and language models
The vocabulary consisted of 6100 words for TIMIT and 10535 words for CGN. Apart from nine homographs in TIMIT and five homographs in CGN, each of which had two pronunciations, the recognition lexica comprised a single, canonical pronunciation per word. We did not distinguish homophones from each other. The language models were word-level bigram networks. The test set perplexity, computed on a per-sentence basis using HTK [19], was 16 for TIMIT and 46 for CGN. These numbers reflect the inherent differences between the corpora and the recognition tasks.

Building the speech recognisers
In preparation for building a mixed-model recogniser that employed context-independent syllable models and triphones, we built and tested two recognisers: a triphone and a syllable-model recogniser. The performance of the triphone recogniser determined the baseline performance for each recognition task.

Triphone recogniser
A standard procedure with decision tree state tying was used for training the word-internal triphones. The procedure was based on asking questions about the left and right contexts of each triphone; the decision tree attempted to find the contexts that made the largest difference to the acoustics and that should, therefore, distinguish clusters [19]. First, monophones with 32 Gaussians per state were trained. The manual (or manually verified) phonetic labels and linear segmentation within the manually verified word segmentations were used for bootstrapping the monophones. Then, the monophones were used for performing a sentence-level forced alignment between the manual transcriptions and the training data; the triphones were bootstrapped using the resulting phone segmentations. When carrying out the state tying, the minimum occupancy count that we used for each cluster resulted in about 4000 distinct physical states in the recogniser. We trained and tested these "manual triphones" with up to 32 Gaussians per state.

Syllable-model recogniser
The first step of implementing the syllable-model recogniser was to create a recognition lexicon with word pronunciations consisting of syllables. In this lexicon, syllables were represented in terms of the underlying canonical phoneme sequences. For instance, the word "action" in TIMIT was now represented as the syllable models ae_k and sh_ix_n.
To create the syllable lexicon, we had to syllabify the canonical pronunciations of words. In the case of TIMIT, we used the tsylb2 syllabification software available from NIST [20]. tsylb2 is based on rules that define possible syllable-initial and syllable-final consonant clusters, as well as prohibited syllable-initial consonant clusters [21]. The syllabification software produces a maximum of three alternative syllable clusters as output. Whenever several alternatives were available, we used the alternative based on the maximum onset principle (MOP); the syllable onset comprised as many consonants as possible. In the case of CGN, we used the syllabification available in the CGN lexicon and the CELEX lexical database [22]. As in the case of TIMIT, the syllabification of the words adhered to MOP.
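The maximum onset principle can be illustrated with a short sketch. The vowel and onset inventories below are toy assumptions chosen for the example; they are not the rule set of tsylb2 or of the CGN and CELEX lexica, and the function name is hypothetical.

```python
# Toy inventories (assumptions for illustration only).
VOWELS = {"ae", "ix", "iy", "ah"}
LEGAL_ONSETS = {
    (), ("k",), ("sh",), ("n",), ("s",), ("t",), ("r",),
    ("s", "t"), ("t", "r"), ("s", "t", "r"),
}

def syllabify(phones):
    """Split a phone sequence into syllables, giving each syllable the
    longest legal onset (maximum onset principle)."""
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return [list(phones)]
    syllables = []
    start = 0
    for this, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[this + 1:nxt]
        # Try the longest candidate onset first; the empty onset always matches.
        split = len(cluster)
        for k in range(len(cluster) + 1):
            if tuple(cluster[k:]) in LEGAL_ONSETS:
                split = k
                break
        boundary = this + 1 + split
        syllables.append(list(phones[start:boundary]))
        start = boundary
    syllables.append(list(phones[start:]))
    return syllables

# "action": /ae k sh ix n/. The cluster "k sh" is not a legal onset here,
# so only "sh" moves to the second syllable.
names = ["_".join(s) for s in syllabify(["ae", "k", "sh", "ix", "n"])]
```

With these inventories, `names` holds the two syllable labels used for "action" in the text, ae_k and sh_ix_n.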
After building the syllable lexicon, we initialised the context-independent syllable models with the 8-Gaussian triphone models corresponding to the underlying (canonical) phonemes of the syllables. Reverting to the example word "action" represented as the syllable models ae_k and sh_ix_n, we carried out the initialisation as follows. States 1-3 and 4-6 of the model ae_k were initialised with the state parameters of the 8-Gaussian triphones #-ae+k and ae-k+#, and states 1-3, 4-6, and 7-9 of the model sh_ix_n with the state parameters of the 8-Gaussian triphones #-sh+ix, sh-ix+n, and ix-n+# (see Figure 1). In order to incorporate the spectral and temporal dependencies in the speech, the syllable models with sufficient training data were then trained further using four rounds of Baum-Welch reestimation. To determine the minimum number of training tokens necessary for reliably estimating the model parameters, we built a large number of model sets, starting with a minimum of 20 training tokens per syllable, and increasing the threshold in steps of 20. After each round, we tested the resulting recogniser on the development test set. We continued this process until the WER on the development set stopped decreasing. Eventually, the syllable-model recogniser for TIMIT comprised 3472 syllable models, of which those 43 syllables with a frequency of 160 or higher were trained further. These syllables covered 31% of all the syllable tokens in the training data. The syllable-model recogniser for CGN consisted of 3885 syllable models, the minimum frequency for further training being 130 tokens and resulting in the further training of 94 syllables. These syllables covered 41% of all the syllable tokens in the training data. Syllable models with insufficient training data consisted of a concatenation of the original 8-Gaussian triphone models.
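The mapping from a syllable label to the triphones that initialise it, and to the state ranges they fill, can be sketched as follows. The function names are hypothetical, and three emitting states per triphone are assumed on the basis of the state ranges given for "action".

```python
def syllable_to_triphones(syllable):
    """Expand a syllable label such as 'sh_ix_n' into the triphones that
    initialise it, with '#' marking the syllable boundaries."""
    phones = syllable.split("_")
    padded = ["#"] + phones + ["#"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

def state_ranges(syllable, states_per_phone=3):
    """List (triphone, first_state, last_state) with 1-based state indices
    of the syllable model that each triphone initialises."""
    return [(tri, i * states_per_phone + 1, (i + 1) * states_per_phone)
            for i, tri in enumerate(syllable_to_triphones(syllable))]

# For sh_ix_n this yields states 1-3, 4-6, and 7-9, as in Figure 1.
ranges = state_ranges("sh_ix_n")
```

Only the bookkeeping is shown here; copying the Gaussian mixture parameters and running Baum-Welch reestimation would be done by the HMM toolkit.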

Mixed-model recogniser
We derived the lexicon for the mixed-model recogniser from the syllable lexicon by keeping the further-trained syllables from the syllable-model recogniser and expanding all other syllables to triphones. In effect, the pronunciations in the lexicon consisted of the following: (a) syllables, (b) triphones, or (c) a mixture of syllables and triphones. To use the word "action" as an example, the possible pronunciations were the following: (a) /ae_k sh_ix_n/, (b) /#-ae+k ae-k+sh k-sh+ix sh-ix+n ix-n+#/, (c) /#-ae+k ae-k+# sh_ix_n/, or /ae_k #-sh+ix sh-ix+n ix-n+#/.
The syllable frequencies determined that the actual representation in the lexicon was /#-ae+k ae-k+# sh_ix_n/.
The initial models of the mixed-model recogniser originated from the syllable-model recogniser and the 8-Gaussian triphone recogniser. Four subsequent passes of Baum-Welch reestimation were used to train the mixture of models further. The difference between the syllable-model and mixed-model recognisers was that the triphones underlying the syllables with insufficient training data for further training were concatenated into syllable models in the syllable-model recogniser, whereas they remained free in the mixed-model recogniser. In practice, the triphones whose frequency exceeded the experimentally determined minimum number of training tokens for further training were also trained further in the mixed-model recogniser. The minimum frequency for further training was 20 in the case of TIMIT and 40 in the case of CGN. In the case of TIMIT, the mixed-model recogniser comprised 43 syllable models and 5515 triphones. The mixed-model recogniser for CGN consisted of 94 syllable models and 6366 triphones.
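The derivation of a mixed-model pronunciation from a syllabified word can be sketched as follows. The function names and the `trained` set are hypothetical. The rule that consecutive untrained syllables are expanded jointly into word-internal triphones is inferred from the "action" examples rather than stated in the text.

```python
def expand_run(phones):
    """Expand a run of phones into triphones, with '#' at both run boundaries."""
    padded = ["#"] + phones + ["#"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

def mixed_pronunciation(syllables, trained):
    """Keep further-trained syllables as syllable models and expand every
    run of other syllables into triphones."""
    units, run = [], []
    for syllable in syllables:
        if syllable in trained:
            units += expand_run(run)   # flush any pending triphone run
            run = []
            units.append(syllable)
        else:
            run += syllable.split("_")
    units += expand_run(run)
    return units

# Only sh_ix_n had enough training tokens in the example from the text.
pron = mixed_pronunciation(["ae_k", "sh_ix_n"], trained={"sh_ix_n"})
```

Passing an empty `trained` set reproduces the pure triphone pronunciation (b), and passing both syllables reproduces the pure syllable pronunciation (a).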

Figures 2 and 3 show the recognition results for TIMIT and CGN. We trained and tested manual triphones with up to 32 Gaussians per state; we only present the results for the triphones with 8 Gaussians per state, as they performed the best for both corpora. The use of longer-length acoustic models in both the syllable-model and the mixed-model recognisers resulted in statistically significant gains in the recognition performance (using a significance test for a binomial random variable), as compared with the performance of the triphone recognisers. However, the performance of the syllable-model and of the mixed-model recognisers did not significantly differ from each other. In the case of TIMIT, the relative reduction in WER achieved by going from triphones to a mixed-model recogniser was 28%. For CGN, the figure was a more modest 18%. Overall, the results for CGN were slightly worse than those for TIMIT. This can, however, be explained by the large difference in the test set perplexities (see Section 3.2).
The second and third columns of Tables 5 and 6 present the TIMIT and CGN WERs as a function of syllable count when using the triphone and mixed-model recognisers. The effect of the number of syllables is prominent: the probability of ASR errors in the case of monosyllabic words is more than five times the probability of errors in the case of polysyllabic words. This confirms what has been observed in previous ASR research: the more syllables a word has, the less susceptible it is to recognition errors. This can be explained by the fact that a large proportion of monosyllabic words are function words that tend to be unstressed and (heavily) reduced. Polysyllabic words, on the other hand, are more likely to be content words that are less prone to heavy reductions. The fourth columns of Tables 5 and 6 show the percentage change in the WERs when going from the triphones to the mixed-model recognisers. For TIMIT, the introduction of syllable models results in a 50% reduction in WER in the case of bisyllabic and trisyllabic words. For CGN, the situation is different. The WER does decrease for bisyllabic words, but only by 11%. The WER for trisyllabic words remains unchanged. We believe that this is due to a larger proportion of bisyllabic and trisyllabic words with syllable deletions in CGN. Going from triphones to syllable models without adapting the lexical representations will obviously not help if complete syllables are deleted.

ANALYSING THE DIFFERENCES
The 28% and 18% relative reductions in WER that we achieved fall short of the 62% relative reduction in WER that Sethy and Narayanan [8] present. Other studies have also used syllable models with varying success. The absolute improvement in recognition accuracy that Sethy et al. [9] obtained with mixed models was only 0.5%, although the comparison with the Sethy and Narayanan study might not be fair for at least two reasons. First, Sethy et al. used a cross-word left-context phone recogniser, the performance of which is undoubtedly more difficult to improve upon than that of a word-internal context-dependent phone recogniser. Second, their recognition task was particularly challenging with a large amount of disfluencies, heavy accents, age-related coarticulation, language switching, and emotional speech. On the other hand, however, the best performance was achieved using a dual pronunciation recogniser in which each word had both a mixed syllabic-phonetic and a pure phonetic pronunciation variant in the recognition lexicon. Even though Jouvet and Messina [10] employed a parameter sharing method that allowed them to build context-dependent syllable models, the gains from including longer-length acoustic models were small and depended heavily on the recognition task: for telephone numbers, the performance even decreased. In any case, it appears that the improvements on TIMIT, as reported by Sethy and Narayanan and ourselves, are the largest.
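Because the comparison above mixes relative and absolute figures, a small sketch of the arithmetic may help. The function names are hypothetical. The only inputs taken from the text are the 26% baseline WER and the 62% relative reduction reported in [8]; the WER implied by those two figures is derived here and not quoted from any paper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction, e.g. 0.28 for a 28% relative reduction."""
    return (baseline_wer - new_wer) / baseline_wer

def absolute_reduction(baseline_wer, new_wer):
    """Absolute WER reduction in percentage points."""
    return baseline_wer - new_wer

def wer_after_relative_reduction(baseline_wer, relative):
    """WER implied by a baseline and a relative reduction."""
    return baseline_wer * (1.0 - relative)

# A 62% relative reduction from a 26% baseline implies a WER a little under 10%.
implied_wer = wer_after_relative_reduction(26.0, 0.62)
```

The same relative reduction corresponds to a much smaller absolute gain when the baseline WER is low, which is one reason a high baseline makes large relative improvements easier to obtain.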
Obviously, using syllable models only improves recognition performance in certain conditions. To understand what these conditions are, we carried out a detailed analysis of the differences between the TIMIT and CGN experiments. First, we examined the possible effects of linguistic and phonetic differences between the two corpora. Second, since it is only reasonable to expect improvements in recognition performance if the acoustic models differ between the recognisers, we investigated the differences between the retrained syllable models and the triphones used to initialise them.

Structure of the corpora
In our experiments, we only manipulated the acoustic m od els, keeping the language models constant. As a consequence, any changes in the WERs are dependent on the so-called acoustic perplexity (or confusability) of the tasks [23]. One should expect a larger gain from better acoustic modelling if the task is acoustically more difficult. The proportion of monosyllabic and polysyllabic words in the test sets pro vides a coarse approximation of the acoustic perplexity of a recognition task. Table 1, as well as Tables 5 and 6, suggest that TIMIT and CGN do not substantially differ in terms of acoustic perplexity.
Another difference that might affect the recognition re sults is that the speakers in the TIMIT training and test sets do not overlap, whereas the CGN speakers appear in all three data sets. One might argue that long-span articulatory de pendencies are speaker-dependent. Therefore, one would ex pect syllable models to lead to a larger improvement in the case of CGN, and not vice versa. So, this difference certainly does not explain the discrepancy in the recognition perfor mance.
Articulation rate is known to be a factor that affects the performance of automatic speech recognisers. Thus, we wanted to know whether the articulation rates of TIMIT and CGN differed. We defined the articulation rate as the n u m ber of canonical phones per second of speech. The rates were 12.8 phones/s for TIMIT and 13.1 phones/s for CGN, a dif ference that seems far too small to have an impact.
We also checked for other differences between the corpora, such as the number of pronunciation variants and the durations of syllables. However, we were not able to identify any linguistic or phonetic properties of the corpora that could possibly explain the differences in the performance gain.

Effect of further training
To investigate what happens when syllable models are trained further from the sequences of triphones used for initialising them, we calculated the distances between the probability density functions (pdfs) of the HMM states of the retrained syllable models and the pdfs of the corresponding states of the initialised syllable models in terms of the Kullback-Leibler distance (KLD) [18]. Figures 4 and 5 illustrate the KLD distributions for TIMIT and CGN. The distributions differ from each other substantially, the KLDs generally being higher in the case of TIMIT. This implies that the further training affected the TIMIT models more than the CGN models. Given the greater impact of the longer-length models on the recognition performance for TIMIT, this is what one would expect.
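To make the distance measure concrete, the following is a minimal sketch of the Kullback-Leibler divergence between two state pdfs, under the simplifying assumption that each state is a single Gaussian with diagonal covariance. The function names are hypothetical, and the symmetrised variant is only one common way of turning the divergence into a distance; the exact formulation used in [18] may differ. For states modelled with Gaussian mixtures there is no closed form, and an approximation would be needed instead.

```python
import math
from typing import Sequence


def kl_diag_gaussians(mean_p: Sequence[float], var_p: Sequence[float],
                      mean_q: Sequence[float], var_q: Sequence[float]) -> float:
    """Closed-form KL(p || q) for two diagonal-covariance Gaussians.

    Per dimension: 0.5 * (log(vq / vp) + (vp + (mp - mq)^2) / vq - 1).
    """
    if not (len(mean_p) == len(var_p) == len(mean_q) == len(var_q)):
        raise ValueError("all parameter vectors must have the same length")
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kl(mean_p: Sequence[float], var_p: Sequence[float],
                 mean_q: Sequence[float], var_q: Sequence[float]) -> float:
    """Symmetrised divergence KL(p || q) + KL(q || p) (an assumed variant)."""
    return (kl_diag_gaussians(mean_p, var_p, mean_q, var_q)
            + kl_diag_gaussians(mean_q, var_q, mean_p, var_p))


# Toy one-dimensional example: an "initialised" state and a "retrained"
# state whose mean has shifted by one unit, both with unit variance.
initial_mean, initial_var = [0.0], [1.0]
retrained_mean, retrained_var = [1.0], [1.0]
toy_kld = symmetric_kl(initial_mean, initial_var, retrained_mean, retrained_var)
```

In this reading, a state whose parameters barely move during retraining yields a value close to zero, whereas a state whose mean or variance changes substantially yields a large value.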
There were two possible reasons for the larger impact of the further training on the TIMIT models. Either the boundaries of the syllable models with the largest KLDs had shifted substantially, or the effect was due to the switch from the manually labelled phones to the retrained canonical representations of the syllable models. Since syllable segmentations obtained through forced alignment did not show major differences, we pursued the issue of potential discrepancies between manual and canonical transcriptions. To that end, we performed additional speech recognition experiments, in which triphones were trained using the canonical transcriptions of the uttered words. These "canonical triphones" were then used for building the syllable-model and mixed-model recognisers.
In the case of TIMIT, the mixed-model recogniser based on canonical triphones contained 86 syllable models that had been trained further within the syllable-model recogniser using a minimum of 100 tokens. The corresponding syllables covered 42% of all the syllable tokens in the training data. The mixed-model recogniser for CGN comprised 89 syllable models trained further using a minimum of 140 tokens, and the corresponding syllables covered 56% of all the syllable tokens in the training data. Further Baum-Welch reestimation was not necessary for the mixture of triphones and syllable models; tests on the development test set showed that training the mixture of models further would not lead to improvements in the recognition performance. This was different from the syllable models initialised with the manual triphones; tests on the development test set showed that the mixture of models should be trained further for optimal performance. With hindsight, this is not surprising. As a result of the retraining, the syllable models initialised in the two different ways became very similar to each other. However, the syllable models that were initialised with the manual triphones were acoustically further away from this final "state" than the syllable models that were initialised with the canonical triphones and, therefore, needed more reestimation rounds to conform to it.
Figures 6 and 7 present the results for TIMIT and CGN. The best performing triphones had 8 Gaussian mixtures per state in the case of TIMIT and 16 Gaussian mixtures per state in the case of CGN. Surprising as it may seem, the results obtained with the canonical triphones substantially outperformed the results achieved with the manual triphones (see Figures 2 and 3). In fact, the canonical triphones even outperformed the original mixed-model recognisers (see Figures 2 and 3).
The performances of the mixed-model recognisers containing syllable models trained with the two differently trained sets of triphones did not differ significantly at the 95% confidence level. In addition, the performance of the canonical triphones was similar to that of the new mixed-model recognisers. Smaller KLDs between the initial and the retrained syllable models (see Figures 8 and 9) reflected the lack of improvement in the recognition performance. Evidently, only a few syllable models benefited from the further training, leaving the overall effect on the recognition performance negligible. These results are in line with results from other studies [4,9,10], in which improvements achieved with longer-length acoustic models are small, and deteriorations also occur.
The second and third columns of Tables 7 and 8 present the TIMIT and CGN WERs as a function of syllable count when using the triphone and mixed-model recognisers. As in the case of the experiments with manual triphones (see Tables 5 and 6), the probability of errors was considerably higher for monosyllabic words than for polysyllabic words. The fourth columns of the tables show the percentage change in the WERs when going from the triphones to the mixed-model recognisers. The data suggest that the introduction of syllable models might deteriorate the recognition performance, particularly in the case of bisyllabic words. This may be due to the context independence of the syllable models and the resulting loss of left or right context information at the syllable boundary. As words tend to get easier to recognise as they get longer (see Section 5.1), the words with more than two syllables do not seem to suffer from this effect.
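The percentage change reported in the fourth columns can be illustrated with a short sketch. It assumes that the change is computed relative to the triphone WER, which the text does not state explicitly; the function name and the numbers are hypothetical and are not taken from Tables 7 and 8.

```python
def relative_wer_change(wer_triphone: float, wer_mixed: float) -> float:
    """Percentage change in WER when going from triphones to mixed models.

    A positive value means that the mixed-model recogniser made more
    errors than the triphone recogniser; a negative value is a gain.
    """
    if wer_triphone <= 0:
        raise ValueError("the baseline WER must be positive")
    return 100.0 * (wer_mixed - wer_triphone) / wer_triphone


# Hypothetical WERs (in %) for one syllable-count category.
toy_change = relative_wer_change(wer_triphone=20.0, wer_mixed=22.0)
```

Under this assumption, an increase from 20% to 22% WER counts as a 10% relative deterioration, even though the absolute difference is only two percentage points.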
The most probable explanation for the finding that the canonical triphones outperform the manual triphones is the mismatch between the representations of speech during training and testing. While careful manual transcriptions yield more accurate acoustic models, the advantage of these models can only be reaped if the recognition lexicon contains a corresponding level of information about the pronunciation variation present in the speech [24]. Thus, at least part, if not all, of the performance gain obtained with retrained syllable models in the first set of experiments (and probably also in Sethy and Narayanan's work [8]) resulted from the reduction of the mismatch between the representations of speech during training and testing. Because the manual transcriptions in CGN were closer to the canonical transcriptions than those in TIMIT (see Section 2.2), the mismatch was smaller for CGN. This also explains why the impact of the syllable models was smaller for CGN.

DISCUSSION
So far, explicit pronunciation variation modelling has made a disappointing contribution to improving speech recognition performance [25]. There are many different ways to attempt implicit modelling. To avoid the increased lexical confusability of a multiple pronunciation lexicon, Hain [25] focused on finding a single optimal phonetic transcription for each word in the lexicon. Our study confirms that a single pronunciation that is consistently used both during training and during recognition is to be preferred over multiple pronunciations derived from careful phonetic transcriptions. This is in line with McAllaster and Gillick's [5] findings, which also suggest that consistency between (potentially inaccurate) symbolic representations used in training and recognition is to be preferred over accurate representations in the training phase if these cannot be carried over to the recognition phase.
The focus of the present study was on implicit modelling of long-span coarticulation effects by using syllable-length models instead of the context-dependent phones that conventional automatic speech recognisers use. We expected Baum-Welch reestimation of these models to capture phonetic detail that cannot be accounted for by means of explicit pronunciation variation modelling at the level of phonetic transcriptions in the recognition lexicon. Because of the changes we observed between the initial and the retrained syllable models (see Figures 8 and 9), we do believe that retraining the observation densities incorporates coarticulation effects into the longer-length models. However, the corresponding recognition results (see Figures 6 and 7) show that this is not sufficient for capturing the most important effects of pronunciation variation at the syllable level.
Greenberg [15], amongst other authors, has shown that while syllables are seldom deleted completely, they do display considerable variation in the identity and number of the phonetic symbols that best reflect their pronunciation. Greenberg and Chang [26] showed that there is a clear relation between recognition accuracy and the degree to which the acoustic and lexical models reflect the actual pronunciation. Not surprisingly, the match (or mismatch) between the knowledge captured in the models on the one hand and the actual articulation on the other is dependent on linguistic (e.g., prosody, context) as well as nonlinguistic (e.g., speaker identity, speaking rate) factors. Sun and Deng [27] tried to model the variation in terms of articulatory features that are allowed to overlap in time and change asynchronously. Their recognition results on TIMIT are much worse than what we obtained with a more conventional approach.
We believe that the aforementioned problems are caused by the fact that part of the variation in speech (e.g., phone deletions and insertions) results in very different trajectories in the acoustic parameter space. These differently shaped trajectories are not easy to model with observation densities if the model topology is identical for all variants. We believe that pronunciation variation could be modelled better by using syllable models with parallel paths that represent different pronunciation variants, and by reestimating these parallel paths to better incorporate the dynamic nature of articulation. Therefore, our future research will focus on strategies for developing multipath model topologies for syllables.

CONCLUSIONS
This paper contrasted recognition results obtained using longer-length acoustic models for Dutch read speech from a library for the blind with recognition results achieved on American English read speech from TIMIT. The topologies and model parameters of the longer-length models were initialised by concatenating the triphone models underlying their canonical transcriptions. The initialised models were then trained further to incorporate the spectral and temporal dependencies in speech into the models. When manually labelled speech was used to train the triphones, mixed-model recognisers comprising syllable-length and phoneme-length models substantially outperformed the triphone recognisers. At first sight, these results seemed to corroborate the claim that properly initialised and retrained longer-length acoustic models capture a significant amount of pronunciation variation. However, detailed analyses showed that the effect of training syllable-sized models further is negligible if canonical representations of the syllables are initialised with triphones trained with the canonical transcriptions of the training corpus. Therefore, we conclude that single-path syllable models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.