Using VTLN matrices for rapid and computationally-efficient speaker adaptation with robustness to first-pass transcription errors

نویسندگان

  • Shakti Prasad Rath
  • Srinivasan Umesh
  • Achintya Kumar Sarkar
چکیده

In this paper, we propose to combine the rapid adaptation capability of conventional Vocal Tract Length Normalization (VTLN) with the computational efficiency of transform-based adaptation such as MLLR or CMLLR. VTLN requires the estimation of only one parameter and is, therefore, most suited for the cases where there is little adaptation data (i.e. rapid adaptation). In contrast, transform-based adaptation methods require the estimation of matrices. However, the drawback of conventional VTLN is that it is computationally expensive since it requires multiple spectral-warping to generate VTLN-warped features. We have recently shown that VTLN-warping can be implemented by a linear-transformation (LT) of the conventional MFCC features. These LTs are analytically pre-computed and stored. In this frame-work of LT VTLN, computational complexity of VTLN is similar to transform-based adaptation since warp-factor estimation can be done using the same sufficient statistics as that are used in CMLLR. We show that VTLN provides significant improvement in performance when there is small adaptation data as compared to transform-based adaptation methods. We also show that the use of an additional decorrelating transform, MLLT, along with the VTLN-matrices, gives performance that is better than MLLR and comparable to SAT with MLLT even for large adaptation data. Further we show that in the mismatched train and test case (i.e. poor first-pass transcription), VTLN provides significant improvement over the transform-based adaptation methods. We compare the performances of different methods on the WSJ, the RM and the TIDIGITS databases.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Framework Of Feature Based Adaptation For Statistical Speech Synthesis And Recognition

The advent of statistical parametric speech synthesis has paved new ways to a unified framework for hidden Markov model (HMM) based text to speech synthesis (TTS) and automatic speech recognition (ASR). The techniques and advancements made in the field of ASR can now be adopted in the domain of synthesis. Speaker adaptation is a well-advanced topic in the area of ASR, where the adaptation data ...

متن کامل

Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC

Vocal Tract Length Normalization (VTLN) for standard filterbank-based Mel Frequency Cepstral Coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion (Lee and Rose, 1998). A linear transform (LT) equivalent for frequency warping (FW) would enable more efficie...

متن کامل

Fast Speaker Normalization and Adaptation based on BIC for Meeting Speech Recognition

This paper presents a unified method for speech segmentation, speaker normalization of spectral features, and speaker adaptation of acoustic model for efficient meeting speech recognition. In the proposed method, input speech is segmented based on BIC (Bayesian Information Criterion), and compared against each speaker’s statistic in the training corpus of the acoustic model based on the BIC. Fa...

متن کامل

Tree-based estimation of speaker characteristics for speech recognition

Speaker adaptation by means of adjustment of speaker characteristic properties, such as vocal tract length, has the important advantage compared to conventional adaptation techniques that the adapted models are guaranteed to be realistic if the description of the properties are. One problem with this approach is that the search procedure to estimate them is computationally heavy. We address the...

متن کامل

Online Speaker Adaptation with Pre-Computed FMLLR Transformations

This paper presents a memory efficient single pass speech recognizer that makes use of pre-computed FMLLR transformations for online speaker adaptation. For that purpose we apply unsupervised segment clustering to the training corpus, create a transformation matrix for each cluster, and train a text-independentGaussian mixture classifier for cluster selection during runtime. We use the RWTH Aac...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009