Average-Voice-Based Speech Synthesis

Author

  • Junichi Yamagishi
Abstract

This thesis describes a novel speech synthesis framework, "Average-Voice-based Speech Synthesis." Using this framework, synthetic speech of arbitrary target speakers can be obtained robustly and steadily even when very little speech data is available for the target speaker. The framework consists of a speaker normalization algorithm for parameter clustering, a speaker normalization algorithm for parameter estimation, a transformation/adaptation part, and a part that modifies the rough transformation. In parameter clustering using decision-tree-based context clustering for the average voice model, the nodes of the decision tree do not always have training data from all speakers, and some nodes have data from only one speaker. Such speaker-biased nodes degrade the quality of the average voice and of synthetic speech after speaker adaptation, especially in prosody. We therefore first propose a new context clustering technique, named "shared-decision-tree-based context clustering," to overcome this problem. With this technique, every node of the decision tree always has training data from all speakers included in the training speech database. As a result, we can construct a decision tree common to all training speakers, and the distribution of each node always reflects the statistics of all speakers. However, when the training data differ widely among training speakers, the node distributions are often biased by speaker and/or gender, which degrades the quality of synthetic speech. We therefore incorporate "speaker adaptive training" into the parameter estimation procedure of the average voice model to reduce the influence of speaker dependence.
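The constraint behind shared-decision-tree-based context clustering can be illustrated with a minimal sketch. It assumes, hypothetically, that each training sample is a (speaker, context) pair and that a candidate question is a yes/no predicate on the context. The names `has_all_speakers` and `split_is_shared` are illustrative and do not come from the thesis, and the actual splitting criterion used there is not reproduced here. The sketch only checks that both child nodes of a split still contain data from every training speaker.

```python
def has_all_speakers(samples, speakers):
    """Return True if the node's samples cover every training speaker."""
    return set(speakers) <= {spk for spk, _ in samples}


def split_is_shared(samples, question, speakers):
    """Check a candidate split against the shared-tree constraint.

    samples  : list of (speaker, context) pairs at the current node
    question : predicate on a context (hypothetical yes/no question)
    speakers : all speakers in the training database
    """
    yes = [(s, c) for s, c in samples if question(c)]
    no = [(s, c) for s, c in samples if not question(c)]
    # Accept the split only if neither child becomes speaker-biased.
    return has_all_speakers(yes, speakers) and has_all_speakers(no, speakers)


# Toy data (assumed for illustration): a context is a small dict.
speakers = ["A", "B"]
samples = [
    ("A", {"vowel": True}), ("A", {"vowel": False}),
    ("B", {"vowel": True}), ("B", {"vowel": False}),
]
biased = samples[:3]  # speaker B has no non-vowel sample here

ok = split_is_shared(samples, lambda c: c["vowel"], speakers)
rejected = split_is_shared(biased, lambda c: c["vowel"], speakers)
```

In this toy case the first split is accepted because both children contain speakers A and B, while the second is rejected because the "no" child would hold data from speaker A only.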
In speaker adaptive training, the difference between a training speaker's voice and the average voice is assumed to be expressed as a simple linear regression function of the mean vector of the distribution, and a canonical average voice model is estimated under this assumption. In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. We therefore use the framework of the "hidden semi-Markov model" (HSMM), an HMM with explicit state duration distributions, and propose an HSMM-based model adaptation algorithm that simultaneously transforms both state output and state duration distributions. Furthermore, we propose an HSMM-based speaker adaptive training algorithm that normalizes both the state output and state duration distributions of the average voice model at the same time. Finally, we explore several speaker adaptation algorithms to transform more effectively the average voice …
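As a rough illustration of the linear regression idea described above, the sketch below applies an affine transform to a state's output mean vector and, in the same step, to its duration mean. The affine form (a matrix plus a bias for the output mean, a scale plus a bias for the duration mean) is an assumption in the style of common linear-regression adaptation; the function names, the dictionary layout, and all numeric values are hypothetical and are not taken from the thesis. Estimation of the transforms from adaptation data is not shown.

```python
def affine_transform(mean, A, b):
    """Apply mu_hat = A * mu + b to one mean vector (plain Python lists)."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]


def adapt_state(state, out_A, out_b, dur_scale, dur_bias):
    """Transform both the output mean and the duration mean of one state.

    The state layout and the scalar duration transform are assumptions
    made for this sketch only.
    """
    return {
        "output_mean": affine_transform(state["output_mean"], out_A, out_b),
        "duration_mean": dur_scale * state["duration_mean"] + dur_bias,
    }


# Hypothetical average-voice state and hand-picked transform parameters.
average_state = {"output_mean": [1.0, 2.0], "duration_mean": 5.0}
adapted = adapt_state(
    average_state,
    out_A=[[1.0, 0.0], [0.0, 2.0]],
    out_b=[0.5, -1.0],
    dur_scale=2.0,
    dur_bias=1.0,
)
```

Transforming the output and duration parameters together mirrors the motivation stated in the abstract: voice characteristics and prosodic features such as phone duration are converted at the same time rather than in separate passes.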

Similar resources

Acoustic model training based on linear transformation and MAP modification for HSMM-based speech synthesis

This paper describes the use of combined linear regression and ex-post MAP methods for an average-voice-based speech synthesis system based on HMMs. To generate more natural-sounding speech with the average-voice-based speech synthesis system when a large amount of training data is available, we apply ex-post MAP estimation after the linear-transformation-based adaptation. We investigate how the am...

Conversational spontaneous speech synthesis using average voice model

This paper describes conversational spontaneous speech synthesis based on the hidden Markov model (HMM). To reduce the amount of data required for model training, we utilize an average-voice-based speech synthesis framework, which has been shown to be effective for synthesizing speech with an arbitrary speaker's voice using a small amount of training data. We examine several kinds of average voice mod...

Improved average-voice-based speech synthesis using gender-mixed modeling and a parameter generation algorithm considering GV

To construct a speech synthesis system that can produce diverse voices, we have been developing a speaker-independent approach to HMM-based speech synthesis in which statistical average voice models are adapted to a target speaker using a small amount of speech data. In this paper, we incorporate a high-quality speech vocoding method, STRAIGHT, and a parameter generation algorithm with globa...

Study on Unit-Selection and Statistical Parametric Speech Synthesis Techniques

One interesting topic in the multimedia domain is enabling computers to produce speech. Speech synthesis grants computers the human ability to produce speech. The data-based approach and the process-based approach are the two main approaches to speech synthesis, and each has its own challenges. Unit-selection speech synthesis and statistical parametr...

Text-to-speech synthesis with arbitrary speaker's voice from average voice

This paper describes a technique for synthesizing speech with any desired voice. The technique is based on an HMM-based text-to-speech (TTS) system and the MLLR adaptation algorithm. To generate speech of an arbitrarily given target speaker, speaker-independent speech units, i.e., average voice models, are adapted to the target speaker using the MLLR framework. In addition to spectrum and pitch adaptati...

Publication date: 2006