Using Prosody in ASR: the Segmentation of Broadcast Radio News
Abstract
This study explores how prosodic information can be used in Automatic Speech Recognition (ASR). A system was built which automatically identifies topic boundaries in a corpus of broadcast radio news. We evaluate the effectiveness of different types of features, including textual, durational, F0, Tilt and ToBI features in that system. These features were suggested by a review of the literature on how topic structure is indicated by humans and recognised by both humans and machines from both a linguistic and natural language processing standpoint. In particular, we investigate whether acoustic cues to prosodic information can be used directly to indicate topic structure, or whether it is better to derive discourse structure from intonational events, such as ToBI events, in a manner suggested by Steedman’s (2000) theory, among others. It was found that the global F0 properties of an utterance (mean and maximum F0) and textual features (based on Hearst’s (1997) lexical scores and cue phrases) were effective in recognising topic boundaries on their own whereas all other features investigated were not. Performance using Tilt and ToBI features was disappointing, although this could have been because of inaccuracies in estimating these parameters. We suggest that different acoustic cues to prosody are more effective in recognising discourse information at certain levels of discourse structure than others. The identification of higher level structure is informed by the properties of lower level structure. Although the findings of this study were not conclusive on this issue, we propose that prosody in ASR and synthesis should be represented in terms of the intonational events relevant to each level of discourse structure. Further, at the level of topic structure, a taxonomy of events is needed to describe the global F0 properties of each utterance that makes up that structure.
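The two feature families the abstract reports as effective on their own, global F0 properties and lexical scores, can be illustrated with a minimal sketch. This is not the study's implementation: the function names, the convention that unvoiced frames are stored as 0, and the use of cosine similarity between adjacent word blocks (in the spirit of Hearst's block comparison) are all assumptions made here for illustration.

```python
import math
from collections import Counter


def global_f0_features(f0_track):
    """Mean and maximum F0 (Hz) of one utterance.

    Assumption: f0_track is a list of per-frame F0 values in which
    unvoiced frames are stored as 0 and are therefore skipped.
    """
    voiced = [f for f in f0_track if f > 0]
    if not voiced:
        return {"mean_f0": 0.0, "max_f0": 0.0}
    return {"mean_f0": sum(voiced) / len(voiced), "max_f0": max(voiced)}


def lexical_cohesion(left_words, right_words):
    """Cosine similarity between the word-count vectors of two adjacent blocks.

    A low score between neighbouring blocks is a hint (not proof) that a
    topic boundary lies between them.
    """
    left, right = Counter(left_words), Counter(right_words)
    dot = sum(left[w] * right[w] for w in left)
    norm = math.sqrt(sum(c * c for c in left.values())) * math.sqrt(
        sum(c * c for c in right.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the corpus used in the study.
    print(global_f0_features([0, 110.0, 180.0, 0, 140.0]))
    print(lexical_cohesion("rates rise again".split(), "storm hits coast".split()))
```

In a boundary classifier of the kind described, values like these would be computed for each candidate boundary and passed on as features alongside cue-phrase indicators.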
Similar resources
Modeling Broadcast News Prosody Using Conditional Random Fields for Story Segmentation
This paper proposes to model broadcast news prosody using conditional random fields (CRF) for news story segmentation. Broadcast news has both editorial prosody and speech prosody that convey essential structural information for story segmentation. Hence we extract prosodic features, including pause duration, pitch, intensity, rapidity, speaker change and music, for a sequence of boundary candi...
Segmentation of Automatically Transcribed Broadcast News Text
Expertise in the automatic transcription of broadcast speech has progressed to the point of being able to use the resulting transcripts for information retrieval purposes. In this paper, we describe the Segmentation system used by Dragon Systems in the Segmentation task of the 1998 TDT evaluation, highlighting improvements made since the September 1998 dryrun. Segmentation of closed-caption and...
The Czech speech and prosody database both for ASR and TTS purposes
This paper describes a preparation of the first large Czech prosodic database which should be useful both in automatic speech recognition (ASR) and text-to-speech (TTS) synthesis. In the area of ASR we intend to use it for an automatic punctuation annotation, in the area of TTS for building a prosodic module for the Czech high-quality synthesis. The database is based on the Czech Radio&TV Broad...
The need to create a media block for the convergence of overseas news networks
As a general diplomacy arm of the Islamic Republic of Iran, VoSiMa has extensive activities in international broadcasting of its radio and television programs. These programs are broadcast in different languages, such as English, French, Azeri, Arabic, and ... for regional and transnational audiences. The large volume of the organization's international activities is in the form of news and new...
Combining Words and Speech Prosody for Automatic Topic Segmentation
We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topic units. The approach combines hidden Markov models, statistical language models, and prosody-based decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach o...
Publication date: 2002