Vocabulary Decomposition for Estonian Open Vocabulary Speech Recognition

نویسندگان

  • Antti Puurula
  • Mikko Kurimo
چکیده

Speech recognition in many morphologically rich languages suffers from a very high out-of-vocabulary (OOV) ratio. Earlier work has shown that vocabulary decomposition methods can practically solve this problem for a subset of these languages. This paper compares various vocabulary decomposition approaches to open vocabulary speech recognition, using Estonian speech recognition as a benchmark. Comparisons are performed utilizing large models of 60000 lexical items and smaller vocabularies of 5000 items. A large vocabulary model based on a manually constructed morphological tagger is shown to give the lowest word error rate, while the unsupervised morphology discovery method Morfessor Baseline gives marginally weaker results. Only the Morfessor-based approach is shown to adequately scale to smaller vocabulary sizes.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Limited-Vocabulary Estonian Continuous Speech Recognition System using Hidden Markov Models

The article presents a limited-vocabulary speaker independent continuous Estonian speech recognition system based on hidden Markov models. The system is trained using an annotated Estonian speech database of 60 speakers, approximately 4 hours in duration. Words are modelled using clustered triphones with multiple Gaussian mixture components. The system is evaluated using a number recognition ta...

متن کامل

Open and extendable speech recognition application architecture for mobile environments

This paper describes a cloud-based speech recognition architecture primarily intended for mobile environments. The system consists of a speech recognition server and a client for the Android mobile operating system. The system enables to implement Android’s Voice Input functionality for languages that are not supported by the default Google implementation. The architecture supports both large v...

متن کامل

Estonian Large Vocabulary Speech Recognition System for Radiology

This paper describes implementation and evaluation of an Estonian large vocabulary continuous speech recognition system prototype for the radiology domain. We used a 44 million word corpus of radiology reports to build a word trigram language model. We recorded a test set of dictated radiology reports using ten radiologists. Using speaker independent speech recognition, we achieved a 9.8% word ...

متن کامل

Large Vocabulary Continuous Speech Recognition for Estonian Using Morpheme Classes

This paper describes development of a large vocabulary continuous speaker independent speech recognition system for Estonian. Estonian is an agglutinative language and the number of different word forms is very large, in addition, the word order is relatively unconstrained. To achieve a good language coverage, we use pseudo-morphemes as basic units in a statistical trigram language model. To im...

متن کامل

Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages

We analyze subword-based language models (LMs) in large-vocabulary continuous speech recognition across four “morphologically rich” languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. By estimating n-gram LMs over sequences of morphs instead of words, better vocabulary coverage and reduced data sparsity is obtained. Standard word LMs suffer from high out-of-vocabulary (OOV) r...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007