Connected Text Recognition Using Layered HMMs and Token Passing

Author

  • Peter Ingels
Abstract

We present a novel approach to lexical error recovery on textual input. An advanced robust tokenizer has been implemented that can not only correct spelling mistakes, but also recover from segmentation errors. Apart from the orthographic considerations taken, the tokenizer also makes use of linguistic expectations extracted from a training corpus. The idea is to arrange Hidden Markov Models (HMMs) in multiple layers, where the HMMs in each layer are responsible for different aspects of the processing of the input. We report on experimental evaluations with alternative probabilistic language models to guide the lexical error recovery process.
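
The abstract gives no implementation details, so the following is only a rough illustration of the general idea, not the author's method. It passes one best-scoring "token" (a cost plus a word history) to each character position, combining a character-level cost with a word-level language model. Everything here is an assumption made for illustration: the toy `LEXICON`, the unigram probabilities, the `error_penalty` value, and the use of plain edit distance in place of the layered HMMs described above.

```python
import math

# Hypothetical toy lexicon with unigram probabilities (not from the paper).
LEXICON = {"the": 0.5, "cat": 0.2, "sat": 0.2, "a": 0.1}


def edit_distance(a, b):
    """Levenshtein distance; a crude stand-in for a character-level error model."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def recover(text, max_len=6, error_penalty=4.0):
    """Return the lowest-cost word sequence for possibly corrupt input."""
    chars = text.replace(" ", "")   # treat the given segmentation as unreliable
    n = len(chars)
    # best[i] holds the best token (cost, word history) ending at position i.
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        if best[i] is None:
            continue
        cost, words = best[i]
        # Pass the token forward over every chunk and every lexicon word.
        for j in range(i + 1, min(n, i + max_len) + 1):
            chunk = chars[i:j]
            for word, prob in LEXICON.items():
                c = (cost - math.log(prob)
                     + error_penalty * edit_distance(chunk, word))
                if best[j] is None or c < best[j][0]:
                    best[j] = (c, words + [word])
    return best[n][1]
```

Calling `recover("th ecat sat")` repairs the misplaced space, and `recover("the czt sat")` repairs the substituted letter. A fuller treatment would replace the unigram scores with a contextual language model and the edit distance with a trained model of typing errors.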

Similar resources

Data driven subword unit modeling for speech recognition and its application to interactive reading tutors

This paper proposes a novel token-passing search architecture for supporting subword unit based speech recognition and a corresponding algorithm based on the well-known LZW text compression method to determine a vocabulary of subword units in an unsupervised manner. We compare our subword unit selection algorithm to an existing approach based on Minimum Description Length (MDL) modeling and als...

A Robust Text Processing Technique Applied to Lexical Error Recovery

This thesis addresses automatic lexical error recovery and tokenization of corrupt text input. We propose a technique that can automatically correct mis-spellings, segmentation errors and real-word errors in a unified framework that uses both a model of language production and a model of the typing behavior, and which makes tokenization part of the recovery process. The typing process is modele...

An Improved Token-Based and Starvation Free Distributed Mutual Exclusion Algorithm

Distributed mutual exclusion is a fundamental problem of distributed systems that coordinates access to critical shared resources. It concerns how the various distributed processes access the shared resources in a mutually exclusive manner. This paper presents a fully distributed, improved token-based mutual exclusion algorithm for distributed systems. In this algorithm, a process which...

Triphone Based Continuous Speech Recognition System for Turkish Language Using Hidden Markov Model

This paper introduces a system designed to perform a relatively accurate transcription of speech, in particular continuous speech recognition based on a triphone model for the Turkish language. Turkish generally differs from Indo-European languages (English, Spanish, French, German, etc.) in its agglutinative and suffixing morphology. Therefore, the vocabulary growth rate is very high and...

Text-constrained speaker recognition on a text-independent task

We present an approach to speaker recognition in the text-independent domain of conversational telephone speech using a text-constrained system designed to employ select high-frequency keywords in the speech stream. The system uses speaker word models generated via Hidden Markov Models (HMMs), a departure from the traditional Gaussian Mixture Model (GMM) approach dominant in text-independent wor...

Journal title:
  • CoRR

Volume cmp-lg/9607036  Issue 

Pages  -

Publication date 1996