Visual speech recognition: aligning terminologies for better understanding
Abstract
This is an exciting time for machine lipreading. Traditional research stemmed from the adaptation of audio recognition systems, but the computer vision community is now also participating. The joining of these two previously disparate areas, each with its own perspective on computer lipreading, is creating opportunities for collaboration. At the same time, knowledge sharing in the literature is hampered by multiple uses of terms and phrases and by the range of methods for scoring results. To improve communication between those researching lipreading, we highlight three areas in particular: the effects of interchanging speech reading and lipreading; speaker dependence across train, validation, and test splits; and the use of accuracy, correctness, errors, and varying units (phonemes, visemes, words, and sentences) to measure system performance. We make recommendations as to how we can be more consistent.
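To illustrate why mixing correctness and accuracy can confuse comparisons, the sketch below scores one hypothesis against a reference under both measures. The abstract does not define them; the code assumes the convention common in speech recognition toolkits, where correctness is (N - D - S) / N and accuracy is (N - D - S - I) / N, with N reference units, S substitutions, D deletions, and I insertions. The function names `align_counts` and `correctness_and_accuracy` are hypothetical, and the units may be phonemes, visemes, or words.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # table[i][j] = (total edits, S, D, I) for ref[:i] aligned with hyp[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)      # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)      # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, ins)           # match, no cost
            else:
                diagonal = (e + 1, s + 1, d, ins)   # substitution
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            # Tuples compare on total edits first, so this keeps a minimum-edit path.
            table[i][j] = min(diagonal, deletion, insertion)
    return table[n][m][1:]


def correctness_and_accuracy(ref, hyp):
    """Assumed definitions: correctness ignores insertions, accuracy penalises them."""
    if not ref:
        raise ValueError("reference must contain at least one unit")
    s, d, i = align_counts(ref, hyp)
    n = len(ref)
    return (n - d - s) / n, (n - d - s - i) / n


reference = ["a", "b", "c", "d"]
hypothesis = ["a", "x", "c", "d", "e"]   # one substitution, one insertion
print(correctness_and_accuracy(reference, hypothesis))  # (0.75, 0.5)
```

The same output sequence scores 75% under one measure and 50% under the other, so a reported figure is only comparable across papers when both the measure and the unit are stated.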
Similar resources
Visual Speech: A Physiological or Behavioural Biometric?
This paper addresses an issue concerning the current classification of biometrics into either physiological or behavioural. We offer clarification on this issue and propose additional qualifications for a biometric to be classed as behavioural. It is observed that dynamics play a key role in the qualification of these terminologies. These are illustrated by practical experiments based around vi...
Tightly integrated spoken language understanding using word-to-concept translation
This paper discusses an integrated spoken language understanding method using a statistical translation model from words to semantic concepts. The translation model is an N-gram-based model that can easily be integrated with speech recognition. It can be trained using annotated corpora where only sentence-level alignments between word sequences and concept sets are available, by automatic alignm...
Fast Automatic Alignment of Video and Text for Search/Names and Faces
We propose a novel way of aligning the audio/video and text streams, which is faster than conventional speech recognition, and requires no supervision. Multimedia of this form includes news broadcast with summaries, parliament proceedings and court trials with transcripts, etc. In addition to applications to video search using the text based indexing, we also show how we can annotate the video ...
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation
A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...
Journal: CoRR
Volume: abs/1710.01292
Publication date: 2017