An Augmented Multi-tiered Classifier for Instantaneous Multi-modal Voice Activity Detection

Authors

  • M. Burlick
  • D. Dimitriadis
  • E. Zavesky
Abstract

As mobile devices, intelligent displays, and home entertainment systems permeate digital markets, the desire for users to interact through spoken and visual modalities similarly grows. Previous interactive systems limit voice activity detection (VAD) to the acoustic domain alone, but the incorporation of visual features has shown great improvement in performance accuracy. When employing both acoustic and visual (AV) information, a central recurring question is "how does one efficiently fuse modalities?" This work combines the traditional approaches of feature fusion and decision fusion: independent modality classifiers produce intermediary decisions, which are then combined with the raw features as inputs to a second-stage classifier. Our augmented multi-tier classification system concatenates the outputs of a set of base classifiers with the original fused features for a final classifier. Experiments over various noise conditions show average relative improvements of 4.1-5% on the CUAVE [1] dataset and 2.5-11% on the MOBIO [2] dataset, using majority voters and LDA respectively.
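
The augmented multi-tier scheme can be read as a stacking-style pipeline: per-modality base classifiers emit intermediary speech/non-speech decisions, and those decisions are concatenated with the original fused AV feature vector before a final classifier is trained. The following is a minimal illustrative sketch, not the authors' implementation; the feature dimensions, classifier choices, and scikit-learn usage are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy stand-ins for per-frame acoustic and visual features (shapes are
# illustrative assumptions, not the features used in the paper).
rng = np.random.default_rng(0)
X_audio = rng.normal(size=(1000, 13))    # e.g. MFCC-like acoustic features
X_video = rng.normal(size=(1000, 20))    # e.g. lip-region visual features
y = rng.integers(0, 2, size=1000)        # speech / non-speech labels

X_fused = np.hstack([X_audio, X_video])  # early (feature) fusion

# Tier 1: independent modality classifiers produce intermediary decisions.
audio_clf = SVC(probability=True).fit(X_audio, y)
video_clf = SVC(probability=True).fit(X_video, y)
p_audio = audio_clf.predict_proba(X_audio)[:, [1]]
p_video = video_clf.predict_proba(X_video)[:, [1]]

# Augmented tier 2: concatenate the base-classifier outputs with the
# original fused features and train the final classifier (LDA here, as
# one of the second-stage options mentioned in the abstract).
X_augmented = np.hstack([X_fused, p_audio, p_video])
final_clf = LinearDiscriminantAnalysis().fit(X_augmented, y)

print(final_clf.predict(X_augmented[:5]))
```

In practice the tier-1 outputs would be generated on held-out data (cross-validated stacking) rather than on the training set, as done here for brevity.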

Similar articles

Decision fusion by boosting method for multi-modal voice activity detection

In this paper, we propose a multi-modal voice activity detection (VAD) system that uses audio and visual information. In multi-modal (speech) signal processing, there are two methods for fusing the audio and the visual information: concatenating the audio and visual features, or employing audio-only and visual-only classifiers and then fusing the unimodal decisions. We investigate the effectivenes...
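
As a rough illustration of the two fusion strategies contrasted above (not this paper's boosting-based system; the data shapes and classifiers are assumptions), feature fusion trains a single classifier on concatenated AV features, while decision fusion trains unimodal classifiers and then combines their outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_audio = rng.normal(size=(500, 13))   # illustrative acoustic features
X_video = rng.normal(size=(500, 20))   # illustrative visual features
y = rng.integers(0, 2, size=500)       # speech / non-speech labels

# Feature fusion: concatenate modalities and train one classifier.
fused_clf = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Decision fusion: train audio-only and visual-only classifiers, then
# combine the unimodal decisions (a simple average of posteriors here;
# the paper above learns the combination with boosting).
a_clf = LogisticRegression(max_iter=1000).fit(X_audio, y)
v_clf = LogisticRegression(max_iter=1000).fit(X_video, y)
p = 0.5 * (a_clf.predict_proba(X_audio)[:, 1] + v_clf.predict_proba(X_video)[:, 1])
decision_fused = (p > 0.5).astype(int)
```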

Cross-Modal Supervision for Learning Active Speaker Detection in Video

In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion, facial expressions, and gesticulations associated with speaking. We further improve a generic model for acti...

Towards Speaker Detection using Lips Movements for Human-Machine Multiparty Dialogue

This paper explores the use of lip movements for the purpose of speaker and voice activity detection, a task that is essential in multi-modal multiparty human-machine dialogue. The task aims at detecting who is speaking and when, out of a set of persons. A multiparty dialogue consisting of 4 speakers is audiovisually recorded and then annotated for speaker and speech/silence segments. L...

On the improvement of multimodal voice activity detection

As mobile devices, intelligent displays, and home entertainment systems permeate digital markets, the desire for users to interact through spoken and visual modalities similarly grows. Previous interactive systems limit voice activity detection (VAD) to the acoustic domain alone, but the incorporation of visual features has shown great improvement in performance accuracy. When employing both ac...

Damage detection of multi-girder bridge superstructure based on the modal strain approaches

The research described in this paper focuses on the application of modal strain techniques on a multi-girder bridge superstructure with the objectives of identifying the presence of damage and detecting false damage diagnosis for such structures. The case study is a one-third scale model of a slab-on-girder composite bridge superstructure, comprised of a steel-free concrete deck with FRP rebars...

Journal:

Volume:   Issue:

Pages:   -

Publication date: 2012