Exploiting Features for Data Source Quality Estimation

Authors

  • Manas Joglekar
  • Theodoros Rekatsinas
  • Hector Garcia-Molina
  • Aditya G. Parameswaran
  • Christopher Ré
Abstract

We revisit data fusion, i.e., the problem of integrating noisy data from multiple sources by estimating the source accuracies, and show that the simple model of logistic regression can capture most existing approaches for solving data fusion. This allows us to put data fusion on a solid statistical footing and obtain solutions with rigorous theoretical guarantees. Expanding on logistic regression, we introduce SLiMFast, a framework that converts data fusion to a learning and inference problem over discriminative probabilistic models. In contrast to previous approaches that rely on complex generative models, discriminative models allow us to decouple the specification of a data fusion model from the algorithm used to learn the model’s parameters. This allows us to extend data fusion to take into account domain-specific features that are indicative of the accuracy of data sources, and design data fusion approaches that yield source accuracy estimates with 5× lower error than competing baselines. We also design an optimizer to automatically select the best algorithm for learning the model’s parameters. We validate our optimizer on multiple real datasets and show that it chooses the best algorithm for learning in almost all cases.
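The idea of estimating source accuracy from domain-specific features with logistic regression can be illustrated with a small sketch. This is not the SLiMFast implementation. It assumes a toy setting in which each source has a feature vector and some of its claims have been checked against known ground truth. The names `fit_source_accuracy`, `predict_accuracy`, the single `has_citations` feature, and all data values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_source_accuracy(features, observations, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent.

    features: dict mapping source name -> list of float features (assumed).
    observations: list of (source name, is_correct) pairs, where is_correct
        is 1 if the source's claim matched known ground truth, else 0.
    Returns (weights, bias).
    """
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    b = 0.0
    n = len(observations)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for source, is_correct in observations:
            x = features[source]
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - is_correct  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_accuracy(w, b, x):
    """Estimated probability that a source with features x is correct."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical data: one binary feature, "has_citations".
features = {"A": [1.0], "B": [1.0], "C": [0.0], "D": [0.0]}
observations = (
    [("A", 1)] * 9 + [("A", 0)] * 1
    + [("B", 1)] * 8 + [("B", 0)] * 2
    + [("C", 1)] * 4 + [("C", 0)] * 6
    + [("D", 1)] * 5 + [("D", 0)] * 5
)

w, b = fit_source_accuracy(features, observations)

# A new source with no checked claims still gets an estimate from its features.
acc_with_citations = predict_accuracy(w, b, [1.0])
acc_without_citations = predict_accuracy(w, b, [0.0])
print(round(acc_with_citations, 2), round(acc_without_citations, 2))
```

Because accuracy is tied to features rather than to individual sources, a source with no labeled claims can still receive an accuracy estimate, which is the kind of benefit the abstract attributes to feature-aware data fusion.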

Similar resources

Document-level translation quality estimation: exploring discourse and pseudo-references

Predicting the quality of machine translations is a challenging topic. Quality estimation (QE) of translations is based on features of the source and target texts (without the need for human references), and on supervised machine learning methods to build prediction models. Engineering well-performing features is therefore crucial in QE modelling. Several features have been used so far, but the...

Estimation of kinematic source parameters and frequency independent shear wave quality factor around Bushehr

In this paper, the shear wave quality factor and source parameters in the near field are estimated by analyzing acceleration data from the Zagros region. Accelerograms recorded by the strong ground motion network of the Building and Housing Research Center have been used. The data cover events of magnitude 4.7 to 6.3 recorded from 1999 to 2014. In this approach, the theoretical S-wave displ...

TranscRater: a Tool for Automatic Speech Recognition Quality Estimation

We present TranscRater, an open-source tool for automatic speech recognition (ASR) quality estimation (QE). The tool allows users to perform ASR evaluation bypassing the need of reference transcripts and confidence information, which is common to current assessment protocols. TranscRater includes: i) methods to extract a variety of quality indicators from (signal, transcription) pairs and ii) m...

On the mutual information of glottal source estimation techniques for the automatic detection of speech pathologies

detection of speech pathologies by exploiting the estimation of the glottal source. Three methods of estimation are compared and time and spectral features are extracted. The relevancy of these features is assessed by means of information theory-based measures. This allows an intuitive interpretation in terms of discrimination power and redundancy between the features. It is discussed which fea...

Video quality monitoring for mobile multicast peers using distributed source coding

We consider a peer-to-peer multicast video streaming system in which untrusted intermediaries transcode video streams for heterogeneous mobile peers. Many different legitimate versions of the video might exist. However, there is the risk that the untrusted intermediaries might tamper with the video content. Quality estimation and tampering detection are important in this scenario. We propose th...

Journal:
  • CoRR

Volume: abs/1512.06474

Publication date: 2015