Exploiting shared correlations in probabilistic databases

Authors

  • Prithviraj Sen
  • Amol Deshpande
  • Lise Getoor
Abstract

There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources: sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently represent and query such data. In this work, we show how data characteristics can be leveraged to make the query evaluation process more efficient. In particular, we exploit what we refer to as shared correlations, where the same uncertainties and correlations occur repeatedly in the data. Shared correlations occur mainly for two reasons: (1) uncertainty and correlations usually come from general statistics and rarely vary on a tuple-to-tuple basis; (2) the query evaluation procedure itself tends to re-introduce the same correlations. Prior work has shown that the query evaluation problem on probabilistic databases is equivalent to a probabilistic inference problem on an appropriately constructed probabilistic graphical model (PGM). We leverage this by introducing a new data structure, called the random variable elimination graph (rv-elim graph), that can be built from the PGM obtained from query evaluation. We develop techniques based on bisimulation that can be used to compress the rv-elim graph by exploiting the presence of shared correlations in the PGM; the compressed rv-elim graph can then be used to run inference. We validate our methods by evaluating them empirically and show that even with a few shared correlations significant speed-ups are possible.
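
To make the bisimulation idea concrete, here is a minimal sketch under stated assumptions. It is not the paper's algorithm or its rv-elim graph. It uses a simplified stand-in: a labeled directed graph in which each node carries a label (hypothetically, an identifier for the factor or operation it represents) and a list of children. Nodes are placed in the same block when they have the same label and their children fall into the same set of blocks, which is the standard partition-refinement view of bisimulation. The names `bisimulation_blocks`, `labels`, and `children` are illustrative only.

```python
def bisimulation_blocks(labels, children):
    """Partition the nodes of a labeled graph into bisimulation classes.

    labels:   dict mapping node -> hashable label (hypothetical factor id)
    children: dict mapping node -> list of child nodes (missing means no children)
    Returns a dict mapping node -> integer block id.
    """
    # Initial partition: group nodes by label alone.
    ids = {}
    block = {n: ids.setdefault(labels[n], len(ids)) for n in labels}
    while True:
        # Signature = current block plus the set of blocks of the children.
        ids = {}
        new_block = {}
        for n in labels:
            sig = (block[n], frozenset(block[c] for c in children.get(n, ())))
            new_block[n] = ids.setdefault(sig, len(ids))
        # Each round can only split blocks, so an unchanged count is a fixpoint.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block


# Toy input: x1 and x2 share the same "prior" label (a shared correlation),
# x3 does not, and each join node depends on exactly one of them.
labels = {"x1": "prior", "x2": "prior", "x3": "other",
          "j1": "join", "j2": "join", "j3": "join"}
children = {"j1": ["x1"], "j2": ["x2"], "j3": ["x3"]}
blocks = bisimulation_blocks(labels, children)
```

In this toy input, `j1` and `j2` end up in the same block because they carry the same label and point at interchangeable children, while `j3` stays separate. A compressed graph would keep one representative per block, so repeated structure is processed once.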


Similar resources

Efficient Query Evaluation over Temporally Correlated Probabilistic Streams

Many real-world applications, such as sensor networks and other monitoring applications, naturally generate probabilistic streams that are highly correlated in both time and space. Query processing over such streaming data must be cognizant of these correlations, since they significantly alter the final query results. Several prior works have suggested approaches to handling correlations in proba...


Graphical Models for Uncertain Data

Graphical models are a popular and well-studied framework for compact representation of a joint probability distribution over a large number of interdependent variables, and for efficient reasoning about such a distribution. They have proven useful in a wide range of domains, from natural language processing to computer vision to bioinformatics. In this chapter, we present an approach to us...



Sharing of Probabilistically Correlated Data in Peer-to-Peer Networks

The impact of Peer-to-Peer (P2P) networks on the Internet landscape is undisputed. It has led to a series of new applications, e.g., as part of the so-called Web 2.0. The shift from the classical client-server paradigm of the Internet, with a clear distinction between information providers and consumers, towards consumers sharing information among each other led to the rise of the P2P para...


Scaling Lifted Probabilistic Inference and Learning Via Graph Databases

Over the past decade, exploiting relations and symmetries within probabilistic models has proven to be surprisingly effective at solving large-scale data mining problems. One of the key operations inside these lifted approaches is counting, be it for parameter/structure learning or for efficient inference. Typically, however, they just count exploiting the logical structure using ad hoc oper...



Journal:
  • PVLDB

Volume 1

Publication date: 2008