Cardinality Estimation Meets Good-Turing

Authors

Reuven Cohen

Liran Katzir

Aviv Yehezkel

Abstract

Cardinality estimation algorithms receive a stream of elements whose order might be arbitrary, with possible repetitions, and return the number of distinct elements. Such algorithms usually seek to minimize the required storage and processing at the price of inaccuracy in their output. Real-world applications of these algorithms are required to process large volumes of monitored data, making it impractical to collect and analyze the entire input stream. In such cases, it is common practice to sample and process only a small part of the stream elements. This paper presents and analyzes a generic algorithm for combining any cardinality estimation algorithm with a sampling process. We show that the proposed sampling algorithm does not affect the estimator’s asymptotic unbiasedness, and we analyze the sampling effect on the estimator’s variance.
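To make the idea of pairing a cardinality estimator with sampling more concrete, the following is a minimal Python sketch. It is not the authors' algorithm: it assumes Bernoulli sampling at a fixed rate, uses an exact set-based distinct count as a stand-in for any cardinality estimator, and rescales that count with the classic Good-Turing estimate of the unseen probability mass (singletons divided by sample size). The function names `sample_stream`, `unseen_mass`, and `estimate_cardinality` are hypothetical, and the rescaling step is an illustrative assumption that is most reasonable when element frequencies are roughly uniform.

```python
import random
from collections import Counter


def sample_stream(stream, rate, seed=0):
    """Keep each element independently with probability `rate` (Bernoulli sampling)."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [x for x in stream if rng.random() < rate]


def unseen_mass(sample):
    """Good-Turing estimate of the probability mass of unseen elements: N1 / N.

    N1 is the number of elements that occur exactly once in the sample,
    N is the sample size.
    """
    if not sample:
        return 1.0
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


def estimate_cardinality(sample, count_distinct=lambda s: len(set(s))):
    """Rescale the sample's distinct count by the estimated seen mass.

    `count_distinct` is a placeholder for any cardinality estimator; the exact
    set-based count is used here only for clarity. Dividing by (1 - P0) is an
    assumption of this sketch, not a result taken from the paper.
    """
    p0 = unseen_mass(sample)
    if p0 >= 1.0:
        # Every sampled element is a singleton: the sample gives no upper bound.
        return float("inf")
    return count_distinct(sample) / (1.0 - p0)


if __name__ == "__main__":
    # Hypothetical stream: 1000 distinct ids, each repeated 20 times, shuffled.
    stream = [i for i in range(1000) for _ in range(20)]
    random.Random(1).shuffle(stream)
    sample = sample_stream(stream, rate=0.1)
    print("sample size:", len(sample))
    print("estimated cardinality:", estimate_cardinality(sample))
```

Passing a different `count_distinct` callable is one way to swap in a sketch-based estimator while leaving the sampling and rescaling steps unchanged.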


Similar resources

Towards a Cardinality Theorem for Finite Automata

Kummer’s cardinality theorem states that a language is recursive if a Turing machine can exclude for any n words one of the n + 1 possibilities for the number of words in the language. This paper gathers evidence that the cardinality theorem might also hold for finite automata. Three reasons are given. First, Beigel’s nonspeedup theorem also holds for finite automata. Second, the cardinality th...


Quantity Estimation Based on Numerical Cues in the Mealworm Beetle (Tenebrio molitor)

In this study, we used a biologically relevant experimental procedure to ask whether mealworm beetles (Tenebrio molitor) are spontaneously capable of assessing quantities based on numerical cues. Like other insect species, mealworm beetles adjust their reproductive behavior (i.e., investment in mate guarding) according to the perceived risk of sperm competition (i.e., probability that a female ...


Weak Cardinality Theorems for First-Order Logic

Kummer’s cardinality theorem states that a language is recursive if a Turing machine can exclude for any n words one of the n + 1 possibilities for the number of words in the language. It is known that this theorem does not hold for polynomial-time computations, but there is evidence that it holds for finite automata: at least weak cardinality theorems hold for finite automata. This paper shows ...


Sequence Probability Estimation for Large Alphabets

We consider the problem of estimating the probability of an observed string drawn i.i.d. from an unknown distribution. The key feature of our study is that the length of the observed string is assumed to be of the same order as the size of the underlying alphabet. In this setting, many letters are unseen and the empirical distribution tends to overestimate the probability of the observed letter...


Always Good Turing: Asymptotically Optimal Probability Estimation

While deciphering the Enigma code, Good and Turing derived an unintuitive, yet effective, formula for estimating a probability distribution from a sample of data. We define the attenuation of a probability estimator as the largest possible ratio between the per-symbol probability assigned to an arbitrarily long sequence by any distribution, and the corresponding probability assigned by the esti...



Journal title:

Big Data Research

Volume 9

Publication date: 2017
