Cardinality Estimation Meets Good-Turing
Abstract
Cardinality estimation algorithms receive a stream of elements, in arbitrary order and with possible repetitions, and return the number of distinct elements. Such algorithms usually seek to minimize the required storage and processing at the price of some inaccuracy in their output. Real-world applications of these algorithms must process large volumes of monitored data, which makes it impractical to collect and analyze the entire input stream. In such cases, it is common practice to sample and process only a small part of the stream elements. This paper presents and analyzes a generic algorithm for combining any cardinality estimation algorithm with a sampling process. We show that the proposed sampling algorithm does not affect the estimator’s asymptotic unbiasedness, and we analyze the effect of sampling on the estimator’s variance.
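The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a uniform per-element sampling step, an exact set-based counter standing in for any cardinality estimator (for example a HyperLogLog-style sketch), and the classical Good-Turing coverage estimate (one minus the fraction of sampled elements seen exactly once) to scale the sampled count up. All function names are hypothetical.

```python
import random
from collections import Counter


def sample_stream(stream, rate, seed=0):
    """Keep each stream element independently with probability `rate`."""
    rng = random.Random(seed)
    return [x for x in stream if rng.random() < rate]


def good_turing_coverage(sample):
    """Good-Turing estimate of the probability mass of elements seen in the sample.

    The unseen mass is estimated as (number of singletons) / (sample size),
    so the coverage is one minus that ratio.
    """
    n = len(sample)
    if n == 0:
        return 0.0
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n


def exact_distinct(sample):
    """Stand-in for any cardinality estimator applied to the sampled elements."""
    return len(set(sample))


def estimate_cardinality(sample, distinct_estimator=exact_distinct):
    """Scale the sampled distinct count by the estimated coverage.

    Assumption: dividing by the coverage is one simple way to combine the two
    estimates; it is not claimed to be the combination analyzed in the paper.
    """
    coverage = good_turing_coverage(sample)
    if coverage <= 0.0:
        raise ValueError("every sampled element is a singleton; coverage is zero")
    return distinct_estimator(sample) / coverage


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical stream: 20000 elements drawn from 1000 possible identifiers.
    stream = [rng.randrange(1000) for _ in range(20000)]
    sample = sample_stream(stream, rate=0.1)
    print("sampled elements:", len(sample))
    print("estimated distinct:", estimate_cardinality(sample))
    print("true distinct:", len(set(stream)))
```

Dividing by the coverage tends to work best when elements occur with roughly equal frequency; on heavily skewed streams this simple scaling is likely to underestimate the true cardinality.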
Similar resources
Towards a Cardinality Theorem for Finite Automata
Kummer’s cardinality theorem states that a language is recursive if a Turing machine can exclude for any n words one of the n + 1 possibilities for the number of words in the language. This paper gathers evidence that the cardinality theorem might also hold for finite automata. Three reasons are given. First, Beigel’s nonspeedup theorem also holds for finite automata. Second, the cardinality th...
Quantity Estimation Based on Numerical Cues in the Mealworm Beetle (Tenebrio molitor)
In this study, we used a biologically relevant experimental procedure to ask whether mealworm beetles (Tenebrio molitor) are spontaneously capable of assessing quantities based on numerical cues. Like other insect species, mealworm beetles adjust their reproductive behavior (i.e., investment in mate guarding) according to the perceived risk of sperm competition (i.e., probability that a female ...
Weak Cardinality Theorems for First-Order Logic
Kummer’s cardinality theorem states that a language is recursive if a Turing machine can exclude for any n words one of the n + 1 possibilities for the number of words in the language. It is known that this theorem does not hold for polynomial-time computations, but there is evidence that it holds for finite automata: at least weak cardinality theorems hold for finite automata. This paper shows ...
Sequence Probability Estimation for Large Alphabets
We consider the problem of estimating the probability of an observed string drawn i.i.d. from an unknown distribution. The key feature of our study is that the length of the observed string is assumed to be of the same order as the size of the underlying alphabet. In this setting, many letters are unseen and the empirical distribution tends to overestimate the probability of the observed letter...
Always Good Turing: Asymptotically Optimal Probability Estimation
While deciphering the Enigma code, Good and Turing derived an unintuitive, yet effective, formula for estimating a probability distribution from a sample of data. We define the attenuation of a probability estimator as the largest possible ratio between the per-symbol probability assigned to an arbitrarily long sequence by any distribution, and the corresponding probability assigned by the esti...
Journal title:
- Big Data Research
Volume 9, Issue
Pages -
Publication date 2017