Learning Balanced Mixtures of Discrete Distributions with Small Sample
Author
Abstract
We study the problem of partitioning a small sample of n individuals, drawn from a mixture of k product distributions over the Boolean cube {0,1}^K, according to their distributions. Each distribution is described by a vector of allele frequencies in R^K. Given two distributions, we use γ to denote the average l_2 distance in frequencies across the K dimensions, which measures the statistical divergence between them. We study the case in which the bits are independently distributed across the K dimensions. For a balanced input instance with k = 2, we show that a graph-based optimization function returns the correct partition with high probability, so long as K = Ω(ln n/γ) and Kn = Ω̃(ln n/γ^2): a weighted graph G is formed over the n individuals, with edge weights given by the pairwise Hamming distances between their bit vectors, and the function computes a maximum-weight balanced cut of G, where the weight of a cut is the sum of the weights of all edges across the cut. This result demonstrates a nice property of the high-dimensional feature space: one can trade off the number of features required against the sample size when accomplishing tasks such as clustering.
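The cut criterion above can be made concrete with a minimal sketch. The following Python code is illustrative only and is not the paper's implementation; names such as balanced_max_cut and hamming_weights are invented here. It builds the Hamming-distance graph and finds the maximum-weight balanced cut by brute force, which is exponential in n and therefore only feasible for tiny samples, but it shows exactly what the optimization function computes.

from itertools import combinations

import numpy as np


def hamming_weights(X):
    """Edge weights: pairwise Hamming distances between the n K-bit rows of X."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = np.sum(X[i] != X[j])
    return W


def balanced_max_cut(X):
    """Brute-force maximum-weight balanced cut of the Hamming-distance graph.
    Exponential in n; meant only to illustrate the criterion on tiny inputs."""
    n = X.shape[0]
    W = hamming_weights(X)
    best_weight, best_side = -1.0, None
    for side in combinations(range(n), n // 2):
        S = set(side)
        weight = sum(W[i, j] for i in S for j in range(n) if j not in S)
        if weight > best_weight:
            best_weight, best_side = weight, S
    return best_side, best_weight


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, n = 200, 10                      # many features, few samples
    p1, p2 = rng.uniform(0.2, 0.8, K), rng.uniform(0.2, 0.8, K)
    X = np.vstack([rng.random(K) < p1 for _ in range(n // 2)] +
                  [rng.random(K) < p2 for _ in range(n // 2)])
    print(balanced_max_cut(X)[0])       # one planted group, or its complement

On a synthetic instance with two well-separated frequency vectors and many more features than samples, the returned side of the cut should coincide with one of the two planted groups (or its complement).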
Similar resources
Learning Mixtures of Discrete Product Distributions using Spectral Decompositions
We study the problem of learning a distribution from samples, when the underlying distribution is a mixture of product distributions over discrete domains. This problem is motivated by several practical applications such as crowdsourcing, recommendation systems, and learning Boolean functions. The existing solutions either heavily rely on the fact that the number of mixtures is finite or have s...
Marginal Likelihood Integrals for Mixtures of Independence Models
Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. Our methods apply to both uniform priors and Dirichlet priors. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of Segre-Vero...
Learning Mixtures of Distributions
Learning Mixtures of Distributions, by Kamalika Chaudhuri, Doctor of Philosophy in Computer Science, University of California, Berkeley; Professor Satish Rao, Chair. This thesis studies the problem of learning mixtures of distributions, a natural formalization of clustering. A mixture of distributions is a collection of distributions D = {D1, . . . , DT} and mixing weights {w1, . . . , wT} such that ...
Exact Evaluation of Marginal Likelihood Integrals
Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of Segre-Veronese varieties.
On Spectral Learning of Mixtures of Distributions
We consider the problem of learning mixtures of distributions via spectral methods and derive a tight characterization of when such methods are useful. Specifically, given a mixture-sample, let μi, Ci, wi denote the empirical mean, covariance matrix, and mixing weight of the i-th component. We prove that a very simple algorithm, namely spectral projection followed by single-linkage clustering, ...
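As a rough, hedged reading of the two-step algorithm named in that last summary (spectral projection followed by single-linkage clustering), the Python sketch below projects the samples onto the top-k singular directions of the centered data and then cuts a single-linkage dendrogram into k clusters. It is illustrative only, not the cited paper's exact procedure, and spectral_single_linkage is an invented name.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def spectral_single_linkage(X, k):
    """Project the rows of X onto the top-k singular directions of the
    centered data, then cut a single-linkage dendrogram into k clusters."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # right singular vectors
    projected = Xc @ Vt[:k].T                           # n x k spectral projection
    Z = linkage(projected, method="single")             # single-linkage hierarchy
    return fcluster(Z, t=k, criterion="maxclust")       # cluster labels in {1, ..., k}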
Journal title: CoRR
Volume: abs/0802.1244
Issue: -
Pages: -
Publication date: 2008