Learning Balanced Mixtures of Discrete Distributions with Small Sample

Author

  • Shuheng Zhou
Abstract

We study the problem of partitioning a small sample of n individuals, drawn from a mixture of k product distributions over the Boolean cube {0, 1}^K, according to their distributions. Each distribution is described by a vector of allele frequencies in R^K. Given two distributions, we use γ to denote their average ℓ2 distance in frequencies across the K dimensions, which measures the statistical divergence between them. We study the case in which bits are independently distributed across the K dimensions. This work demonstrates that, for a balanced input instance with k = 2, a certain graph-based optimization function returns the correct partition with high probability, so long as K = Ω(ln n/γ) and Kn = Ω̃(ln n/γ²). Here a weighted graph G is formed over the n individuals, with the pairwise Hamming distance between two individuals' bit vectors defining the weight of the edge between them. The function computes a maximum-weight balanced cut of G, where the weight of a cut is the sum of the weights of all edges crossing it. This result demonstrates a nice property of the high-dimensional feature space: one can trade off the number of features required against the size of the sample to accomplish tasks such as clustering.
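The cut objective above can be made concrete with a small sketch. The paper analyzes the optimum of this objective rather than any particular search procedure; the brute-force enumeration below is purely illustrative (feasible only for very small n), and the function names are my own.

```python
from itertools import combinations

def hamming(u, v):
    """Hamming distance between two equal-length bit vectors."""
    return sum(a != b for a, b in zip(u, v))

def max_weight_balanced_cut(samples):
    """Exhaustively find a maximum-weight balanced cut over n bit vectors.

    Edge weight between individuals i and j is the Hamming distance of
    their bit vectors; a balanced cut splits the n individuals into two
    halves, and its weight is the sum over all crossing edges.
    Exponential in n -- illustrative only.
    """
    n = len(samples)
    indices = range(n)
    best_weight, best_side = -1, None
    # Fix individual 0 on the left side to avoid counting each cut twice.
    for side in combinations(indices[1:], n // 2 - 1):
        left = {0, *side}
        weight = sum(
            hamming(samples[i], samples[j])
            for i in left for j in indices if j not in left
        )
        if weight > best_weight:
            best_weight, best_side = weight, left
    return best_weight, sorted(best_side)

# Two noisy clusters: the heaviest balanced cut separates {0, 1} from {2, 3}.
samples = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
weight, left = max_weight_balanced_cut(samples)  # weight 14, left [0, 1]
```

Because within-cluster Hamming distances are small and between-cluster distances are large (in expectation, governed by γ), the maximum-weight balanced cut places each cluster on its own side.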


Similar articles

Learning Mixtures of Discrete Product Distributions using Spectral Decompositions

We study the problem of learning a distribution from samples, when the underlying distribution is a mixture of product distributions over discrete domains. This problem is motivated by several practical applications such as crowdsourcing, recommendation systems, and learning Boolean functions. The existing solutions either heavily rely on the fact that the number of mixtures is finite or have s...


Marginal Likelihood Integrals for Mixtures of Independence Models

Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. Our methods apply to both uniform priors and Dirichlet priors. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of Segre-Vero...


Learning Mixtures of Distributions

Learning Mixtures of Distributions by Kamalika Chaudhuri Doctor of Philosophy in Computer Science University of California, Berkeley Professor Satish Rao, Chair This thesis studies the problem of learning mixtures of distributions, a natural formalization of clustering. A mixture of distributions is a collection of distributions D = {D1, . . .DT}, and mixing weights, {w1, . . . , wT} such that ...


Exact Evaluation of Marginal Likelihood Integrals

Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of Segre-Veronese varieties.


On Spectral Learning of Mixtures of Distributions

We consider the problem of learning mixtures of distributions via spectral methods and derive a tight characterization of when such methods are useful. Specifically, given a mixture-sample, let μi, Ci, wi denote the empirical mean, covariance matrix, and mixing weight of the i-th component. We prove that a very simple algorithm, namely spectral projection followed by single-linkage clustering, ...



Journal:
  • CoRR

Volume: abs/0802.1244

Publication date: 2008