A provable SVD-based algorithm for learning topics in dominant admixture corpus

Authors

  • Trapit Bansal
  • Chiranjib Bhattacharyya
  • Ravi Kannan
Abstract

Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from a collection of documents drawn from such admixtures is NP-hard. Making a strong assumption called separability, [4] gave the first provable algorithm for inference. For the widely used LDA model, [6] gave a provable algorithm using tensor methods. But [4, 6] do not learn topic vectors with bounded l1 error (a natural measure for probability vectors). Our aim is to develop a model that makes intuitive and empirically supported assumptions, and to design an algorithm with natural, simple components such as SVD that provably solves the inference problem for the model with bounded l1 error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic-specific Catchwords: a group of words that occur with strictly greater frequency in one topic than in any other topic individually, and that are required to have high frequency together rather than individually. A major contribution of the paper is to show that under this more realistic assumption, which is empirically verified on real corpora, a singular value decomposition (SVD) based algorithm with a crucial pre-processing step of thresholding can provably recover the topics from a collection of documents drawn from Dominant admixtures. Dominant admixtures are convex combinations of distributions in which one distribution has a significantly higher contribution than the others. Apart from the simplicity of the algorithm, the sample complexity has near-optimal dependence on w0, the lowest probability that a topic is dominant, and is better than that of [4]. Empirical evidence shows that on several real-world corpora, both the Catchwords and Dominant admixture assumptions hold, and the proposed algorithm substantially outperforms the state of the art [5].
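The pipeline named in the abstract (thresholding, then SVD, applied to documents that each have one dominant topic) can be illustrated with a rough sketch. This is not the authors' procedure: the abstract does not specify the threshold rule, the clustering step, or the final topic estimate, so the per-word mean threshold, the k-means step, the cluster-averaging estimate, the synthetic corpus, and the names `synthetic_corpus` and `threshold_svd_topics` are all assumptions made for illustration.

```python
import itertools
import numpy as np


def synthetic_corpus(rng, d=30, k=3, n=300, m=400, dominant=0.8):
    """Toy corpus (assumed setup): each topic has its own block of catchwords,
    and every document gives weight `dominant` to exactly one topic."""
    block = d // k
    M = np.full((d, k), 0.1 / (d - block))           # small mass off the block
    for j in range(k):
        M[j * block:(j + 1) * block, j] = 0.9 / block  # catchword block of topic j
    dom = rng.integers(k, size=n)                     # dominant topic per document
    W = np.full((k, n), (1 - dominant) / (k - 1))
    W[dom, np.arange(n)] = dominant
    probs = M @ W
    probs = probs / probs.sum(axis=0)                 # guard against rounding drift
    counts = [rng.multinomial(m, probs[:, i]) for i in range(n)]
    A = np.stack(counts, axis=1) / m                  # (d, n) word frequencies
    return A, M, dom


def threshold_svd_topics(A, k, n_iter=20):
    """Threshold, project with a rank-k SVD, cluster, and average.
    Every step here is an illustrative assumption, not the paper's method."""
    # Step 1 (assumed rule): binarize each word against its mean frequency.
    zeta = A.mean(axis=1, keepdims=True)
    B = (A > zeta).astype(float)
    # Step 2: project the thresholded documents onto the top-k left singular vectors.
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    P = U[:, :k].T @ B                                # (k, n)
    # Step 3 (assumed): Lloyd's k-means with farthest-point initialization.
    centers = [P[:, 0]]
    for _ in range(1, k):
        dist = np.min([np.sum((P - c[:, None]) ** 2, axis=0) for c in centers], axis=0)
        centers.append(P[:, np.argmax(dist)])
    C = np.stack(centers, axis=1)                     # (k, k), one center per column
    for _ in range(n_iter):
        d2 = ((P[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)  # (n, k)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                   # keep old center if cluster is empty
                C[:, j] = P[:, labels == j].mean(axis=1)
    # Step 4 (assumed): estimate each topic as the mean raw frequency in its cluster.
    topics = np.stack([A[:, labels == j].mean(axis=1) for j in range(k)], axis=1)
    return topics, labels
```

Calling `threshold_svd_topics(A, k=3)` on the output of `synthetic_corpus(np.random.default_rng(0))` returns one estimated word distribution per cluster together with the cluster label of each document. The cluster averages are only coarse estimates, since each document still mixes in the non-dominant topics.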

Similar resources

Supplementary for “A provable SVD-based algorithm for learning topics in dominant admixture corpus”

Informally, a document is said to be drawn from a Dominant Admixture if the document has one dominant topic. Besides its simplicity, we show empirical evidence from real corpora to demonstrate that topic dominance is a reasonable assumption. The Dominant Topic assumption is weaker than the Pure Topic assumption. More importantly SVD based procedures proposed by [2] will not apply. Inspired by t...

Building Topic Models Based on Anchor Words

Suppose you were given a stack of documents, such as all of the articles published in a particular newspaper, and your goal was to make sense of this data, to determine topics that this data may be made up from. To frame this as an unsupervised learning problem, suppose the documents were written in a foreign language and came from a foreign planet. By understanding topics that these documents ...

From Correlation to Hierarchy: Practical Topic Modeling via Spectral Inference

Topic models were originally applied in text analysis for extracting high-level themes from documents, but they work equally well in any setting where users select items from an inventory. Recent work in spectral topic modeling has provided algorithms that operate only on easily-collected summary statistics, rather than exhaustively iterating over the full dataset. The “anchor word” algorithms ...

Research on Color Watermarking Algorithm Based on RDWT-SVD

In this paper, a color image watermarking algorithm based on Redundant Discrete Wavelet Transform (RDWT) and Singular Value Decomposition (SVD) is proposed. The new algorithm selects blue component of a color image to carry the watermark information since the Human Visual System (HVS) is least sensitive to it. To increase the robustness especially towards affine attacks, RDWT is adopted for its...

A Practical Algorithm for Topic Modeling with Provable Guarantees

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but th...

Publication date: 2014