Topic extraction from extremely short texts with variational manifold regularization

Abstract

With the emergence of massive short texts, e.g., social media posts and question titles from Q&A systems, discovering valuable information from them is increasingly significant for many real-world applications of content analysis. The family of topic models can effectively explore the hidden structures of documents through the assumption of latent topics. However, due to the sparseness of short texts, existing topic models, e.g., latent Dirichlet allocation, lose effectiveness on them. To this end, an effective solution, namely the Dirichlet multinomial mixture (DMM), supposes that each short text is associated with only a single topic, which indirectly enriches document-level word co-occurrences. However, DMM is sensitive to noisy words, so it often learns inaccurate topic representations at the document level. To address this problem, we extend DMM to a novel Laplacian Dirichlet Multinomial Mixture (LapDMM) topic model for short texts. The basic idea of LapDMM is to preserve the local neighborhood structures of short texts, enabling topical signals to spread among neighboring documents so as to correct the inaccurate topic representations. This is achieved by incorporating variational manifold regularization into the variational objective of DMM, constraining close short texts to have similar variational topic representations. To find the nearest neighbors of short texts, we construct an offline document graph before model inference, where the distances between short texts can be computed by the word mover's distance. We further develop an online version of LapDMM, namely Online LapDMM, to achieve inference speedup on massive short texts. To this end, we exploit the spirit of stochastic optimization with mini-batches and an up-to-date document graph that can efficiently find approximate nearest neighbors instead. To evaluate our models, we compare them against state-of-the-art short text topic models on several traditional tasks, i.e., topic quality, document clustering, and classification. The empirical results demonstrate that our models achieve very significant performance gains over the baseline models.
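
The neighborhood-preserving idea in the abstract can be illustrated with a minimal sketch. It assumes a precomputed pairwise distance matrix (a hypothetical stand-in for word mover's distances) and uses the common graph-Laplacian penalty on squared Euclidean differences between topic representations. The function names `knn_graph` and `manifold_penalty`, the toy data, and this exact form of the penalty are illustrative assumptions, not the variational regularizer defined in the paper.

```python
import numpy as np

def knn_graph(dist, k):
    """Symmetric 0/1 adjacency linking each document to its k nearest neighbours."""
    n = dist.shape[0]
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)              # a document is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]   # indices of the k closest documents
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                # keep an edge if either side chose it

def manifold_penalty(theta, W):
    """0.5 * sum_ij W_ij * ||theta_i - theta_j||^2, computed as trace(theta^T L theta)."""
    L = np.diag(W.sum(axis=1)) - W           # unnormalised graph Laplacian
    return float(np.trace(theta.T @ L @ theta))

# Hypothetical distances: documents 0 and 1 are close, as are documents 2 and 3.
dist = np.array([[0, 1, 5, 6],
                 [1, 0, 5, 6],
                 [5, 5, 0, 1],
                 [6, 6, 1, 0]])
W = knn_graph(dist, k=1)

# Topic representations (rows sum to 1). In `noisy`, document 1 disagrees with its neighbour.
smooth = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
noisy = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]])

print(manifold_penalty(smooth, W))  # neighbours agree, so the penalty vanishes
print(manifold_penalty(noisy, W))   # neighbours disagree, so the penalty is positive
```

Adding such a penalty to a training objective pulls the representation of a noisy document toward those of its neighbors, which is the kind of correction the abstract attributes to LapDMM.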

Similar resources

Topic Segmentation for Short Texts

Topic segmentation, which aims to find the boundaries between topic blocks in a text, is an important task for semantic analysis of texts. Although different solutions have been proposed for the task, many limitations and difficulties exist in the approaches. In particular, most of the methods do not work well for cases such as short texts, internet news, and students' writings. In this paper, we f...

Semi-supervised Max-margin Topic Model with Manifold Posterior Regularization

Supervised topic models leverage label information to learn discriminative latent topic representations. As collecting a fully labeled dataset is often time-consuming, semi-supervised learning is of high interest. In this paper, we present an effective semi-supervised max-margin topic model by naturally introducing manifold posterior regularization to a regularized Bayesian topic model, named L...

Enhancing Topic Modeling on Short Texts with Crowdsourcing

Topic modeling is now widely used in text archive analytics to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective at building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. new...

Hot Topic Extraction and Public Opinion Classification of Tibetan Texts

The increasing amount of Tibetan information has made Tibetan text processing popular and highly significant. In this study, Tibetan hot topic extraction and public opinion classification were investigated to accelerate the development of Tibetan information processing. First, Tibetan word segmentation in Tibetan hot topic extraction was presented. Second, feature selection based on term freque...

Topic Modeling over Short Texts by Incorporating Word Embeddings

Inferring topics from the overwhelming amount of short texts has become a critical but challenging task for many content analysis applications, such as content characterization, user interest profiling, and emerging topic detection. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot solve this problem very well since only very limited word co-o...

Journal

Journal title: Machine Learning

Year: 2021

ISSN: 0885-6125, 1573-0565

DOI: https://doi.org/10.1007/s10994-021-05962-3