Large-Scale Bayesian Logistic Regression for Text Categorization

Authors

  • Alexander Genkin
  • David D. Lewis
  • David Madigan
Abstract

Logistic regression analysis of high-dimensional data, such as natural language text, poses computational and statistical challenges. Maximum likelihood estimation often fails in these applications. We present a simple Bayesian logistic regression approach that uses a Laplace prior to avoid overfitting and produces sparse predictive models for text data. We apply this approach to a range of document classification problems and show that it produces compact predictive models at least as effective as those produced by support vector machine classifiers or ridge logistic regression combined with feature selection. We describe our model fitting algorithm, our open source implementations (BBR and BMR), and experimental results.
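
The link between the Laplace prior and sparsity can be illustrated with a short sketch. Under independent zero-mean Laplace priors on the coefficients, the posterior mode (MAP estimate) coincides with the minimizer of the logistic log-loss plus an L1 penalty, which drives many coefficients to exactly zero. The code below is only an illustration under stated assumptions: it uses plain proximal gradient descent (ISTA), which is a swapped-in technique and not the fitting algorithm described in the paper, and it is not the BBR or BMR software. The names `soft_threshold`, `fit_l1_logistic`, `lam`, and `n_iter` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_l1_logistic(X, y, lam=0.1, n_iter=500):
    """Minimize mean logistic log-loss + lam * ||w||_1 by proximal gradient.

    X: (n, d) feature matrix; y: (n,) labels in {0, 1}.
    The minimizer is the MAP estimate under independent Laplace priors
    (assumption: no intercept term, one shared penalty weight lam).
    """
    n, d = X.shape
    w = np.zeros(d)
    # The gradient of the mean log-loss is Lipschitz with constant
    # ||X||_2^2 / (4 n); 1 / L is a safe fixed step size.
    lipschitz = max(np.linalg.norm(X, 2) ** 2 / (4.0 * n), 1e-12)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        grad = X.T @ (p - y) / n             # gradient of the mean log-loss
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

Calling `fit_l1_logistic(X, y, lam=0.1)` on a matrix where only a few columns carry signal would typically return a weight vector with most entries exactly zero; a larger `lam` corresponds to a tighter Laplace prior and a sparser model.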

Similar resources

Sparse Logistic Regression for Text Categorization

This paper studies regularized logistic regression and its application to text categorization. In particular we examine a Bayesian approach, lasso logistic regression, that simultaneously selects variables and provides regularization. We present an efficient training algorithm for this approach, and show that the resulting classifiers are both compact and have state-of-the-art effectiveness on ...

Bayesian Text Categorization

Natural language processing is an interdisciplinary field of research which studies the problems and possibilities of automated generation and understanding of natural human languages. Text categorization is a central subfield of natural language processing. Automatically assigning categories to digital texts has a wide range of applications in today’s information society—from filtering spam to...

A sparse version of the ridge logistic regression for large-scale text categorization

The ridge logistic regression has successfully been used in text categorization problems and it has been shown to reach the same performance as the Support Vector Machine but with the main advantage of computing a probability value rather than a score. However, the dense solution of the ridge makes its use unpractical for large scale categorization. On the other side, LASSO regularization is ab...

A new term-weighting scheme for naïve Bayes text categorization

Purpose – Automatic text categorization has applications in several domains, for example e-mail spam detection, sexual content filtering, directory maintenance, and focused crawling, among others. Most information retrieval systems contain several components which use text categorization methods. One of the first text categorization methods was designed using a naïve Bayes representation of the...

A flexible Bayesian generalized linear model for dichotomous response data with an application to text categorization

Abstract: We present a class of sparse generalized linear models that include probit and logistic regression as special cases and offer some extra flexibility. We provide an EM algorithm for learning the parameters of these models from data. We apply our method in text classification and in simulated data and show that our method outperforms the logistic and probit models and also the elastic n...

Journal:
  • Technometrics

Volume 49

Publication date: 2007