Semisupervised Learning for Computational Linguistics
Authors

Abstract
Semi-supervised learning is by no means an unfamiliar concept to natural language processing researchers. Labeled data has been used to improve unsupervised parameter estimation procedures such as the EM algorithm and its variants since the beginning of the statistical revolution in NLP (e.g., Pereira and Schabes 1992). Unlabeled data has also been used to improve supervised learning procedures, the most notable examples being the successful applications of self-training and co-training to word sense disambiguation (Yarowsky 1995) and named entity classification (Collins and Singer 1999). Despite its increasing importance, semi-supervised learning is not a topic that is typically discussed in introductory machine learning texts (e.g., Mitchell 1997; Alpaydin 2004) or NLP texts (e.g., Manning and Schütze 1999; Jurafsky and Martin 2000). Consequently, to learn about semi-supervised learning research, one has to consult the machine-learning literature. This can be a daunting task for NLP researchers who have little background in machine learning.

Steven Abney’s book Semisupervised Learning for Computational Linguistics is targeted precisely at such researchers, aiming to provide them with a “broad and accessible presentation” of topics in semi-supervised learning. According to the preamble, the reader is assumed to have taken only an introductory course in NLP “that includes statistical methods — concretely the material contained in Jurafsky and Martin (2000) and Manning and Schütze (1999).” Nonetheless, I agree with the author that any NLP researcher who has a solid background in machine learning is ready to “tackle the primary literature on semisupervised learning, and will probably not find this book particularly useful” (page 11). As the author promises, the book is self-contained and quite accessible to those who have little background in machine learning. In particular, of the 12 chapters in the book, three are devoted to preparatory material: a brief introduction to machine learning; basic unconstrained and constrained optimization techniques (e.g., gradient descent and the method of Lagrange multipliers); and relevant linear-algebra concepts (e.g., eigenvalues, eigenvectors, matrix and vector norms, diagonalization). The remaining chapters focus roughly on six types of semi-supervised learning methods:
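As a concrete illustration of the self-training idea mentioned in the abstract (the bootstrapping scheme behind Yarowsky 1995), here is a minimal sketch of the generic loop. It is an illustrative reconstruction, not code from the book; the classifier choice, confidence threshold, and variable names are assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
        """Generic self-training: repeatedly train on the labeled set and
        absorb the unlabeled examples the model labels most confidently."""
        X_lab, y_lab, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        for _ in range(max_rounds):
            if len(pool) == 0:
                break
            probs = clf.predict_proba(pool)
            confident = probs.max(axis=1) >= threshold
            if not confident.any():
                break  # nothing is confident enough; stop growing the set
            X_lab = np.vstack([X_lab, pool[confident]])
            y_lab = np.concatenate(
                [y_lab, clf.classes_[probs[confident].argmax(axis=1)]])
            pool = pool[~confident]
            clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        return clf

Yarowsky-style bootstrapping and co-training can be read as elaborations of this loop that constrain which pseudo-labels are trusted, via decision-list rules or a second feature view, respectively.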
Similar Resources
Book Reviews: Semisupervised Learning for Computational Linguistics by Steven Abney
Semi-supervised learning is by no means an unfamiliar concept to natural language processing researchers. Labeled data has been used to improve unsupervised parameter estimation procedures such as the EM algorithm and its variants since the beginning of the statistical revolution in NLP (e.g., Pereira and Schabes 1992). Unlabeled data has also been used to improve supervised learning procedures...
Analysis of Semi-Supervised Learning with the Yarowsky Algorithm
The Yarowsky algorithm is a rule-based semisupervised learning algorithm that has been successfully applied to some problems in computational linguistics. The algorithm was not mathematically well understood until Abney (2004), which analyzed some specific variants of the algorithm and also proposed some new algorithms for bootstrapping. In this paper, we extend Abney’s work and show that some ...
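For readers unfamiliar with the algorithm discussed in this excerpt, the following is a schematic sketch of a Yarowsky-style decision-list bootstrapping loop. It is a simplified illustration; the smoothing constant, rule-score threshold, and seed-rule format are assumptions, not the exact variants analyzed by Abney (2004).

    from collections import defaultdict
    from math import log

    def train_decision_list(examples, labels, alpha=0.1):
        """Score each (feature, label) rule by a smoothed log-likelihood ratio."""
        counts = defaultdict(lambda: defaultdict(float))
        for feats, y in zip(examples, labels):
            for f in feats:
                counts[f][y] += 1.0
        rules = []
        for f, by_label in counts.items():
            total = sum(by_label.values())
            for y, c in by_label.items():
                rules.append((log((c + alpha) / (total - c + alpha)), f, y))
        rules.sort(key=lambda r: -r[0])  # most reliable rules first
        return rules

    def apply_rules(rules, feats, threshold=1.0):
        for score, f, y in rules:
            if score < threshold:
                break
            if f in feats:
                return y
        return None  # abstain if no sufficiently reliable rule fires

    def yarowsky(examples, seed_rules, rounds=5):
        """Relabel the corpus with the current rules, retrain, and repeat.
        seed_rules is a small hand-built list of (score, feature, label) tuples."""
        rules = seed_rules
        for _ in range(rounds):
            labeled = [(x, apply_rules(rules, x)) for x in examples]
            labeled = [(x, y) for x, y in labeled if y is not None]
            if not labeled:
                break
            rules = train_decision_list(*zip(*labeled))
        return rules

Abney (2004) studies which variants of this relabel-and-retrain step can be shown to optimize a well-defined objective.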
Web Corpus Construction
The Web is the main source of data in modern computational linguistics. Other volumes in the same series, for example, Introductions to Opinion Mining (Liu 2012) and Semisupervised Machine Learning (Søgaard 2013), start their problem statements by referring to data from the Web. This volume starts its own introduction by praising Web corpora for their size, ease of construction, and availabilit...
Expanding textual entailment corpora from Wikipedia using co-training
In this paper we propose a novel method to automatically extract large textual entailment datasets homogeneous to existing ones. The key idea is the combination of two intuitions: (1) the use of Wikipedia to extract a large set of textual entailment pairs; (2) the application of semisupervised machine learning methods to make the extracted dataset homogeneous to the existing ones. We report emp...
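As a rough illustration of the co-training component described in this excerpt, a minimal two-view sketch follows. The feature views, base classifier, and growth schedule are hypothetical; the paper’s actual entailment features and settings are not reproduced here.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def co_train(X1, X2, y, X1_pool, X2_pool, rounds=5, per_round=10):
        """Two classifiers, one per feature view, teach each other: each round,
        every view's most confident pool predictions are added to the shared
        labeled set and both classifiers are retrained."""
        pool = np.arange(len(X1_pool))
        for _ in range(rounds):
            c1 = GaussianNB().fit(X1, y)
            c2 = GaussianNB().fit(X2, y)
            if len(pool) == 0:
                break
            proposals = {}
            for clf, Xp in ((c1, X1_pool), (c2, X2_pool)):
                probs = clf.predict_proba(Xp[pool])
                top = np.argsort(probs.max(axis=1))[-per_round:]
                labels = clf.classes_[probs[top].argmax(axis=1)]
                for i, lab in zip(pool[top], labels):
                    proposals.setdefault(int(i), lab)  # first proposal wins
            new_idx = np.array(sorted(proposals))
            pseudo = np.array([proposals[i] for i in new_idx])
            X1 = np.vstack([X1, X1_pool[new_idx]])
            X2 = np.vstack([X2, X2_pool[new_idx]])
            y = np.concatenate([y, pseudo])
            pool = np.setdiff1d(pool, new_idx)
        return c1, c2

In Blum and Mitchell’s original formulation the two views are assumed to be conditionally independent given the label; this excerpt does not detail the views chosen for the entailment task.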
Robust Semi-Supervised Learning through Label Aggregation
Semi-supervised learning is proposed to exploit both labeled and unlabeled data. However, as the scale of data in real-world applications increases significantly, conventional semisupervised algorithms usually lead to massive computational cost and cannot be applied to large-scale datasets. In addition, label noise is usually present in practical applications due to human annotation, which ...
Publication date: 2009