Integrating Background Knowledge Into Text Classification
Abstract
We describe three algorithms that use background knowledge to improve text classifiers. The first uses the background knowledge as an index into the set of training examples. The second uses background knowledge to re-express the training examples. The third treats pieces of background knowledge as unlabeled examples and classifies them. The choice of background knowledge affects each method's performance, and we discuss which type of background knowledge is most useful for each method.

1 Using Background Knowledge

Supervised learning algorithms rely on a corpus of labeled training examples to produce accurate automatic text classifiers. Too few training examples often yields learned models that perform poorly on previously unseen examples. Several approaches have been taken to compensate for a lack of training examples. These include the use of unlabeled examples [Bennett and Demiriz, 1998; Blum and Mitchell, 1998; Nigam et al., 2000; Goldman and Zhou, 2000], the use of test examples [Joachims, 1999], and choosing a small set of specific unlabeled examples to be manually classified [Lewis and Gale, 1994]. Our approach does not assume the availability of either unlabeled examples or test examples. Because of the explosion in the amount of available data, text, databases, and other sources of knowledge related to a text classification task are often readily available from the World Wide Web. We incorporate such "background knowledge" into different learners to improve classification of unknown instances. Readily available external textual resources allow learning systems to model the domain in a way that would be impossible using only a small set of training instances.
For example, if the task is to determine the sub-discipline of physics to which a paper title belongs, background knowledge such as abstracts, physics newsgroups, and perhaps even reviews of physics books can be used by learners to create more accurate classifiers. We present three methods of incorporating background knowledge into the text classification task. Each method uses the corpus of background knowledge in a different way, yet empirically, on a wide variety of text classification tasks, we can show that accuracy on test sets can be improved when background knowledge is incorporated. We ran all three methods on a range of problems from nine different text classification tasks. Details on the data sets can be found at www.cs.csi.cuny.edu/~zelikovi/datasets; the data sets vary in the size of each example, the size of each piece of background knowledge, the number of examples and of items of background knowledge, and the relationship of the background knowledge to the classification task.
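The idea of using background knowledge as an index into the training examples can be illustrated with a small sketch. This is not the authors' implementation: the TF-IDF weighting, the cosine similarity, the scoring rule, and all names (`tfidf_vectors`, `cosine`, `classify_with_background`, and the toy physics data) are assumptions chosen for illustration. A test example is first matched against pieces of background text, and each matching piece then votes for the labels of the training examples it resembles.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    # One sparse TF-IDF dict per document; smoothed IDF is computed over all docs.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    return [
        {t: count * (math.log((1 + n) / (1 + df[t])) + 1.0)
         for t, count in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_with_background(test_text, training, background):
    # training: list of (text, label) pairs; background: list of unlabeled texts.
    texts = [text for text, _ in training]
    vecs = tfidf_vectors(texts + background + [test_text])
    train_vecs = vecs[:len(texts)]
    bg_vecs = vecs[len(texts):-1]
    test_vec = vecs[-1]

    scores = defaultdict(float)
    for bg in bg_vecs:
        link = cosine(test_vec, bg)  # how well this background piece matches the test example
        if link == 0.0:
            continue
        for (_, label), train_vec in zip(training, train_vecs):
            # The background piece bridges the test example to a training example.
            scores[label] += link * cosine(bg, train_vec)
    return max(scores, key=scores.get) if scores else None


# Toy data (made up for illustration): short paper titles as training examples,
# longer abstract-like passages as background knowledge.
training = [
    ("quantum entanglement of photons", "quantum"),
    ("galaxy cluster redshift survey", "astro"),
]
background = [
    "photons and qubits exhibit entanglement in quantum experiments with superconducting circuits",
    "telescope observations show distant galaxy redshift and supernova luminosity",
]

# The test title shares no words with either training title, only with the background.
print(classify_with_background("superconducting qubits", training, background))  # quantum
```

In this toy setup a direct comparison between the test title and the training titles finds no shared terms, so the background passages supply the only link between them; with richer data one would likely also combine this score with the direct similarity.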
Related resources
Integrating Background Knowledge into Nearest-Neighbor Text Classification
This paper describes two different approaches for incorporating background knowledge into nearest-neighbor text classification. Our first approach uses background text to assess the similarity between training and test documents rather than assessing their similarity directly. The second method redescribes examples using Latent Semantic Indexing on the background knowledge, assessing document similari...
Neural Network Based Recognition System Integrating Feature Extraction and Classification for English Handwritten
Handwriting recognition has been one of the active and challenging research areas in the field of image processing and pattern recognition. It has numerous applications, including reading aids for the blind, bank cheques, and conversion of handwritten documents into structured text. Neural Network (NN) with its inherent learning ability offers promising solutions for handwritten characte...
Content-based Text Categorization using Wikitology
The process of text categorization assigns labels or categories to each text document according to the semantic content of the document. Traditional approaches to text categorization used features from the text, such as words, phrases, and concept hierarchies, to represent documents and reduce their dimensionality. Recently, researchers addressed this brittleness by incorporating backgrou...
Improving Spamdexing Detection Via a Two-Stage Classification Strategy
Exploring the Stability of IDF Term Weighting; Completely-Arbitrary Passage Retrieval in Language Modeling Approach; Semantic Discriminative Projections for Image Retrieval; Comparing Dissimilarity Measures for Content-Based Image Retrieval; A Semantic Content-Based Retrieval Method for Histopathology Images; Integrating Background Knowledge into RBF Networks for T...
Incorporating Background Knowledge into Text Classification
It has been shown that prior knowledge and information are organized according to categories, and that background knowledge also plays an important role in classification. The purpose of this study is, first, to investigate the relationship between background knowledge and text classification and, second, to incorporate this relationship in a computational model. Our behavioral results demonstra...
Publication date: 2003