Text mining with information-theoretic clustering
Abstract
well as in other applications such as bioinformatics, we are interested in the angle between document vectors; hence, it is convenient to consider sets of normalized vectors. A wide variety of clustering algorithms applicable to objects of a general nature exist [1]; of these, the k-means algorithm might be the most widely accepted general clustering technique. In this article we suggest a new two-step clustering procedure that handles unit-norm vectors. One of these steps is based on the k-means clustering algorithm. The quality of the final partition produced by a k-means-type clustering algorithm depends on a good choice of initial partition [2, 3]. In the first step of the procedure, we use a spherical principal directions divisive partitioning (sPDDP) algorithm [3]. Based on singular value decomposition (SVD), this algorithm clusters l2 unit vectors and generates “good” initial partitions for the second step of the procedure. The second step is a modification of a new divisive information-theoretic clustering (DITC) algorithm [4], which uses the Kullback–Leibler (KL) divergence [5] as a similarity measure between two unit l1 vectors with nonnegative coordinates. However, the DITC algorithm resembles the classical batch k-means algorithm and suffers from similar deficiencies. We discuss our modifications to the DITC algorithm to remedy some of its deficiencies. We call the proposed modification the Kullback–Leibler means algorithm, or KL-means.
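To make the second step more concrete, the following is a minimal sketch of one batch k-means-type iteration that uses the KL divergence between l1-normalized, nonnegative vectors. It is not the authors' KL-means or the DITC algorithm itself: the function names `kl_divergence` and `batch_kl_step` are hypothetical, all vectors are given equal weight (the published algorithms may weight them differently), centroids are taken as plain arithmetic means, and every cluster is assumed to stay non-empty.

```python
import numpy as np


def kl_divergence(x, c):
    """KL(x || c) for nonnegative vectors that each sum to 1.

    Uses the convention 0 * log(0) = 0, and returns infinity when c has a
    zero coordinate where x is positive.
    """
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    support = x > 0
    if np.any(c[support] == 0):
        return np.inf
    return float(np.sum(x[support] * np.log(x[support] / c[support])))


def batch_kl_step(X, labels, k):
    """One batch iteration: recompute centroids, then reassign every vector.

    Assumptions (not taken from the source): equal weights for all rows of X,
    centroids are arithmetic means, and each of the k clusters is non-empty.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # The mean of l1-unit nonnegative vectors is again l1-unit and nonnegative.
    centroids = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
    new_labels = np.array(
        [int(np.argmin([kl_divergence(x, c) for c in centroids])) for x in X]
    )
    return centroids, new_labels


# Toy data: four l1-normalized "documents" over three terms.
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
])
centroids, labels = batch_kl_step(X, np.array([0, 1, 0, 1]), k=2)
```

Starting from the poor initial partition `[0, 1, 0, 1]`, a single step on this toy data regroups the vectors as `[0, 0, 1, 1]`. A batch iteration like this only moves to a nearby partition, which illustrates why the initial partition supplied by the first step matters.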
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
SYDE 676 Project Report – Fall 2002 Web Document Clustering Using Phrase-based Document Similarity
Measuring the similarity between documents is an essential operation in text mining, especially document clustering. The traditional method of finding the similarity between documents has always been based on extracting individual words from the documents, and using heuristics to give weights to those features. Standard methods in data mining are then used to find the similarity between documen...
An Efficient Approach Generating Optimized Clusters for Theoretic Clustering Using Data Mining
The aim of the data mining process is to extract information from a large data set and transform it into an understandable structure for further use. Data mining is the process of finding anomalies, patterns and correlations within large data sets to predict outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut costs, improve customer relationships,...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...
A Model for Extracting Information from Text Documents, Based on Text Mining, in the E-Learning Domain
As computer networks become the backbones of science and the economy, enormous quantities of documents become available. So, for extracting useful information from textual data, text mining techniques have been used. Text mining has become an important research area that discovers unknown information, facts or new hypotheses by automatically extracting information from different written documents. T...
Journal:
- Computing in Science and Engineering
Volume 5, Issue
Pages -
Publication date: 2003