WordNet-Based Text Document Clustering

Authors

  • Julian Sedding
  • Dimitar Kazakov
Abstract

Text document clustering can greatly simplify browsing large collections of documents by reorganizing them into a smaller number of manageable clusters. Algorithms to solve this task exist; however, the algorithms are only as good as the data they work on. Problems include ambiguity and synonymy, the former allowing for erroneous groupings and the latter causing similarities between documents to go unnoticed. In this research, naïve, syntax-based disambiguation is attempted by assigning each word a part-of-speech tag and by enriching the ‘bag-of-words’ data representation often used for document clustering with synonyms and hypernyms from WordNet.
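
The enrichment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `TOY_LEXICON`, `enrich`, `overlap`, and the `depth` parameter are hypothetical names, and the small hand-written dictionary only stands in for WordNet (real WordNet entries differ). Tokens are assumed to arrive already tagged with a part of speech.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet. It maps a (lemma, pos)
# pair to its synonyms and to a hypernym chain ordered from specific to general.
TOY_LEXICON = {
    ("car", "n"): {"synonyms": ["automobile"], "hypernyms": ["motor_vehicle", "vehicle"]},
    ("automobile", "n"): {"synonyms": ["car"], "hypernyms": ["motor_vehicle", "vehicle"]},
    ("truck", "n"): {"synonyms": [], "hypernyms": ["motor_vehicle", "vehicle"]},
    ("drive", "v"): {"synonyms": ["motor"], "hypernyms": ["travel"]},
}


def enrich(tagged_tokens, depth=1):
    """Build a bag of (term, pos) counts, adding synonyms and up to `depth` hypernyms."""
    bag = Counter()
    for word, pos in tagged_tokens:
        bag[(word, pos)] += 1
        # The part-of-speech tag restricts the lookup, so "drive" as a noun
        # does not pick up the synonyms of "drive" as a verb.
        entry = TOY_LEXICON.get((word, pos))
        if entry is None:
            continue
        for term in entry["synonyms"] + entry["hypernyms"][:depth]:
            bag[(term, pos)] += 1
    return bag


def overlap(bag_a, bag_b):
    """Count shared terms between two bags (multiset intersection size)."""
    return sum((bag_a & bag_b).values())


doc_a = [("car", "n"), ("drive", "v")]
doc_b = [("automobile", "n"), ("truck", "n")]

plain_overlap = overlap(Counter(doc_a), Counter(doc_b))
rich_overlap = overlap(enrich(doc_a), enrich(doc_b))
print(plain_overlap, rich_overlap)
```

In this toy setup the two documents share no terms in the plain representation, but after enrichment they overlap on the synonym pair and the common hypernym, which is the kind of hidden similarity the abstract refers to.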

Similar resources

A Novel Document Representation Model for Clustering

Text document clustering plays an important role in providing better document retrieval, document browsing and text mining. Traditionally, clustering techniques do not consider the semantic relationships between words, such as synonymy and hypernymy. Existing clustering techniques are based on the syntactic structure of the document. To exploit semantic relationships, WordNet has been used to improve clu...

Enhancing Text Document Clustering Using Non-negative Matrix Factorization and WordNet

A classic document clustering technique may incorrectly classify documents into different clusters when documents that should belong to the same cluster do not have any shared terms. Recently, to overcome this problem, internal and external knowledge-based approaches have been used for text document clustering. However, the clustering results of these approaches are influenced by the inherent s...

Concept Chain Based Text Clustering

Different from familiar clustering objects, text documents have sparse data spaces. A common way of representing a document is as a bag of its component words, but the semantic relations between words are ignored. In this paper, we propose a novel document representation approach to strengthen the discriminative feature of document objects. We replace terms of documents with concepts in WordNet...

Visual Text Summarization in Supervised and Unsupervised Constraints Using CITCC

Abstract: In this work, clustering performance is improved by proposing an algorithm called constrained information-theoretic co-clustering (CITCC). The work focuses mainly on co-clustering and constrained clustering. Co-clustering differs from ordinary clustering methods in that it examines both documents and words at the same time. A novel constrained co-clustering approach is proposed that automatic...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Publication date: 2004