Medline Document Clustering with Semi-Supervised Spectral Clustering Algorithm
Abstract
Three different types of information are used to cluster biomedical documents: local content (LC), global content (GC), and MeSH semantics (MS). Previous methods used only one or two of these types, clustered with constraint-based and distance-based algorithms. The proposed system instead uses a semi-supervised clustering algorithm, which makes the most of noisy constraints to improve clustering performance. The results are very promising.

Keywords: biomedical text mining, document clustering, semi-supervised clustering, spectral clustering.

I. INTRODUCTION

Literature reading is an important way for biomedical researchers to trace scientific progress and generate new scientific hypotheses. The main search target is MEDLINE, the largest biomedical literature database, which covers around 5600 life science journals published worldwide and holds around 21 million citations dating back to 1948. It is accessible through PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), an online search service [1]. Mining MEDLINE for efficient knowledge discovery has become an active research field. In particular, document clustering, i.e., automatically grouping similar documents together and separating dissimilar ones, can greatly contribute to managing and organizing literature, navigating and locating search results, and providing personalized information services [4].

Many studies have been carried out on MEDLINE document clustering. Traditionally, only the local-content (LC) information of the documents in the data set to be clustered has been used: each document is represented as a "bag of words," which yields a weighted vector according to the so-called vector space model [2], and clustering is then carried out on these weighted vectors. However, MEDLINE documents have some distinct features that could be used to enhance clustering performance.
First, for each MEDLINE document, PubMed provides a set of related articles from the whole MEDLINE collection, precomputed by comparing words from the title, the abstract, and the medical subject headings (MeSH) using a word-weighting algorithm. Recently, Theodosiou et al. made use of this kind of global-content (GC) information for clustering MEDLINE documents. Second, most MEDLINE documents have been annotated with MeSH (http://www.nlm.nih.gov/mesh/). MeSH is a controlled vocabulary thesaurus with a set of description terms organized in a hierarchical structure, where general concepts appear at the top and specific concepts appear at the bottom. The rich semantic information in MeSH can improve the performance of clustering MEDLINE documents. One earlier approach mapped the terms in documents to MeSH concepts according to the MeSH thesaurus and showed improved clustering performance under various methods, such as k-means, bisecting k-means, and suffix tree clustering. A similar strategy was also used for term reweighting in document clustering [3]. However, this method no longer uses the original texts, so important content information in the original documents may be lost.

Overall, existing approaches to biomedical document clustering have two serious limitations: 1) they use only one or two types of information, and 2) they lack effective algorithms for integrating different types of information. More recently, we proposed an approach that linearly combines the LC and MeSH-semantic (MS) similarities, and showed empirically that it outperforms using only one of the two similarities. The linear combination strategy has also been used in other bioinformatics problems, such as gene clustering with multiple data sources (or constraints), including Gene Ontology, metabolic networks, and gene expression.
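The linear combination strategy described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `combine_similarities`, the weight `alpha`, and the toy matrices are hypothetical and do not come from the paper, which does not give its weighting scheme here.

```python
def combine_similarities(s_lc, s_ms, alpha=0.5):
    """Element-wise alpha * s_lc + (1 - alpha) * s_ms.

    s_lc and s_ms are square similarity matrices (lists of lists) of the
    same size. alpha is an assumed mixing weight, not a value from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_lc, row_ms)]
        for row_lc, row_ms in zip(s_lc, s_ms)
    ]


# Toy similarities for three documents (made-up values).
S_LC = [[1.0, 0.2, 0.6],
        [0.2, 1.0, 0.4],
        [0.6, 0.4, 1.0]]
S_MS = [[1.0, 0.8, 0.0],
        [0.8, 1.0, 0.2],
        [0.0, 0.2, 1.0]]

combined = combine_similarities(S_LC, S_MS, alpha=0.5)
```

With a single weight per source, every document pair is blended in the same way, regardless of how reliable each similarity is for that particular pair.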
With the linear combination strategy, once the data sets are integrated, a variety of clustering models can be used, e.g., hierarchical clustering, Gaussian mixture models, k-medoids, and Markov random fields. However, this strategy has roughly three drawbacks in document clustering. First, the true similarity is not necessarily a simple linear function of the different types of similarities (sources). Second, the quality of a similarity may not be even across all document pairs in a data set. Some pairs are more reliable and deserve more attention. For example, two documents with extremely high MS similarity usually share similar interests and belong in the same cluster, while extremely low similarity might mean that the two documents should not be in the same cluster. Third, it is difficult to choose a suitable weighting configuration that balances three or more different types of similarities when integrating them.

Indian Journal of Emerging Electronics in Computer Communications, Vol.1, Issue 2, Dec 2014, ISSN: 2393-8366 228

Recently, semi-supervised clustering has been studied extensively in machine learning and data mining [4]. Semi-supervised clustering algorithms incorporate prior knowledge to improve clustering performance. The prior knowledge is usually provided by labeled instances or, more typically, by two types of constraints, must-link (ML) and cannot-link (CL), where ML means that the two corresponding examples should be in the same cluster and CL means that the two examples should not be in the same cluster [18]–[27]. Constrained k-means (SS-K-means) is an early semi-supervised clustering algorithm that was developed directly from k-means [18]. In each iteration, SS-K-means tries to assign each instance to the cluster with the most similar centroid, unless doing so would violate the constraints.
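The assignment step of constrained k-means described above can be sketched as follows. This is a minimal illustration, not the implementation cited as [18]: the function names, the use of squared Euclidean distance as the similarity measure, and the toy data are all assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def violates(i, cluster, assignment, must_link, cannot_link):
    """True if placing instance i in `cluster` breaks a constraint
    against an instance that has already been assigned."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if other in assignment and assignment[other] == cluster:
                return True
    return False


def constrained_assign(points, centroids, must_link, cannot_link):
    """One assignment pass: each point goes to the nearest centroid
    whose cluster does not violate an ML or CL constraint."""
    assignment = {}
    for i, p in enumerate(points):
        # Try clusters from nearest to farthest centroid.
        order = sorted(range(len(centroids)),
                       key=lambda c: squared_distance(p, centroids[c]))
        for c in order:
            if not violates(i, c, assignment, must_link, cannot_link):
                assignment[i] = c
                break
        else:
            raise ValueError(f"no feasible cluster for instance {i}")
    return assignment


# Toy data: two points near each centroid (made-up values).
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (0.9, 0.0)]
centroids = [(0.0, 0.0), (1.0, 0.0)]

unconstrained = constrained_assign(points, centroids, [], [])
constrained = constrained_assign(points, centroids,
                                 must_link=[(0, 1)], cannot_link=[(2, 3)])
```

In the toy run, the cannot-link pair (2, 3) forces instance 3 away from its nearest centroid. A full algorithm would alternate this step with centroid updates until the assignment stops changing.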
Spectral clustering is a well-accepted method for clustering nodes over a graph (or an adjacency matrix), where clustering is a graph cut problem that can be solved by matrix trace optimization. Semi-supervised nonnegative matrix factorization (SS-NMF) has also been developed to incorporate the ML and CL constraints [26]; similar to SL, the weight between two instances is set high for ML and low for CL. In practice, a large number of constraints cannot be given a priori, and there are no established methods for generating constraints. Thus, in our case, we use one type of similarity for the examples to be clustered and the other types of similarities for the constraints. In this way, semi-supervised clustering can address the limitations of the linear combination strategy and might improve on existing methods for clustering MEDLINE documents with three different types of similarities: the LC, GC, and MS similarities.

We present a new semi-supervised clustering method, which we call SSNCut, based on the normalized cut. We empirically demonstrate the performance of SSNCut on 100 data sets of MEDLINE documents with known class labels (biological topics). Experimental results showed that SSNCut outperformed the up-to-date linear combination strategy, as well as several well-known semi-supervised clustering algorithms, and that the differences were statistically significant. Moreover, SSNCut performed better with constraints from both the MS and GC similarities than with constraints from only one type of similarity, which means that our strategy of using three types of similarities is useful in MEDLINE document clustering. Another interesting finding is that ML constraints worked more effectively than CL constraints, partly because around 10% of the generated CL constraints were incorrect, while only around 1% of the generated ML constraints were incorrect.
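To make the graph cut view concrete, the sketch below evaluates the standard two-way normalized cut objective, cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V), and applies the simple weight adjustment mentioned above (high weight for ML, low weight for CL). It is not the SSNCut algorithm itself, whose formulation is not given in this text; the function names, the `high` and `low` values, and the toy graph are assumptions.

```python
def apply_constraints(weights, must_link, cannot_link, high=1.0, low=0.0):
    """Return a copy of the weight matrix with ML pairs set to `high`
    and CL pairs set to `low` (assumed values, kept symmetric)."""
    w = [row[:] for row in weights]
    for i, j in must_link:
        w[i][j] = w[j][i] = high
    for i, j in cannot_link:
        w[i][j] = w[j][i] = low
    return w


def normalized_cut(weights, labels):
    """Two-way normalized cut of a partition given as 0/1 labels."""
    n = len(weights)
    cut = 0.0
    assoc = {0: 0.0, 1: 0.0}
    for i in range(n):
        for j in range(n):
            # assoc(X, V): total weight from nodes in X to all nodes.
            assoc[labels[i]] += weights[i][j]
            # cut(A, B): total weight crossing from side 0 to side 1.
            if labels[i] == 0 and labels[j] == 1:
                cut += weights[i][j]
    if assoc[0] == 0.0 or assoc[1] == 0.0:
        raise ValueError("each side must have positive total weight")
    return cut / assoc[0] + cut / assoc[1]


# Toy symmetric similarity graph over four documents (made-up values).
W = [[0.0, 0.9, 0.5, 0.1],
     [0.9, 0.0, 0.1, 0.1],
     [0.5, 0.1, 0.0, 0.9],
     [0.1, 0.1, 0.9, 0.0]]
labels = [0, 0, 1, 1]

ncut_before = normalized_cut(W, labels)
W_cl = apply_constraints(W, must_link=[], cannot_link=[(0, 2)])
ncut_after = normalized_cut(W_cl, labels)
```

Lowering the weight of the cannot-link pair (0, 2) makes the partition that separates documents 0 and 2 cheaper under the normalized cut, which is how such constraints steer a spectral method toward the desired split.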
Similar resources
Extracting Prior Knowledge from Data Distribution to Migrate from Blind to Semi-Supervised Clustering
Although many studies have been conducted to improve clustering efficiency, most state-of-the-art schemes suffer from a lack of robustness and stability. This paper proposes an efficient approach to elicit prior knowledge, in terms of must-link and cannot-link constraints, from the estimated distribution of raw data in order to convert a blind clustering problem into a semi-supervised o...
Spectral Clustering for Complex Settings
Many real-world datasets can be modeled as graphs, where each node corresponds to a data instance and an edge represents the relation/similarity between two nodes. To partition the nodes into different clusters, spectral clustering is used to find the normalized minimum cut of the graph (in the relaxed sense). As one of the most popul...
Semi-supervised Spectral Clustering Algorithm Based on Bayesian Decision
Recently, semi-supervised spectral clustering algorithms, which are proposed to improve clustering performance, have been developing rapidly. In this paper, we first review the existing spectral clustering algorithms in a unified framework and give a straightforward explanation of the spectral clustering algorithm. Then, we present a semi-supervised method to improve the clusteri...
The Hong Kong Baptist University. Adapting Kernel-based Methods to Semi-supervised Learning: From Multi-class SVM to Spectral Analysis. A Research Prospectus Submitted to the Thesis Committee for Pursuing the Degree of Master of Philosophy, Department of Computer Science, by Wu Zhili
This prospectus proposes a preliminary research topic on fusing the kernel-based SVM method and similarity-based spectral clustering into a semi-supervised learning algorithm, within the scope of learning from both labeled and unlabeled data. In the nine months before this prospectus, much effort has been put into extending the flourishing SVM to more practicable multi-class learning...
Color Image Segmentation Method Based on Improved Spectral Clustering Algorithm
In view of the high sparsity of image data and the problem of determining the number of clusters, we put forward a color image segmentation algorithm that combines semi-supervised machine learning with spectral graph theory. Based on a study of the related theories and methods of spectral clustering algorithms, we introduce the concept of information entropy to...
Publication date: 2016