Semantic Clustering: Exploiting Linguistic Information
Abstract
Many approaches have been developed to comprehend software source code, most of them focusing on structural information about the program. In doing so, however, they miss a crucial source of information: the domain semantics contained in the text and symbols of the source code. To understand software as a whole, we need to enrich these approaches with conceptual insights gained from the domain semantics. This paper proposes using information retrieval techniques to exploit linguistic information, such as identifier names and comments in source code, to gain insight into how the domain is mapped to the code. We introduce Semantic Clustering, an algorithm based on Latent Semantic Indexing that groups source artifacts according to how they use similar terms. After detecting the clusters, we label them automatically and then visually explore how they are spread over the system. Our approach works at the textual level of the source code, which makes it language independent. Nevertheless, we correlate the semantics with structural information, and we apply the approach at different levels of abstraction (for example packages, classes, methods). To validate our approach, we applied it to several case studies.
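The pipeline described in the abstract (extract terms from identifiers and comments, apply Latent Semantic Indexing, cluster, label) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy corpus, the stop-word list, the function names, the choice of average-linkage agglomerative clustering, and the frequency-based labeling are all assumptions made for illustration, since the abstract does not specify these details.

```python
import re
from collections import Counter

import numpy as np

# Hypothetical toy corpus: artifact name -> source text (identifiers and comments).
DOCS = {
    "AccountService": "class AccountService { openAccount(); closeAccount(); // account balance }",
    "BalanceReport": "class BalanceReport { accountBalance; printBalance(); // report account balance }",
    "HttpServer": "class HttpServer { handleRequest(); sendResponse(); // http request socket }",
    "SocketPool": "class SocketPool { openSocket(); closeSocket(); // socket for http request }",
}

# Assumed stop words; a real setup would use a larger, language-aware list.
STOP = {"class", "the", "for", "a", "of"}


def tokenize(text):
    """Split camelCase identifiers and comments into lowercase terms."""
    words = re.findall(r"[A-Za-z][a-z]*", text)
    return [w.lower() for w in words if w.lower() not in STOP]


def term_document_matrix(docs):
    """Return artifact names, vocabulary, and a terms x documents count matrix."""
    names = list(docs)
    counts = [Counter(tokenize(docs[n])) for n in names]
    vocab = sorted(set().union(*counts))
    matrix = np.array([[c[t] for c in counts] for t in vocab], dtype=float)
    return names, vocab, matrix


def lsi_document_vectors(matrix, rank):
    """Project documents into a rank-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:rank].T * s[:rank]  # one row per document


def cosine_similarity(vectors):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def cluster(similarity, k):
    """Average-linkage agglomerative clustering down to k clusters (assumed method)."""
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = similarity[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters


def label(members, vocab, matrix, top=2):
    """Label a cluster with its most frequent terms (a simple stand-in for labeling)."""
    totals = matrix[:, members].sum(axis=1)
    order = np.argsort(-totals, kind="stable")[:top]
    return [vocab[i] for i in order]


names, vocab, matrix = term_document_matrix(DOCS)
vectors = lsi_document_vectors(matrix, rank=2)
groups = cluster(cosine_similarity(vectors), k=2)
for members in groups:
    print(sorted(names[i] for i in members), label(members, vocab, matrix))
```

Because the sketch reads only text, the same code would apply to packages, classes, or methods by changing what each entry of `DOCS` contains. The rank and the number of clusters are free parameters here and would need tuning on a real system.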
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Automating the Generation of Semantic Annotation Schema Using a Clustering Technique
In order to generate semantic annotations for a collection of documents, one needs an annotation schema consisting of a semantic model (a.k.a. ontology) along with lists of linguistic indicators (keywords and patterns) for each concept in the ontology. The focus of this paper is the automatic generation of the linguistic indicators for a given semantic model and a corpus of documents. Our appro...
SR-clustering: Semantic regularized clustering for egocentric photo streams segmentation
While wearable cameras are becoming increasingly popular, locating relevant information in large unstructured collections of egocentric images is still a tedious and time consuming process. This paper addresses the problem of organizing egocentric photo streams acquired by a wearable camera into semantically meaningful segments, hence making an important step towards the goal of automatically a...
Identifying Relational Concept Lexicalisations by Using General Linguistic Knowledge
This paper analyses how general-purpose semantic hierarchies could be helpful in the construction of one-to-many mappings between the coarse-grained relational concepts and the corresponding linguistic realisations. We propose an original model, the semantic fingerprint, for exploiting ambiguous semantic information within the feature vector model.
Clustering of Terms from Translation Dictionaries and Synonyms Lists to Automatically Build more Structured Linguistic Resources
Building a Linguistic Resource (LR) is a task requiring a huge quantity of means, human resources and funds. Though finalizing the development phase and assessing the produced resource necessarily require human involvement, a computer-aided process for building the resource's initial structure would greatly reduce the overall effort to be undertaken. We present here a novel approa...
Publication date: 2006