Semantic Clustering: exploiting Linguistic Information

Authors

Adrian Ivo Kuhn

Adrian Kuhn

Abstract

Many approaches have been developed to comprehend software source code, most of them focusing on the program's structural information. In doing so, however, we miss a crucial source of information: the domain semantics contained in the text and symbols of the source code. To understand software as a whole, we need to enrich these approaches with conceptual insights gained from the domain semantics. This paper proposes using information retrieval techniques to exploit linguistic information, such as identifier names and comments in source code, to gain insight into how the domain is mapped to the code. We introduce Semantic Clustering, an algorithm that groups source artifacts according to the similarity of the terms they use. The algorithm is based on Latent Semantic Indexing. After detecting the clusters, we label them automatically and then visually explore how they are spread over the system. Our approach works at the textual level of the source code, which makes it language independent. Nevertheless, we correlate the semantics with structural information and apply the approach at different levels of abstraction (for example packages, classes, and methods). To validate our approach, we applied it to several case studies.
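
The pipeline the abstract describes (extract terms from identifiers, reduce them with Latent Semantic Indexing, group artifacts by similarity, label the groups) can be illustrated with a minimal sketch. It is not the authors' implementation. The sample classes, the stop list, the rank, the similarity threshold, the threshold-based grouping, and the frequency-based labeling are all assumptions chosen for illustration; only the use of Latent Semantic Indexing on source code vocabulary comes from the abstract.

```python
import re
from collections import Counter

import numpy as np

# Hypothetical source artifacts (name -> source text), invented for this sketch.
SOURCES = {
    "HttpServer": "class HttpServer { void handleRequest(HttpRequest request) }",
    "HttpClient": "class HttpClient { void sendRequest(HttpRequest request) }",
    "AccountLedger": "class AccountLedger { void postPayment(Payment payment) }",
    "PaymentReport": "class PaymentReport { void printAccount(Account account, Payment payment) }",
}

# Assumed stop list: language keywords that carry no domain meaning.
STOP_WORDS = {"class", "void"}


def split_terms(text):
    """Split identifiers (camelCase, snake_case) and comment words into lowercase terms."""
    terms = []
    for word in re.findall(r"[A-Za-z]+", text):
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        terms.extend(p.lower() for p in parts)
    return [t for t in terms if t not in STOP_WORDS]


def term_document_matrix(docs):
    """Build a terms x documents matrix of raw term counts."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for t in doc:
            matrix[index[t], j] += 1.0
    return vocab, matrix


def lsi_document_vectors(matrix, rank):
    """Project documents into a rank-k latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:rank]) @ vt[:rank]).T  # one row per document


def cosine_similarity(vectors):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def group_by_threshold(similarity, threshold):
    """Merge documents whose similarity reaches the threshold (single-link grouping)."""
    labels = list(range(len(similarity)))
    for i in range(len(similarity)):
        for j in range(i + 1, len(similarity)):
            if similarity[i, j] >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if label == old else label for label in labels]
    return labels


def semantic_clusters(sources, rank=2, threshold=0.5):
    """Return one cluster label per source artifact, in insertion order."""
    docs = [split_terms(text) for text in sources.values()]
    _, matrix = term_document_matrix(docs)
    vectors = lsi_document_vectors(matrix, rank)
    return group_by_threshold(cosine_similarity(vectors), threshold)


def label_clusters(sources, labels, top=2):
    """Label each cluster with its most frequent terms (a simple stand-in for labeling)."""
    counts = {}
    for text, label in zip(sources.values(), labels):
        counts.setdefault(label, Counter()).update(split_terms(text))
    return {label: [t for t, _ in c.most_common(top)] for label, c in counts.items()}
```

Calling `semantic_clusters(SOURCES)` on the sample data places the two HTTP classes in one group and the two payment classes in another, and `label_clusters` then names each group by its dominant terms. Because the tokenizer only looks at text, the same sketch applies unchanged to packages or methods: pass one string per artifact at the chosen level of abstraction.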

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Automating the Generation of Semantic Annotation Schema Using a Clustering Technique

In order to generate semantic annotations for a collection of documents, one needs an annotation schema consisting of a semantic model (a.k.a. ontology) along with lists of linguistic indicators (keywords and patterns) for each concept in the ontology. The focus of this paper is the automatic generation of the linguistic indicators for a given semantic model and a corpus of documents. Our appro...

SR-clustering: Semantic regularized clustering for egocentric photo streams segmentation

While wearable cameras are becoming increasingly popular, locating relevant information in large unstructured collections of egocentric images is still a tedious and time-consuming process. This paper addresses the problem of organizing egocentric photo streams acquired by a wearable camera into semantically meaningful segments, hence making an important step towards the goal of automatically a...

Identifying Relational Concept Lexicalisations by Using General Linguistic Knowledge

This paper analyses how general-purpose semantic hierarchies could be helpful in the construction of one-to-many mappings between the coarse-grained relational concepts and the corresponding linguistic realisations. We propose an original model, the semantic fingerprint, for exploiting ambiguous semantic information within the feature vector model.

Clustering of Terms from Translation Dictionaries and Synonyms Lists to Automatically Build more Structured Linguistic Resources

Building a Linguistic Resource (LR) is a task requiring a huge quantity of means, human resources and funds. Though finalizing the development phase and assessing the produced resource necessarily require human involvement, a computer-aided process for building the resource's initial structure would greatly reduce the overall effort to be undertaken. We present here a novel approa...

Publication date: 2006
