Code Similarity via Natural Language Descriptions

Authors

  • Eran Yahav
  • Meital Ben Sinai

Abstract

Code similarity is a central challenge in many programming-related applications, such as code search, automatic translation, and plagiarism detection. In this work, we reduce the problem of semantic relatedness between code fragments to a problem of semantic relatedness between textual descriptions. Our main idea is that we can use the relationship between code and its textual descriptions as established in question-answering sites such as STACKOVERFLOW. Consequently, we can determine the semantic relatedness and similarity of code fragments across different programming languages, a task considered extremely difficult using traditional approaches. We have implemented our approach and used crowd-sourced labeling of similarity to evaluate it over 1500 pairs of code fragments. Results show that we achieve around 80% precision and 75% recall, demonstrating the promise of this approach.
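The reduction described in the abstract can be illustrated with a minimal sketch. The abstract does not say which text-similarity measure the authors use, so the code below substitutes a plain bag-of-words cosine similarity as a stand-in. The stop-word list, the function names, and the example (description, code) pairs are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "how", "do", "i", "is", "with"}


def tokenize(description: str) -> Counter:
    """Lowercase a description and count its non-stop-word tokens."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def code_similarity(desc_a: str, desc_b: str) -> float:
    """Score two code fragments by comparing only their descriptions.

    Assumption: this is a stand-in measure, not the paper's own method.
    """
    return cosine_similarity(tokenize(desc_a), tokenize(desc_b))


# Hypothetical (description, code) pairs, such as a question title
# paired with a snippet from an answer.
java_fragment = ("How to reverse a string in Java",
                 "new StringBuilder(s).reverse().toString();")
python_fragment = ("Reverse a string in Python", "s[::-1]")
unrelated_fragment = ("Read a file line by line in Python",
                      "for line in open(path): print(line)")

# The code itself is never inspected; only the descriptions are compared.
related_score = code_similarity(java_fragment[0], python_fragment[0])
unrelated_score = code_similarity(java_fragment[0], unrelated_fragment[0])
```

In this toy setup the Java and Python string-reversal fragments share description words, while the file-reading fragment shares none with the Java one, so the first pair scores higher even though the two snippets have no syntax in common.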

Similar resources

Semantic approaches to software component retrieval with English queries

Enabling code reuse is an important goal in software engineering, and it depends crucially on effective code search interfaces. We propose to ground word meanings in source code and use such language-code mappings in order to enable a search engine for programming library code where users can pose queries in English. We exploit the fact that there are large programming language libraries which ...

Using English to retrieve software

This paper describes ROSA, a software reuse system based on the processing of the natural language descriptions of software artifacts. Lexical, syntactic and semantic analysis of software descriptions is performed to automatically extract both verbal and nominal phrases from descriptions and use this information to create frame-based indexing units for software components. Retrieval similarity ...

A similarity measure for retrieving software artifacts

presents the mechanism for query processing and retrieval with the measures used for the similarity analysis of the indexing structures. Section 6 describes an experiment conducted to evaluate the effectiveness of the proposed approach. Section 7 summarizes related work in the area of re-use systems. Section 8 concludes the paper with some remarks on planned experiments with the system and furt...

A Syntactic Neural Model for General-Purpose Code Generation

We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architectu...

Publication date: 2014