Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 476-486. Pub Date: 2018-05-23. DOI: 10.1145/3196398.3196408

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, Graham Neubig


Citation count: 188

Abstract

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for creating such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited in both their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
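To make the baseline heuristic mentioned in the abstract concrete, the following is a minimal Python sketch of pairing a post title with the code in the accepted answer, followed by a few structural features of the kind a quality classifier might consume. The `Question` and `Answer` classes, the regular expression, and the feature names are all assumptions made for illustration; they are not the paper's actual data format or feature set, and the neural correspondence features are not shown.

```python
import html
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


# Hypothetical post structure; field names are assumptions for illustration.
@dataclass
class Answer:
    body_html: str
    accepted: bool = False


@dataclass
class Question:
    title: str
    answers: List[Answer]


# Matches the contents of <pre><code>...</code></pre> blocks in an answer body.
CODE_BLOCK = re.compile(
    r"<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>", re.DOTALL
)


def code_blocks(body_html: str) -> List[str]:
    """Return the unescaped, stripped contents of every code block."""
    return [html.unescape(m).strip() for m in CODE_BLOCK.findall(body_html)]


def title_accepted_pair(q: Question) -> Optional[Tuple[str, str]]:
    """Baseline heuristic: pair the title with the accepted answer's code."""
    for answer in q.answers:
        if answer.accepted:
            blocks = code_blocks(answer.body_html)
            if blocks:
                return q.title, "\n".join(blocks)
    return None  # no accepted answer, or it contains no code


def structural_features(snippet: str, block: str) -> Dict[str, float]:
    """Illustrative hand-crafted features for a candidate snippet.

    These are hypothetical examples, not the features used in the paper.
    """
    lines = snippet.strip().splitlines()
    first = lines[0].lstrip() if lines else ""
    return {
        "n_lines": float(len(lines)),
        "is_full_block": float(snippet.strip() == block.strip()),
        "starts_with_import": float(first.startswith(("import ", "from "))),
    }


question = Question(
    title="How to reverse a list in Python?",
    answers=[
        Answer("<p>Try a loop.</p>"),
        Answer("<p>Use slicing:</p><pre><code>xs[::-1]</code></pre>", accepted=True),
    ],
)
pair = title_accepted_pair(question)
features = structural_features("xs[::-1]", "xs[::-1]")
```

The sketch also shows why such a heuristic is limited: it yields at most one pair per question and takes the accepted answer's code wholesale, with no check that the code actually corresponds to the title. Scoring many candidate snippets with a classifier is one way to address both coverage and correctness.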
