Cleaning StackOverflow for Machine Translation

Musfiqur Rahman, Peter C. Rigby, Dharani Palani, T. Nguyen
{"title":"Cleaning StackOverflow for Machine Translation","authors":"Musfiqur Rahman, Peter C. Rigby, Dharani Palani, T. Nguyen","doi":"10.1109/MSR.2019.00021","DOIUrl":null,"url":null,"abstract":"Generating source code API sequences from an English query using Machine Translation (MT) has gained much interest in recent years. For any kind of MT, the model needs to be trained on a parallel corpus. In this paper we clean StackOverflow, one of the most popular online discussion forums for programmers, to generate a parallel English-Code corpus from Android posts. We contrast three data cleaning approaches: standard NLP, title only, and software task extraction. We evaluate the quality of the each corpus for MT. To provide indicators of how useful each corpus will be for machine translation, we provide researchers with measurements of the corpus size, percentage of unique tokens, and per-word maximum likelihood alignment entropy. We have used these corpus cleaning approaches to translate between English and Code [22, 23], to compare existing SMT approaches from word mapping to neural networks [24], and to re-examine the \"natural software\" hypothesis [29]. After cleaning and aligning the data, we create a simple maximum likelihood MT model to show that English words in the corpus map to a small number of specific code elements. This model provides a basis for the success of using StackOverflow for search and other tasks in the software engineering literature and paves the way for MT. Our scripts and corpora are publicly available on GitHub [1] as well as at https://search.datacite.org/works/10.5281/zenodo.2558551.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"18 1","pages":"79-83"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2019.00021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Generating source code API sequences from an English query using Machine Translation (MT) has gained much interest in recent years. For any kind of MT, the model needs to be trained on a parallel corpus. In this paper we clean StackOverflow, one of the most popular online discussion forums for programmers, to generate a parallel English-Code corpus from Android posts. We contrast three data cleaning approaches: standard NLP, title only, and software task extraction. We evaluate the quality of each corpus for MT. To provide indicators of how useful each corpus will be for machine translation, we provide researchers with measurements of the corpus size, percentage of unique tokens, and per-word maximum likelihood alignment entropy. We have used these corpus cleaning approaches to translate between English and Code [22, 23], to compare existing SMT approaches from word mapping to neural networks [24], and to re-examine the "natural software" hypothesis [29]. After cleaning and aligning the data, we create a simple maximum likelihood MT model to show that English words in the corpus map to a small number of specific code elements. This model provides a basis for the success of using StackOverflow for search and other tasks in the software engineering literature and paves the way for MT. Our scripts and corpora are publicly available on GitHub [1] as well as at https://search.datacite.org/works/10.5281/zenodo.2558551.
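The "per-word maximum likelihood alignment entropy" mentioned above can be illustrated with a minimal sketch. The toy corpus, the simple co-occurrence counting, and the token names below are hypothetical stand-ins rather than the paper's actual pipeline: the idea is to estimate P(code token | English word) by maximum likelihood from aligned English-code pairs and then report the entropy of that distribution, so words that map to a small number of specific code elements score low.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: each pair is (English query tokens, code/API tokens).
# The paper's corpora are built from cleaned StackOverflow Android posts.
parallel_corpus = [
    (["read", "file", "line", "by", "line"],
     ["BufferedReader.readLine", "FileReader"]),
    (["open", "file"],
     ["FileReader", "File.exists"]),
    (["read", "text", "file"],
     ["BufferedReader.readLine", "FileReader"]),
]

# Maximum likelihood "alignment": count how often each code token
# co-occurs with each English word across the aligned pairs.
cooccur = defaultdict(Counter)
for english_tokens, code_tokens in parallel_corpus:
    for e in english_tokens:
        for c in code_tokens:
            cooccur[e][c] += 1

def alignment_entropy(word):
    """Entropy of the ML distribution P(code token | word); lower values
    mean the word maps to fewer, more specific code elements."""
    counts = cooccur[word]
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

for word in ["read", "file", "open"]:
    print(word, round(alignment_entropy(word), 3))
```

In this toy example, "open" co-occurs with fewer distinct code tokens than "file" and therefore has lower entropy; aggregated over a whole corpus, low per-word entropy is the kind of evidence the abstract cites for English words mapping to a small number of code elements, and hence for the feasibility of MT on the cleaned corpus.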