Natural Software Revisited

2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE) Pub Date : 2019-05-25 DOI:10.1109/ICSE.2019.00022

Musfiqur Rahman, Dharani Palani, Peter C. Rigby

{"title":"Natural Software Revisited","authors":"Musfiqur Rahman, Dharani Palani, Peter C. Rigby","doi":"10.1109/ICSE.2019.00022","DOIUrl":null,"url":null,"abstract":"Recent works have concluded that software code is more repetitive and predictable, i.e. more natural, than English texts. On re-examination, we find that much of the apparent \"naturalness\" of source code is due to the presence of language specific syntax, especially separators, such as semi-colons and brackets. For example, separators account for 44% of all tokens in our Java corpus. When we follow the NLP practices of eliminating punctuation (e.g., separators) and stopwords (e.g., keywords), we find that code is still repetitive and predictable, but to a lesser degree than previously thought. We suggest that SyntaxTokens be filtered to reduce noise in code recommenders. Unlike the code written for a particular project, API code usage is similar across projects: a file is opened and closed in the same manner regardless of domain. When we restrict our n-grams to those contained in the Java API, we find that API usages are highly repetitive. Since API calls are common across programs, researchers have made reliable statistical models to recommend sophisticated API call sequences. Sequential n-gram models were developed for natural languages. Code is usually represented by an AST which contains control and data flow, making n-grams models a poor representation of code. Comparing n-grams to statistical graph representations of the same codebase, we find that graphs are more repetitive and contain higherlevel patterns than n-grams. We suggest that future work focus on statistical code graphs models that accurately capture complex coding patterns. Our replication package makes our scripts and data available to future researchers[1].","PeriodicalId":6736,"journal":{"name":"2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)","volume":"22 1","pages":"37-48"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSE.2019.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 34

Abstract

Recent works have concluded that software code is more repetitive and predictable, i.e. more natural, than English texts. On re-examination, we find that much of the apparent "naturalness" of source code is due to the presence of language specific syntax, especially separators, such as semi-colons and brackets. For example, separators account for 44% of all tokens in our Java corpus. When we follow the NLP practices of eliminating punctuation (e.g., separators) and stopwords (e.g., keywords), we find that code is still repetitive and predictable, but to a lesser degree than previously thought. We suggest that SyntaxTokens be filtered to reduce noise in code recommenders. Unlike the code written for a particular project, API code usage is similar across projects: a file is opened and closed in the same manner regardless of domain. When we restrict our n-grams to those contained in the Java API, we find that API usages are highly repetitive. Since API calls are common across programs, researchers have made reliable statistical models to recommend sophisticated API call sequences. Sequential n-gram models were developed for natural languages. Code is usually represented by an AST which contains control and data flow, making n-grams models a poor representation of code. Comparing n-grams to statistical graph representations of the same codebase, we find that graphs are more repetitive and contain higherlevel patterns than n-grams. We suggest that future work focus on statistical code graphs models that accurately capture complex coding patterns. Our replication package makes our scripts and data available to future researchers[1].

查看原文本刊更多论文

重新审视自然软件

最近的研究已经得出结论，软件代码比英文文本更具重复性和可预测性，即更自然。通过重新检查，我们发现源代码的许多明显的“自然性”是由于语言特定语法的存在，特别是分隔符，如分号和括号。例如，分隔符占Java语料库中所有令牌的44%。当我们遵循NLP消除标点符号(如分隔符)和停顿词(如关键字)的做法时，我们发现代码仍然是重复和可预测的，但程度比以前想象的要小。我们建议过滤SyntaxTokens以减少代码推荐中的噪音。与为特定项目编写的代码不同，API代码的使用在项目之间是相似的:无论域如何，文件都以相同的方式打开和关闭。当我们将n-gram限制为Java API中包含的n-gram时，我们发现API的使用是高度重复的。由于API调用在程序中很常见，研究人员已经建立了可靠的统计模型来推荐复杂的API调用序列。为自然语言开发了顺序n-gram模型。代码通常由包含控制和数据流的AST表示，这使得n-gram模型不能很好地表示代码。将n图与相同代码库的统计图表示进行比较，我们发现图比n图更具重复性，并且包含更高层次的模式。我们建议未来的工作集中在统计代码图模型上，以准确地捕获复杂的编码模式。我们的复制包使我们的脚本和数据可供未来的研究人员使用[1]。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)

自引率

0.00%

发文量