Naturalness of Natural Language Artifacts in Software

G. Sridhara, Vibha Sinha, Senthil Mani
DOI: 10.1145/2723742.2723758
Published in: Proceedings of the 8th India Software Engineering Conference
Publication date: 2015-02-18
Citations: 2

Abstract

We present a study on the naturalness of the natural language artifacts in software. Naturalness is essentially repetitiveness or predictability. By natural language artifacts, we mean source code comments, revision history messages, bug reports, and so on. We measure "naturalness" using a standard metric, cross-entropy (or perplexity), computed with the widely used N-gram models.

Previously, Hindle et al. demonstrated empirically that source code is comparatively more repetitive, or regular (i.e., more natural), than traditional English text. A question that logically follows from their work is how natural the other artifacts associated with software are. We present our findings on source code comments, commit logs, bug reports, string messages, and content from the popular question-and-answer forum StackOverflow.

Each of the artifacts we examine is a natural language artifact associated with software. However, they do not exhibit the same degree of regularity (naturalness). Commit logs were the most regular, followed by string literal messages and source code comments. Content from StackOverflow (viz., titles, questions, and answers) behaved similarly to traditional English text, i.e., with comparatively less regularity. Bug reports from industrial projects exhibited more regularity than bug reports from open source projects, whose naturalness resembled that of typical English text.

Our findings have implications for the feasibility of building tools such as comment and bug report completion engines. We describe a next-word prediction tool that we built using the N-gram language model. This tool achieved an accuracy ranging from 70 to 90% on commit messages in different projects, and from 56 to 78% on source comments. We also present a part-of-speech-based analysis of which words are easy and which are difficult to predict.
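The abstract's approach (an N-gram model that both scores cross-entropy and predicts the next word) can be sketched as follows. This is a minimal illustration, not the authors' actual tool: the add-one (Laplace) smoothing, the padding symbols, and the toy commit messages in the usage example are all assumptions made for the sketch.

```python
import math
from collections import Counter, defaultdict


class NGramModel:
    """Minimal N-gram language model with add-one smoothing (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        # Maps an (n-1)-token context tuple to a Counter of next-token frequencies.
        self.counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences):
        """Accumulate context -> next-token counts from tokenized sentences."""
        for tokens in sentences:
            padded = ["<s>"] * (self.n - 1) + tokens + ["</s>"]
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.counts[context][padded[i]] += 1

    def prob(self, context, token):
        """Add-one smoothed P(token | context)."""
        c = self.counts[context]
        return (c[token] + 1) / (sum(c.values()) + len(self.vocab))

    def predict(self, context):
        """Return the most frequent continuation of the context, if any was seen."""
        c = self.counts[tuple(context[-(self.n - 1):])]
        return c.most_common(1)[0][0] if c else None

    def cross_entropy(self, tokens):
        """Average negative log2 probability per token: low = regular/'natural'."""
        padded = ["<s>"] * (self.n - 1) + tokens + ["</s>"]
        total, m = 0.0, 0
        for i in range(self.n - 1, len(padded)):
            context = tuple(padded[i - self.n + 1:i])
            total += -math.log2(self.prob(context, padded[i]))
            m += 1
        return total / m


# Usage: a bigram model over (hypothetical) commit messages. In-domain text
# gets lower cross-entropy than unseen text, mirroring the paper's measure.
model = NGramModel(n=2)
model.train([["fix", "bug"], ["fix", "bug"], ["fix", "typo"]])
print(model.predict(["fix"]))                       # most frequent continuation
print(model.cross_entropy(["fix", "bug"]))          # low: seen, repetitive
print(model.cross_entropy(["add", "feature"]))      # high: unseen tokens
```

Repetitive corpora such as commit logs yield low cross-entropy (high predictability), which is exactly why a next-word completion engine is feasible for them but harder for StackOverflow-style free text.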