论软件的自然性

2012 34th International Conference on Software Engineering (ICSE) Pub Date : 2012-06-02 DOI:10.1109/ICSE.2012.6227135

Premkumar T. Devanbu

{"title":"论软件的自然性","authors":"Premkumar T. Devanbu","doi":"10.1109/ICSE.2012.6227135","DOIUrl":null,"url":null,"abstract":"Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations - and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's built-in completion capability. We conclude the paper by laying out a vision for future research in this area.","PeriodicalId":420187,"journal":{"name":"2012 34th International Conference on Software Engineering (ICSE)","volume":"180 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"794","resultStr":"{\"title\":\"On the naturalness of software\",\"authors\":\"Premkumar T. Devanbu\",\"doi\":\"10.1109/ICSE.2012.6227135\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations - and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's built-in completion capability. We conclude the paper by laying out a vision for future research in this area.\",\"PeriodicalId\":420187,\"journal\":{\"name\":\"2012 34th International Conference on Software Engineering (ICSE)\",\"volume\":\"180 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-06-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"794\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 34th International Conference on Software Engineering (ICSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSE.2012.6227135\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 34th International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSE.2012.6227135","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 794

摘要

像英语这样的自然语言丰富、复杂、强大。莎士比亚和阿瓦伊亚尔等大师对英语和泰米尔语等语言极具创造性和优雅的运用，当然可以令人愉悦和鼓舞。但在实践中，考虑到认知限制和日常生活的紧急情况，大多数人类的话语都要简单得多，重复得多，可预测得多。事实上，这些话语可以用现代统计方法非常有用地建模。这一事实导致了统计方法在语音识别、自然语言翻译、问答以及文本挖掘和理解方面取得了惊人的成功。我们首先推测，大多数软件也是自然的，从某种意义上说，它是由人类在工作中创造的，伴随着所有的约束和限制——因此，就像自然语言一样，它也可能是重复的和可预测的。然后我们继续询问是否a)代码可以通过统计语言模型有效地建模，以及b)这样的模型可以用来支持软件工程师。使用广泛采用的n-gram模型，我们提供实证证据支持这两个问题的积极答案。我们展示了代码也是非常重复的，实际上甚至比自然语言还要重复。作为该模型的一个示例，我们已经为Java开发了一个简单的代码完成引擎，尽管它很简单，但它已经改进了Eclipse内置的完成功能。最后，我们对这一领域的未来研究进行了展望。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

On the naturalness of software

Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations - and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's built-in completion capability. We conclude the paper by laying out a vision for future research in this area.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 34th International Conference on Software Engineering (ICSE)

自引率

0.00%

发文量