Are deep neural networks the best choice for modeling source code?

Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering Pub Date : 2017-08-21 DOI:10.1145/3106237.3106290

V. Hellendoorn, Premkumar T. Devanbu

{"title":"Are deep neural networks the best choice for modeling source code?","authors":"V. Hellendoorn, Premkumar T. Devanbu","doi":"10.1145/3106237.3106290","DOIUrl":null,"url":null,"abstract":"Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source code. We argue here that the special properties of source code can be exploited for further improvements. In this work, we enhance established language modeling approaches to handle the special challenges of modeling source code, such as: frequent changes, larger, changing vocabularies, deeply nested scopes, etc. We present a fast, nested language modeling toolkit specifically designed for software, with the ability to add & remove text, and mix & swap out many models. Specifically, we improve upon prior cache-modeling work and present a model with a much more expansive, multi-level notion of locality that we show to be well-suited for modeling software. We present results on varying corpora in comparison with traditional N-gram, as well as RNN, and LSTM deep-learning language models, and release all our source code for public use. Our evaluations suggest that carefully adapting N-gram models for source code can yield performance that surpasses even RNN and LSTM based deep-learning models.","PeriodicalId":313494,"journal":{"name":"Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"270","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3106237.3106290","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 270

Abstract

Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source code. We argue here that the special properties of source code can be exploited for further improvements. In this work, we enhance established language modeling approaches to handle the special challenges of modeling source code, such as: frequent changes, larger, changing vocabularies, deeply nested scopes, etc. We present a fast, nested language modeling toolkit specifically designed for software, with the ability to add & remove text, and mix & swap out many models. Specifically, we improve upon prior cache-modeling work and present a model with a much more expansive, multi-level notion of locality that we show to be well-suited for modeling software. We present results on varying corpora in comparison with traditional N-gram, as well as RNN, and LSTM deep-learning language models, and release all our source code for public use. Our evaluations suggest that carefully adapting N-gram models for source code can yield performance that surpasses even RNN and LSTM based deep-learning models.

查看原文本刊更多论文

深度神经网络是源代码建模的最佳选择吗?

目前的统计语言建模技术，包括基于深度学习的模型，已被证明对源代码非常有效。我们认为源代码的特殊属性可以用于进一步的改进。在这项工作中，我们增强了已建立的语言建模方法，以处理源代码建模的特殊挑战，例如:频繁更改、更大、不断变化的词汇表、深度嵌套作用域等。我们提供了一个专门为软件设计的快速、嵌套的语言建模工具包，具有添加和删除文本以及混合和交换许多模型的能力。具体来说，我们改进了先前的缓存建模工作，并提出了一个具有更广泛、多层次的局域性概念的模型，我们认为它非常适合建模软件。与传统的N-gram、RNN和LSTM深度学习语言模型相比，我们展示了不同语料库上的结果，并发布了所有源代码供公众使用。我们的评估表明，仔细调整源代码的N-gram模型可以产生甚至超过基于RNN和LSTM的深度学习模型的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering

自引率

0.00%

发文量