{"title":"Naturalness and Artifice of Code: Exploiting the Bi-Modality","authors":"Prem Devanbu","doi":"10.1145/3511430.3511915","DOIUrl":null,"url":null,"abstract":"While natural languages are rich in vocabulary and grammatical flexibility, most human are mundane and repetitive. This repetitiveness in natural language has led to great advances in statistical NLP methods. In our lab, we discovered (almost a decade ago) that, despite the considerable power and flexibility of programming languages, large software corpora are actually even more repetitive than NL Corpora. We also showed that this “naturalness” of code could be captured in language models, and exploited within software tools. This line of work has prospered, and been turbo-charged by the tremendous capacity and design flexibility of deep learning models. Numerous other creative and interesting applications of naturalness have ensued, from colleagues around the world, and several industrial applications have emerged. Recently, we have been studying the consequences and opportunities arising from the observation that Software is bimodal: it’s written not only to be run on machines, but also read by humans; this makes software amenable to both algorithmic analysis, and statistical prediction. Bimodality allows new ways of training machine learning models, new ways of designing analysis algorithms, and new ways to understand the practice of programming. In this talk, I will begin with a backgrounder on ”Naturalness” studies, and the promise of bimodality.","PeriodicalId":138760,"journal":{"name":"15th Innovations in Software Engineering Conference","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"15th Innovations in Software Engineering Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3511430.3511915","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
While natural languages are rich in vocabulary and grammatical flexibility, most human utterances are mundane and repetitive. This repetitiveness in natural language has led to great advances in statistical NLP methods. In our lab, we discovered (almost a decade ago) that, despite the considerable power and flexibility of programming languages, large software corpora are actually even more repetitive than NL corpora. We also showed that this “naturalness” of code could be captured in language models and exploited within software tools. This line of work has prospered, and has been turbo-charged by the tremendous capacity and design flexibility of deep learning models. Numerous other creative and interesting applications of naturalness have ensued, from colleagues around the world, and several industrial applications have emerged. Recently, we have been studying the consequences and opportunities arising from the observation that software is bimodal: it is written not only to be run on machines, but also to be read by humans; this makes software amenable to both algorithmic analysis and statistical prediction. Bimodality allows new ways of training machine learning models, new ways of designing analysis algorithms, and new ways to understand the practice of programming. In this talk, I will begin with a backgrounder on “naturalness” studies, and then turn to the promise of bimodality.
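
To make the “naturalness” measurement concrete: the early studies scored held-out text with n-gram language models, using per-token cross-entropy (lower means more predictable, i.e. more repetitive). The Python sketch below is a minimal illustration of that idea under stated assumptions, not the actual experimental setup from these studies: it uses a trigram model with simple add-one smoothing over whitespace-split tokens, whereas the original work used larger n-gram models with more sophisticated smoothing, and the toy corpora here are hypothetical.

    # Minimal sketch (assumed setup, not the studies' actual one): measure how
    # repetitive a token stream is via trigram cross-entropy on held-out text.
    import math
    from collections import Counter

    def ngrams(tokens, n):
        """Yield n-grams, padding the front so every token has a context."""
        padded = ["<s>"] * (n - 1) + list(tokens)
        for i in range(len(tokens)):
            yield tuple(padded[i:i + n])

    def train_trigram(tokens):
        """Count trigram and bigram-context frequencies over a training corpus."""
        tri = Counter(ngrams(tokens, 3))
        bi = Counter(g[:2] for g in ngrams(tokens, 3))
        return tri, bi, len(set(tokens))

    def cross_entropy(tokens, tri, bi, vsize):
        """Average negative log2 probability per token, add-one smoothed."""
        total = 0.0
        for g in ngrams(tokens, 3):
            p = (tri[g] + 1) / (bi[g[:2]] + vsize)
            total -= math.log2(p)
        return total / len(tokens)

    # Hypothetical toy data: a repetitive "code" corpus scores a held-out
    # snippet with low cross-entropy, reflecting its predictability.
    code = "for i in range ( n ) : total += x [ i ]".split() * 20
    tri, bi, v = train_trigram(code)
    held_out = "for j in range ( m ) : total += y [ j ]".split()
    print(cross_entropy(held_out, tri, bi, v))

In the published studies, the striking finding was that code corpora yield markedly lower cross-entropy than natural-language corpora under comparable models, which is what licenses the claim that software is even more repetitive than natural language.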