Linguistic Economy Applied to Programming Language Identifiers

Journal of Software Engineering and Applications, Pub Date: 2021-01-12, DOI: 10.4236/JSEA.2021.141001

Michael Dorin, Sergio Montenegro

Volume 14, Issue 1, Pages 1-10

Citations: 1

Abstract

Although many readability metrics have been proposed, there is still no universal agreement on how to define the readability of software source code. This lack of agreement has ramifications in many areas of the software development life cycle, not least software maintainability. We propose a measurement based on Linguistic Economy to bridge the gap between mathematical and behavioral aspects. Linguistic Economy describes efficiencies of speech and is generally applied to natural languages. In our study, we build a large corpus of words likely to be found in a programmer's vocabulary and a corpus of existing identifiers drawn from a collection of open-source projects. We perform a usage analysis on both corpora to create a database. Linguistic Economy suggests that words requiring less effort to speak are used more often than words requiring more effort. We apply this concept to measure how difficult program identifiers are to understand by extracting them from the program source and comparing their usage to the database. Through this process, we can identify source code that programmers find difficult to review. We validate our work using data from a survey in which programmers identified source files that were unpleasant to review. The results indicate that source files identified as unpleasant to review have more linguistically complicated identifiers than pleasant ones.
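The pipeline described in the abstract (extract identifiers, split them into words, compare word usage against a database) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the frequency table `USAGE_COUNTS`, the regular expressions, the smoothed negative-log-frequency cost, and the function names are all assumptions made for illustration, and the sketch uses usage frequency alone as a stand-in for the effort-based measure the paper proposes.

```python
import math
import re
from collections import Counter

# Hypothetical usage database: the words and counts are invented for
# illustration and do not come from the paper's corpora.
USAGE_COUNTS = Counter(
    {"get": 900, "name": 800, "file": 700, "count": 500, "index": 400, "buffer": 150}
)

# Naive identifier extraction; a real tool would use a parser and skip
# language keywords, comments, and string literals.
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Splits camelCase, PascalCase, acronyms, and snake_case into word pieces.
WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(identifier):
    """Split an identifier into lowercase words."""
    return [word.lower() for word in WORD_RE.findall(identifier)]


def word_cost(word, counts=USAGE_COUNTS):
    """Return a cost that grows as the word becomes rarer in the database.

    Uses add-one smoothing so unseen words get a finite, maximal cost.
    """
    total = sum(counts.values()) + len(counts) + 1
    return -math.log((counts.get(word, 0) + 1) / total)


def file_score(source, counts=USAGE_COUNTS):
    """Average word cost over all identifiers found in the source text."""
    words = [
        word
        for identifier in IDENTIFIER_RE.findall(source)
        for word in split_identifier(identifier)
    ]
    if not words:
        return 0.0
    return sum(word_cost(word, counts) for word in words) / len(words)


if __name__ == "__main__":
    familiar = "getFileName(fileName)"
    obscure = "xqzBuf(zzTmp)"
    print(file_score(familiar) < file_score(obscure))
```

Under these assumptions, a file whose identifiers are built from common words receives a lower score than one built from rare or unknown fragments, which mirrors the direction of the result reported in the abstract.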



