Lexical categories for source code identifiers

2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 228-239. Published: 2017-02-01. DOI: 10.1109/SANER.2017.7884624

Christian D. Newman, Reem S. Alsuhaibani, M. Collard, Jonathan I. Maletic


Citation count: 18

Abstract

A set of lexical categories, analogous to part-of-speech categories for English prose, is defined for source-code identifiers. The lexical category of an identifier is determined from its declaration in the source code, its syntactic meaning in the programming language, and static program analysis. Current techniques for assigning lexical categories to identifiers use natural-language part-of-speech taggers. However, these NLP approaches assign lexical tags based on how terms are used in English prose. The approach taken here differs in that it uses only source code to determine the lexical category. It assigns a lexical category to each identifier and stores this information along with each declaration. srcML is used as the infrastructure to implement the approach, so the lexical information is stored directly in the srcML markup as an additional XML element for each identifier. These lexical-category annotations can later be used by tools that automatically generate artifacts such as code summaries or documentation. The approach is applied to 50 open source projects, and the soundness of the defined lexical categories is evaluated. The evaluation shows that, at every level of minimum support tested, categorization is consistent at least 79% of the time, with an overall consistency (across all supports) of at least 88%. The categories reveal a correlation between how an identifier is named and how it is declared. This provides a syntax-oriented view (as opposed to an English part-of-speech view) of the developer's intent for identifiers.
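The storage idea described in the abstract, attaching a category to each declared identifier as an extra XML element, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the input is a simplified, un-namespaced srcML-like fragment (real srcML output uses an XML namespace and richer markup), the `lexical_category` element name is hypothetical, and the three category labels are placeholders, not the categories defined in the paper. The sketch also looks only at the declaring element; the paper's approach additionally draws on syntactic meaning and static program analysis.

```python
import xml.etree.ElementTree as ET

# Simplified, un-namespaced srcML-like fragment (assumption: real srcML
# output carries an XML namespace and more detailed markup).
SRC = (
    "<unit>"
    "<class>class <name>Stack</name> <block>{"
    "<decl_stmt><decl><type><name>int</name></type> <name>count</name></decl>;</decl_stmt>"
    "<function><type><name>bool</name></type> <name>isEmpty</name>"
    "<parameter_list>()</parameter_list> <block>{}</block></function>"
    "}</block></class>"
    "</unit>"
)

# Hypothetical labels keyed by the declaring element; these are
# placeholders, not the lexical categories defined in the paper.
CATEGORY_BY_DECLARATION = {
    "class": "type-name",
    "decl": "variable-name",
    "function": "function-name",
}


def annotate(root):
    """Insert a <lexical_category> element after each declared name.

    Returns the (identifier, category) pairs in document order.
    """
    found = []
    # Snapshot the traversal so that inserting elements is safe.
    for elem in list(root.iter()):
        category = CATEGORY_BY_DECLARATION.get(elem.tag)
        if category is None:
            continue
        # The declared identifier is the direct <name> child; the <name>
        # nested inside <type> is a grandchild and is not matched here.
        name = elem.find("name")
        if name is None:
            continue
        tag = ET.Element("lexical_category")  # hypothetical element name
        tag.text = category
        # Keep any text that followed the name after the new element.
        tag.tail, name.tail = name.tail, None
        elem.insert(list(elem).index(name) + 1, tag)
        found.append((name.text, category))
    return found


if __name__ == "__main__":
    root = ET.fromstring(SRC)
    for identifier, category in annotate(root):
        print(identifier, "->", category)
    print(ET.tostring(root, encoding="unicode"))
```

Because the annotation lives next to the declaration in the markup, a downstream tool such as a summarizer could read the category with an ordinary XML query instead of re-running a tagger.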


