CHIS: A Novel Hybrid Granularity Identifier Splitting Approach

2021 28th Asia-Pacific Software Engineering Conference (APSEC). Pub Date: 2021-12-01. DOI: 10.1109/APSEC53868.2021.00027

Siyuan Liu, Jingxuan Zhang, Jiahui Liang, Junpeng Luo, Yong Xu, Chenxing Sun


Citations: 2

Abstract

Information Retrieval (IR) techniques are widely used in a growing number of software maintenance activities. However, the lexicon of source code (especially identifiers) does not match the vocabulary of other software artifacts, which makes IR techniques less effective. It is therefore essential to normalize identifiers, that is, to parse them into natural language terms. Identifier splitting significantly affects the effectiveness of identifier normalization. Although researchers have proposed several approaches to split identifiers, three main drawbacks remain unresolved: ignoring morphemes, over-splitting, and under-splitting. In this paper, we propose CHIS, a new Character-level Hybrid-granularity Identifier Splitting approach, to resolve these three drawbacks and split identifiers more accurately. CHIS combines Bidirectional Encoder Representations from Transformers (BERT) and Conditional Random Fields (CRF) to train a deep learning model that splits identifiers. CHIS also employs a pre-processing component to address the morpheme drawback and a post-processing component to address the over-splitting and under-splitting drawbacks, further improving its performance. Specifically, in the pre-processing component, CHIS obtains the most frequent subwords of the training identifiers and labels them as morphemes, using the Byte Pair Encoding (BPE) algorithm and a sequence labeling algorithm. In the post-processing component, CHIS iteratively merges and splits the results produced by the deep learning model to resolve over-splitting and under-splitting. We conduct extensive experiments to show the effectiveness of CHIS. Experimental results show that CHIS achieves an Accuracy of 0.943 on average and outperforms the state-of-the-art approach by 0.085 on average. The effectiveness of the pre-processing and post-processing components of CHIS is also validated.
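To make the pre-processing idea more concrete, the following is a minimal sketch of how BPE can surface frequent subwords from a set of identifiers. It is not the authors' implementation: the abstract does not give the BPE configuration, so the function name `learn_bpe_merges`, the lowercase toy identifiers, and the number of merges are all assumptions chosen for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of tokens.

    Hypothetical helper for illustration only; the paper's actual
    BPE setup (vocabulary size, casing, corpus) is not stated here.
    """
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Toy corpus of lowercase identifiers (assumed data, not from the paper).
identifiers = ["getname", "getvalue", "setname", "getsize"]
merges, vocab = learn_bpe_merges(identifiers, num_merges=2)
print(merges)        # learned merge rules, most frequent first
print(list(vocab))   # identifiers segmented with the learned subwords
```

On this toy corpus the first two merges build the subword `get`, which is the kind of frequent unit a pre-processing step could then label as a morpheme candidate before training the character-level model.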

