CARLDA: An Approach for Stack Overflow API Mention Recognition Driven by Context and LLM-Based Data Augmentation

Impact factor 1.7; Zone 4, Computer Science; JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING
Zhang Zhang, Xinjun Mao, Shangwen Wang, Kang Yang, Tanghaoran Zhang, Yao Lu
Journal of Software-Evolution and Process, Volume 37, Issue 4. Published 2025-04-10. DOI: 10.1002/smr.70015
https://onlinelibrary.wiley.com/doi/10.1002/smr.70015
Citation count: 0

Abstract

Recognizing Application Programming Interface (API) mentions in software-related text is vital for extracting API-related knowledge: it provides deep insight into API usage and improves productivity. Previous research identifies two primary technical challenges in this task: (1) distinguishing APIs from common words and (2) identifying morphological variants of standard APIs. Deep learning-based methods have made progress on both, but they rely heavily on high-quality labeled data, which raises a third, data-related challenge: (3) such data is scarce because labeling requires substantial effort. To overcome these challenges, this paper proposes CARLDA, a context-aware API recognition method. For the first challenge, CARLDA uses two key components, Bidirectional Encoder Representations from Transformers (BERT) and Bidirectional Long Short-Term Memory (BiLSTM), to extract context at the word and sequence levels, capturing syntactic and semantic information. For the second challenge, it incorporates a character-level BiLSTM with an attention mechanism to capture global character-level context, improving recognition of the morphological features of APIs. For the third challenge, we developed specialized data augmentation techniques based on large language models (LLMs) to address both in-library and cross-library data shortages. These techniques generate varied labeled samples through targeted transformations (e.g., replacing tokens and restructuring sentences) and hybrid augmentation strategies (e.g., combining real-world and generated data while applying style rules to replicate authentic programming contexts). Because the quality of LLM-generated samples is uncertain, we also developed sample selection algorithms to filter out low-quality samples (i.e., incomplete or incorrectly labeled samples). We also constructed dedicated datasets to evaluate how well CARLDA addresses these challenges. Experimental results show that (1) CARLDA improves F1 by 11.0% and the Matthews correlation coefficient (MCC) by 10.0% over state-of-the-art methods, demonstrating superior overall performance and effectively tackling the first two challenges, and (2) the LLM-based data augmentation techniques yield high-quality labeled data and effectively alleviate the third challenge.
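
The abstract does not describe how the sample selection algorithms work, so the following is only a minimal sketch of the general idea of discarding incomplete or incorrectly labeled generated samples. It assumes a BIO tagging scheme with the tags `O`, `B-API`, and `I-API`; the names `Sample`, `is_well_formed`, and `select_samples`, and the specific checks, are hypothetical illustrations rather than the authors' method.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed tag set (not taken from the paper): BIO tags for API mentions.
VALID_LABELS = {"O", "B-API", "I-API"}


@dataclass
class Sample:
    """One generated training sample: tokens with one label per token."""
    tokens: list[str]
    labels: list[str]


def is_well_formed(sample: Sample) -> bool:
    """Return True if the sample passes basic structural checks.

    Hypothetical checks standing in for "incomplete or incorrectly labeled":
    - incomplete: no tokens, a blank token, or a token/label length mismatch
    - incorrectly labeled: an unknown tag, or I-API that does not continue
      a preceding B-API / I-API span
    """
    if not sample.tokens or len(sample.tokens) != len(sample.labels):
        return False
    previous = "O"
    for token, label in zip(sample.tokens, sample.labels):
        if not token.strip() or label not in VALID_LABELS:
            return False
        if label == "I-API" and previous == "O":
            return False
        previous = label
    return True


def select_samples(samples: list[Sample]) -> list[Sample]:
    """Keep only the samples that pass the structural checks."""
    return [s for s in samples if is_well_formed(s)]
```

A filter like this catches only structural defects. Detecting a sample whose labels are well formed but semantically wrong (for example, a common word tagged as an API) would need a stronger signal, such as agreement with a trained recognizer, which this sketch does not attempt.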

Source journal: Journal of Software-Evolution and Process (COMPUTER SCIENCE, SOFTWARE ENGINEERING)
Self-citation rate: 10.00%
Articles published: 109