Zhang Zhang, Xinjun Mao, Shangwen Wang, Kang Yang, Tanghaoran Zhang, Yao Lu
CARLDA: An Approach for Stack Overflow API Mention Recognition Driven by Context and LLM-Based Data Augmentation

Journal of Software: Evolution and Process, vol. 37, no. 4. Published 2025-04-10. DOI: 10.1002/smr.70015
Abstract
The recognition of Application Programming Interface (API) mentions in software-related texts is vital for extracting API-related knowledge, providing deep insights into API usage and improving developer productivity. Previous research identifies two primary technical challenges in this task: (1) differentiating APIs from common words and (2) identifying morphological variants of standard APIs. While deep learning-based methods have made progress on these challenges, they rely heavily on high-quality labeled data, leading to a third, data-related challenge: (3) the scarcity of such data due to the substantial effort required for labeling. To overcome these challenges, this paper proposes a context-aware API recognition method named CARLDA. This approach uses two key components, Bidirectional Encoder Representations from Transformers (BERT) and Bidirectional Long Short-Term Memory (BiLSTM), to extract context at both the word and sequence levels, capturing syntactic and semantic information to address the first challenge. For the second challenge, it incorporates a character-level BiLSTM with an attention mechanism to capture global character-level context, enhancing the recognition of the morphological features of APIs. To address the third challenge, we developed specialized data augmentation techniques using large language models (LLMs) to tackle both in-library and cross-library data shortages. These techniques generate a variety of labeled samples through targeted transformations (e.g., replacing tokens and restructuring sentences) and hybrid augmentation strategies (e.g., combining real-world and generated data while applying style rules to replicate authentic programming contexts). Given the uncertainty about the quality of LLM-generated samples, we also developed sample selection algorithms to filter out low-quality samples (i.e., incomplete or incorrectly labeled samples).
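To make the task concrete, API mention recognition is typically framed as sequence labeling. The toy sketch below illustrates what the output looks like and why challenge (1) is hard; the tag names, tokenization, and dictionary lookup are illustrative assumptions, not CARLDA's actual pipeline (which uses learned BERT/BiLSTM representations rather than a lookup).

```python
def label_api_mentions(tokens, known_apis):
    """Tag each token as B-API (an API mention) or O (other).

    A naive dictionary-based stand-in for a learned sequence labeler:
    a token is tagged B-API only if it exactly matches a known API form.
    """
    return ["B-API" if t in known_apis else "O" for t in tokens]

tokens = ["You", "can", "call", "list.append", "instead", "of", "insert"]
known_apis = {"list.append", "list.insert"}
print(label_api_mentions(tokens, known_apis))
# → ['O', 'O', 'O', 'B-API', 'O', 'O', 'O']
```

Note that "insert" stays tagged O even though `list.insert` is a known API: as a bare common word it is ambiguous, which is exactly the API-versus-common-word problem (challenge 1) that CARLDA's contextual encoders are designed to resolve.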
Moreover, dedicated datasets were constructed to evaluate CARLDA's ability to address the aforementioned challenges. Experimental results demonstrate that (1) CARLDA significantly improves the F1 score by 11.0% and the Matthews correlation coefficient (MCC) by 10.0% compared to state-of-the-art methods, showing superior overall performance and effectively tackling the first two challenges, and (2) the LLM-based data augmentation techniques yield high-quality labeled data and effectively alleviate the third challenge.
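The augment-then-filter idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: a fixed substitution table stands in for the LLM's token replacement, and the quality filter is a toy rule (labels must align with tokens, and every tagged token must look API-like); the table contents and filter heuristic are assumptions.

```python
# Hypothetical API substitution table standing in for LLM-generated replacements.
SWAPS = {"list.append": "list.extend", "dict.get": "dict.pop"}

def augment(sample):
    """Token-replacement augmentation: swap API tokens to create a new
    labeled sample. Tags carry over because the mention position is unchanged."""
    tokens, tags = sample
    return [SWAPS.get(t, t) for t in tokens], list(tags)

def keep(sample):
    """Toy sample-selection filter: reject incomplete samples (token/tag
    length mismatch) and mislabeled ones (a B-API token with no API-like form)."""
    tokens, tags = sample
    if len(tokens) != len(tags):
        return False  # incomplete sample
    return all("." in t for t, g in zip(tokens, tags) if g == "B-API")

orig = (["call", "list.append", "here"], ["O", "B-API", "O"])
aug = augment(orig)
print(aug)        # → (['call', 'list.extend', 'here'], ['O', 'B-API', 'O'])
print(keep(aug))  # → True
```

In the paper's setting the generated variants come from LLM prompts rather than a static table, and selection must additionally catch subtler failures such as samples whose generated context contradicts the label.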