Zhang Zhang, Xinjun Mao, Shangwen Wang, Kang Yang, Tanghaoran Zhang, Yao Lu
CARLDA: An Approach for Stack Overflow API Mention Recognition Driven by Context and LLM-Based Data Augmentation

Journal of Software: Evolution and Process, vol. 37, no. 4. Published 2025-04-10. DOI: 10.1002/smr.70015
Abstract
The recognition of Application Programming Interface (API) mentions in software-related texts is vital for extracting API-related knowledge, providing deep insights into API usage and improving developer productivity. Previous research identifies two primary technical challenges in this task: (1) differentiating APIs from common words and (2) identifying morphological variants of standard APIs. While deep learning-based methods have made progress on these challenges, they rely heavily on high-quality labeled data, leading to a third, data-related challenge: (3) the scarcity of such data due to the substantial effort required for labeling. To overcome these challenges, this paper proposes a context-aware API recognition method named CARLDA. This approach uses two key components, Bidirectional Encoder Representations from Transformers (BERT) and Bidirectional Long Short-Term Memory (BiLSTM), to extract context at both the word and sequence levels, capturing syntactic and semantic information to address the first challenge. For the second challenge, it incorporates a character-level BiLSTM with an attention mechanism to capture global character-level context, enhancing the recognition of the morphological features of APIs. To address the third challenge, we developed specialized data augmentation techniques using large language models (LLMs) to tackle both in-library and cross-library data shortages. These techniques generate a variety of labeled samples through targeted transformations (e.g., replacing tokens and restructuring sentences) and hybrid augmentation strategies (e.g., combining real-world and generated data while applying style rules to replicate authentic programming contexts). Given the uncertainty about the quality of LLM-generated samples, we also developed sample selection algorithms to filter out low-quality samples (i.e., incomplete or incorrectly labeled samples).
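To make the task concrete, API mention recognition is typically framed as sequence labeling. The toy sketch below illustrates what the output looks like and why challenge (1) is hard; the tag names, tokenization, and dictionary lookup are illustrative assumptions, not CARLDA's actual pipeline (which uses learned BERT/BiLSTM representations rather than a lookup).

```python
def label_api_mentions(tokens, known_apis):
    """Tag each token as B-API (an API mention) or O (other).

    A naive dictionary-based stand-in for a learned sequence labeler:
    a token is tagged B-API only if it exactly matches a known API form.
    """
    return ["B-API" if t in known_apis else "O" for t in tokens]

tokens = ["You", "can", "call", "list.append", "instead", "of", "insert"]
known_apis = {"list.append", "list.insert"}
print(label_api_mentions(tokens, known_apis))
# → ['O', 'O', 'O', 'B-API', 'O', 'O', 'O']
```

Note that "insert" stays tagged O even though `list.insert` is a known API: as a bare common word it is ambiguous, which is exactly the API-versus-common-word problem (challenge 1) that CARLDA's contextual encoders are designed to resolve.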
Moreover, dedicated datasets were constructed to evaluate CARLDA's ability to address the aforementioned challenges. Experimental results demonstrate that (1) CARLDA significantly improves the F1 score by 11.0% and the Matthews correlation coefficient (MCC) by 10.0% compared to state-of-the-art methods, showing superior overall performance and effectively tackling the first two challenges, and (2) the LLM-based data augmentation techniques yield high-quality labeled data and effectively alleviate the third challenge.
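The augment-then-filter idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: a fixed substitution table stands in for the LLM's token replacement, and the quality filter is a toy rule (labels must align with tokens, and every tagged token must look API-like); the table contents and filter heuristic are assumptions.

```python
# Hypothetical API substitution table standing in for LLM-generated replacements.
SWAPS = {"list.append": "list.extend", "dict.get": "dict.pop"}

def augment(sample):
    """Token-replacement augmentation: swap API tokens to create a new
    labeled sample. Tags carry over because the mention position is unchanged."""
    tokens, tags = sample
    return [SWAPS.get(t, t) for t in tokens], list(tags)

def keep(sample):
    """Toy sample-selection filter: reject incomplete samples (token/tag
    length mismatch) and mislabeled ones (a B-API token with no API-like form)."""
    tokens, tags = sample
    if len(tokens) != len(tags):
        return False  # incomplete sample
    return all("." in t for t, g in zip(tokens, tags) if g == "B-API")

orig = (["call", "list.append", "here"], ["O", "B-API", "O"])
aug = augment(orig)
print(aug)        # → (['call', 'list.extend', 'here'], ['O', 'B-API', 'O'])
print(keep(aug))  # → True
```

In the paper's setting the generated variants come from LLM prompts rather than a static table, and selection must additionally catch subtler failures such as samples whose generated context contradicts the label.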