Generative language models potential for requirement engineering applications: insights into current strengths and limitations

Summra Saleem, Muhammad Nabeel Asim, Ludger Van Elst, Andreas Dengel

Complex & Intelligent Systems (published 2025-05-08). DOI: 10.1007/s40747-024-01707-6
Traditional language models have been extensively evaluated in the software engineering domain; however, the potential of ChatGPT and Gemini has not been fully explored. To fill this gap, this paper presents a comprehensive case study investigating the potential of both language models for developing diverse types of requirement engineering applications. It explores in depth the impact of varying levels of expert-knowledge prompts on the prediction accuracy of both language models. Across four public benchmark datasets for requirement engineering tasks, it compares the performance of both language models with existing task-specific machine/deep learning predictors and traditional language models. Specifically, the paper uses four benchmark datasets: Pure (7445 samples, requirements extraction), PROMISE (622 samples, requirements classification), REQuestA (300 question-answer (QA) pairs), and Aerospace (6347 words, requirements NER tagging). Our experiments reveal that, compared to ChatGPT, Gemini requires more careful prompt engineering to produce accurate predictions. On the requirements extraction benchmark, the state-of-the-art F1-score is 0.86, while ChatGPT and Gemini achieved 0.76 and 0.77, respectively. On the requirements classification dataset, the state-of-the-art F1-score is 0.96, while both language models reached 0.78. On the named entity recognition (NER) task, the state-of-the-art F1-score is 0.92, against 0.36 for ChatGPT and 0.25 for Gemini. On the question answering dataset, the state-of-the-art F1-score is 0.90, while ChatGPT and Gemini achieved 0.91 and 0.88, respectively. Overall, except for question answering, both models underperform current state-of-the-art predictors across all tasks.
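As a minimal illustration of the kind of evaluation pipeline the abstract describes (not the authors' actual code), the sketch below classifies requirements as functional or non-functional via a prompted generative model and scores the predictions with F1. The `query_model` stub, the prompt wording, and the label-normalization rule are assumptions; a real run would substitute a concrete ChatGPT or Gemini client and the paper's exact prompts.

```python
# Hypothetical sketch of prompt-based requirements classification with
# F1 evaluation; `query_model` stands in for a ChatGPT/Gemini API call.
from sklearn.metrics import f1_score


PROMPT_TEMPLATE = (
    "You are a requirements engineering expert. Classify the following "
    "software requirement as 'functional' or 'non-functional'. "
    "Answer with a single word.\n\nRequirement: {req}"
)


def query_model(prompt: str) -> str:
    """Placeholder for a call to a generative language model."""
    raise NotImplementedError("plug in a ChatGPT or Gemini client here")


def classify(requirements: list[str]) -> list[str]:
    predictions = []
    for req in requirements:
        answer = query_model(PROMPT_TEMPLATE.format(req=req)).strip().lower()
        # Normalize free-form model output to the two expected labels.
        predictions.append("non-functional" if "non" in answer else "functional")
    return predictions


def evaluate(requirements: list[str], gold: list[str]) -> float:
    """Return macro F1 of the model's predictions against gold labels."""
    return f1_score(gold, classify(requirements), average="macro")
```

Varying the "expert knowledge" in `PROMPT_TEMPLATE` (e.g., adding definitions or worked examples of each class) is one plausible way to realize the prompt-level comparison the study reports.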
Journal overview:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools, and techniques for achieving cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research on which the journal focuses will expand the boundaries of our understanding by investigating the principles and processes underlying many of the most profound problems facing society today.