On-demand generation of high-quality software engineering datasets using large language models and ontologies

IF 3.1 · CAS Zone 2, Computer Science · JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING

Automated Software Engineering Pub Date : 2026-05-04 DOI:10.1007/s10515-026-00617-w

George Bishop, Suranjan Chakraborty, Honghe Zhou, Josh Dehlinger, Lin Deng, Jonah Lin, Benjamin Kist

Automated Software Engineering, vol. 33(3). Journal article. https://link.springer.com/article/10.1007/s10515-026-00617-w


Abstract

Recent advances in generative artificial intelligence (AI) and machine learning (ML) have renewed interest in the long-standing goal of computer-aided software engineering: improving software quality and productivity. Although these techniques have been applied across many software engineering (SE) tasks, their effectiveness depends heavily on access to large, high-quality, labeled, domain-specific datasets. Such datasets remain scarce, particularly in requirements engineering (RE), where research often relies on natural language artifacts. Existing public datasets are typically small, contain labeling ambiguities, and show substantial class imbalance, which restricts the development, evaluation, and reproducibility of AI-driven SE approaches. To address these challenges, this paper presents the O3DG approach, a repeatable method for generating on-demand, high-quality, ontology-aligned datasets using large language models (LLMs). O3DG integrates prompt engineering strategies, domain-specific seed examples, and ML-based validation to synthesize diverse and cohesive datasets suitable for SE research. The approach is demonstrated through two representative RE case studies: the classification of non-functional requirements and the detection of ambiguity in software requirements. For each case, the paper details the O3DG pipeline, ontology mappings, and validation steps that ensure dataset reliability and practical utility. Results show that O3DG produces datasets with strong category cohesion, improved balance across classes, and effective support for ML training. More broadly, the study illustrates how LLM-assisted dataset synthesis can help overcome persistent data limitations and provides a transferable process for producing high-quality datasets in additional SE domains.




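Source journal

Automated Software Engineering (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 4.80

Self-citation rate: 11.80%

Articles published: not listed

Review time: >12 weeks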

Journal description: This journal publishes research papers, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems, as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.