On-demand generation of high-quality software engineering datasets using large language models and ontologies
George Bishop, Suranjan Chakraborty, Honghe Zhou, Josh Dehlinger, Lin Deng, Jonah Lin, Benjamin Kist
Automated Software Engineering, vol. 33, no. 3. Published 2026-05-04. DOI: 10.1007/s10515-026-00617-w
https://link.springer.com/article/10.1007/s10515-026-00617-w
Abstract
Recent advances in generative artificial intelligence (AI) and machine learning (ML) have renewed interest in realizing the long-standing goal of computer-aided software engineering by improving software quality and productivity. Although these techniques have been applied across many software engineering (SE) tasks, their effectiveness depends heavily on access to large, high-quality, labeled, domain-specific datasets. Such datasets remain scarce, particularly in requirements engineering (RE), where research often relies on natural language artifacts. Existing public datasets are typically small, contain labeling ambiguities, and show substantial class imbalance, which restricts the development, evaluation, and reproducibility of AI-driven SE approaches. To address these challenges, this paper presents the O3DG approach, a repeatable method for generating on-demand, high-quality, ontology-aligned datasets using large language models (LLMs). O3DG integrates prompt engineering strategies, domain-specific seed examples, and ML-based validation to synthesize diverse and cohesive datasets suitable for SE research. The approach is demonstrated through two representative RE case studies: the classification of non-functional requirements and the detection of ambiguity in software requirements. For each case, the paper details the O3DG pipeline, ontology mappings, and validation steps that ensure dataset reliability and practical utility. Results show that O3DG produces datasets with strong category cohesion, improved balance across classes, and effective support for ML training. More broadly, the study illustrates how LLM-assisted dataset synthesis can help overcome persistent data limitations and provides a transferable process for producing high-quality datasets across additional SE domains.
About the journal:
This journal publishes research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.