Hybrid approach for multilevel multi-class requirement classification: Impact of stop-word removal and data augmentation

IF 4.1 · CAS Tier 2 (Computer Science) · JCR Q1, Computer Science, Software Engineering
Jasleen Kaur, Chanchal Roy
{"title":"多级多类需求分类的混合方法:停止词去除和数据增强的影响","authors":"Jasleen Kaur ,&nbsp;Chanchal Roy","doi":"10.1016/j.jss.2025.112594","DOIUrl":null,"url":null,"abstract":"<div><div>Requirement classification in software engineering is essential for effective development. Automating this process reduces human effort and enhances decision-making. Previous studies experimented with machine learning and deep learning models to classify requirements. This novel research fills that gap by evaluating transformer-based models and a proposed Hybrid Stacked Model for multilevel, multi-class classification task. To address the limitations of existing software requirement datasets (imbalanced dataset, insufficient granularity, real world examples), we combined instances from the PROMISE_exp dataset, PURE corpus, and 20 manually collected software requirement specifications (SRS) documents using a Boolean keyword search to create a multilevel, multi-class dataset. These 3072 combined requirements are organized into a two-level hierarchy: Level 1 (functional (FR)/non-functional (NFR)); Level 2 (FRs: core functional (CFR)/derived functional (DFR)/system integration (SI)/external dependency (ED); NFRs: product (PR)/organizational (OR)/external (ER)). We applied BERT-based context-aware text augmentation to address class imbalance by expanding the dataset to 3343 instances. This study also investigates the effects of domain-specific stopword removal and text augmentation on model performance. Results show that text augmentation boosts accuracy by 0.2–3.76% across all models. Stopword removal enhances precision and recall by reducing noise, but it slightly lowers overall accuracy due to the loss of some semantic cues. The proposed Hybrid Stacked Model outperformed all pre-trained transformer models, achieving the highest accuracy of 96.77% at Level 1 and 83.06% at Level 2. A statistical t-test confirms the significance of these improvements. These findings emphasize the importance of hybrid models and domain-specific data preprocessing in enhancing requirement classification, with practical implications for automating early-stage software engineering tasks.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112594"},"PeriodicalIF":4.1000,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hybrid approach for multilevel multi-class requirement classification: Impact of stop-word removal and data augmentation\",\"authors\":\"Jasleen Kaur ,&nbsp;Chanchal Roy\",\"doi\":\"10.1016/j.jss.2025.112594\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Requirement classification in software engineering is essential for effective development. Automating this process reduces human effort and enhances decision-making. Previous studies experimented with machine learning and deep learning models to classify requirements. This novel research fills that gap by evaluating transformer-based models and a proposed Hybrid Stacked Model for multilevel, multi-class classification task. To address the limitations of existing software requirement datasets (imbalanced dataset, insufficient granularity, real world examples), we combined instances from the PROMISE_exp dataset, PURE corpus, and 20 manually collected software requirement specifications (SRS) documents using a Boolean keyword search to create a multilevel, multi-class dataset. 
These 3072 combined requirements are organized into a two-level hierarchy: Level 1 (functional (FR)/non-functional (NFR)); Level 2 (FRs: core functional (CFR)/derived functional (DFR)/system integration (SI)/external dependency (ED); NFRs: product (PR)/organizational (OR)/external (ER)). We applied BERT-based context-aware text augmentation to address class imbalance by expanding the dataset to 3343 instances. This study also investigates the effects of domain-specific stopword removal and text augmentation on model performance. Results show that text augmentation boosts accuracy by 0.2–3.76% across all models. Stopword removal enhances precision and recall by reducing noise, but it slightly lowers overall accuracy due to the loss of some semantic cues. The proposed Hybrid Stacked Model outperformed all pre-trained transformer models, achieving the highest accuracy of 96.77% at Level 1 and 83.06% at Level 2. A statistical t-test confirms the significance of these improvements. These findings emphasize the importance of hybrid models and domain-specific data preprocessing in enhancing requirement classification, with practical implications for automating early-stage software engineering tasks.</div></div>\",\"PeriodicalId\":51099,\"journal\":{\"name\":\"Journal of Systems and Software\",\"volume\":\"231 \",\"pages\":\"Article 112594\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2025-08-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Systems and Software\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0164121225002638\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121225002638","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

Requirement classification in software engineering is essential for effective development. Automating this process reduces human effort and enhances decision-making. Previous studies experimented with machine learning and deep learning models to classify requirements. This research addresses that gap by evaluating transformer-based models and a proposed Hybrid Stacked Model for a multilevel, multi-class classification task. To address the limitations of existing software requirement datasets (class imbalance, insufficient granularity, and a lack of real-world examples), we combined instances from the PROMISE_exp dataset, the PURE corpus, and 20 manually collected software requirement specification (SRS) documents using a Boolean keyword search to create a multilevel, multi-class dataset. These 3072 combined requirements are organized into a two-level hierarchy: Level 1 (functional (FR) / non-functional (NFR)); Level 2 (FRs: core functional (CFR) / derived functional (DFR) / system integration (SI) / external dependency (ED); NFRs: product (PR) / organizational (OR) / external (ER)). We applied BERT-based context-aware text augmentation to address class imbalance, expanding the dataset to 3343 instances. The study also investigates the effects of domain-specific stop-word removal and text augmentation on model performance. Results show that text augmentation boosts accuracy by 0.2–3.76% across all models. Stop-word removal enhances precision and recall by reducing noise, but slightly lowers overall accuracy due to the loss of some semantic cues. The proposed Hybrid Stacked Model outperformed all pre-trained transformer models, achieving the highest accuracy of 96.77% at Level 1 and 83.06% at Level 2. A statistical t-test confirms the significance of these improvements. These findings emphasize the importance of hybrid models and domain-specific data preprocessing in enhancing requirement classification, with practical implications for automating early-stage software engineering tasks.
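The abstract names the two preprocessing factors whose impact is measured, domain-specific stop-word removal and BERT-based context-aware augmentation, but does not show how they are applied. The sketch below is an illustrative reconstruction, not the authors' implementation: it assumes a Hugging Face fill-mask pipeline with bert-base-uncased as the contextual model, and the DOMAIN_STOPWORDS set, helper names, and augmentation settings are hypothetical.

```python
"""Minimal sketch (not the authors' code) of the two preprocessing factors
studied: domain-specific stop-word removal and BERT-based context-aware
text augmentation. Stop-word list and settings are illustrative assumptions."""
import random
from transformers import pipeline  # Hugging Face transformers

# Hypothetical domain-specific stop-words for requirement texts (not from the paper).
DOMAIN_STOPWORDS = {"system", "shall", "should", "must", "will", "be", "the", "a", "an"}

def remove_domain_stopwords(text: str) -> str:
    """Drop domain-specific stop-words; reduces noise at the cost of some semantic cues."""
    return " ".join(tok for tok in text.split() if tok.lower() not in DOMAIN_STOPWORDS)

# BERT masked language model used to propose in-context replacement words.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment_with_bert(text: str, n_words: int = 1, seed: int = 0) -> str:
    """Create one augmented variant by masking random words and letting BERT refill them."""
    rng = random.Random(seed)
    tokens = text.split()
    for _ in range(n_words):
        i = rng.randrange(len(tokens))
        masked = tokens.copy()
        masked[i] = fill_mask.tokenizer.mask_token      # "[MASK]" for BERT
        best = fill_mask(" ".join(masked))[0]           # highest-scoring contextual fill
        tokens[i] = best["token_str"].strip()
    return " ".join(tokens)

if __name__ == "__main__":
    req = "The system shall notify the administrator when disk usage exceeds 90 percent."
    print(remove_domain_stopwords(req))
    print(augment_with_bert(req, n_words=2))
```

In the study, augmentation of this kind expanded the combined dataset from 3072 to 3343 instances, presumably by generating extra variants for under-represented classes; the exact masking strategy and augmentation ratio are not stated in the abstract.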
Source journal
Journal of Systems and Software (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 8.60
Self-citation rate: 5.70%
Annual article output: 193
Review time: 16 weeks
Journal description: The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.