{"title":"Leveraging LLM-based data augmentation for automatic classification of recurring tasks in software development projects","authors":"Włodzimierz Wysocki , Mirosław Ochodek","doi":"10.1016/j.jss.2025.112641","DOIUrl":null,"url":null,"abstract":"<div><h3>Background:</h3><div>Issue tracking systems (ITS) store project task data that is valuable for analytics and simulation. Projects typically include two types of tasks: stateful and recurring. While stateful tasks can be automatically categorized with relative ease, categorizing recurring tasks remains challenging. Prior research indicates that a key difficulty may lie in the underrepresentation of certain task types, which leads to severely imbalanced training datasets and hampers the accuracy of machine-learning models for task categorization.</div></div><div><h3>Aims:</h3><div>The goal of this study is to evaluate whether leveraging large language models (LLM) for data augmentation can enhance the machine-learning-based categorization of recurring tasks in software projects.</div></div><div><h3>Method:</h3><div>We conduct our study on a dataset from six industrial projects comprising 9,589 tasks. To address class imbalance, we up-sample minority classes during training via data augmentation using LLMs and several prompting strategies, assessing their impact on prediction quality. For each project, we perform time-series 5-fold cross-validation and evaluate the classifiers using state-of-the-art metrics — Accuracy, Precision, Recall, F1-score, and MCC — as well as practice-inspired metric called Monthly Classification Error (MCE) that assess the impact of task misclassification on project planning and resource allocation. Our machine-learning pipeline employs Transformer-based sentence embeddings and XGBoost classifiers.</div></div><div><h3>Results:</h3><div>The model automatically classifies software process tasks into 14 classes, achieving MCC values between 0.71 and 0.76. 
We observed higher prediction quality for the largest projects in the dataset and for those managed using “traditional” project management methodologies. Moreover, employing intra-project data augmentation strategies reduced the MCE error by up to 43%.</div></div><div><h3>Conclusions:</h3><div>Our findings indicate that large language models (LLMs) can be used to mitigate the impact of imbalanced task categories, thereby enhancing the performance of classification models even with limited training data.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112641"},"PeriodicalIF":4.1000,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121225003103","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
Background:
Issue tracking systems (ITS) store project task data that is valuable for analytics and simulation. Projects typically include two types of tasks: stateful and recurring. While stateful tasks can be automatically categorized with relative ease, categorizing recurring tasks remains challenging. Prior research indicates that a key difficulty may lie in the underrepresentation of certain task types, which leads to severely imbalanced training datasets and hampers the accuracy of machine-learning models for task categorization.
Aims:
The goal of this study is to evaluate whether leveraging large language models (LLMs) for data augmentation can enhance the machine-learning-based categorization of recurring tasks in software projects.
Method:
We conduct our study on a dataset of 9,589 tasks from six industrial projects. To address class imbalance, we up-sample minority classes during training via LLM-based data augmentation, comparing several prompting strategies and assessing their impact on prediction quality. For each project, we perform time-series 5-fold cross-validation and evaluate the classifiers using standard metrics (Accuracy, Precision, Recall, F1-score, and MCC) as well as a practice-inspired metric, Monthly Classification Error (MCE), which assesses the impact of task misclassification on project planning and resource allocation. Our machine-learning pipeline employs Transformer-based sentence embeddings and XGBoost classifiers.
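The time-series cross-validation described above keeps the temporal order of tasks: each fold trains only on tasks that chronologically precede the held-out window, so the classifier never sees the future. A minimal expanding-window sketch in pure Python (the function name and the even fold sizing are illustrative assumptions, not the authors' exact protocol):

```python
def time_series_folds(n_samples, n_folds=5):
    """Expanding-window splits over chronologically ordered indices.

    Fold i trains on all samples before its test window, so training
    data always precedes test data in time.
    """
    fold_size = n_samples // (n_folds + 1)
    folds = []
    for i in range(1, n_folds + 1):
        train_end = i * fold_size
        test_end = train_end + fold_size if i < n_folds else n_samples
        folds.append((list(range(train_end)),
                      list(range(train_end, test_end))))
    return folds
```

With 12 tasks and 5 folds, the first fold trains on tasks 0-1 and tests on 2-3; the last trains on 0-9 and tests on 10-11. Libraries such as scikit-learn offer an equivalent `TimeSeriesSplit` utility.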
Results:
The model automatically classifies software process tasks into 14 classes, achieving MCC values between 0.71 and 0.76. We observed higher prediction quality for the largest projects in the dataset and for those managed using “traditional” project management methodologies. Moreover, employing intra-project data augmentation strategies reduced MCE by up to 43%.
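MCC summarizes the full confusion matrix in a single chance-corrected score; for a 14-class problem like the one above, the generalized multiclass form (Gorodkin, 2004) applies. A minimal pure-Python sketch of that formula (the abstract does not detail how the authors computed it, so this is the standard definition, not their code):

```python
from collections import Counter
from math import sqrt

def multiclass_mcc(y_true, y_pred):
    """Generalized (multiclass) Matthews correlation coefficient.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_counts = Counter(y_true)                        # true count per class
    p_counts = Counter(y_pred)                        # predicted count per class
    classes = set(t_counts) | set(p_counts)
    sum_tp = sum(t_counts[k] * p_counts[k] for k in classes)
    sum_t2 = sum(t_counts[k] ** 2 for k in classes)
    sum_p2 = sum(p_counts[k] ** 2 for k in classes)
    denom = sqrt((s * s - sum_p2) * (s * s - sum_t2))
    return (c * s - sum_tp) / denom if denom else 0.0
```

This matches scikit-learn's `matthews_corrcoef` on multiclass inputs; unlike Accuracy, it is not inflated by majority-class performance, which is why it suits the imbalanced setting studied here.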
Conclusions:
Our findings indicate that large language models (LLMs) can be used to mitigate the impact of imbalanced task categories, thereby enhancing the performance of classification models even with limited training data.
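The augmentation idea in the conclusions can be pictured as a simple up-sampling loop: minority-class task summaries are paraphrased until every class reaches the size of the largest one. The sketch below is a hypothetical illustration, not the authors' pipeline; `paraphrase` stands in for any callable, e.g. a wrapper around an LLM chat API with a rewording prompt:

```python
from collections import Counter

def augment_minority_classes(tasks, paraphrase, target=None):
    """Up-sample minority classes by paraphrasing existing task summaries.

    tasks      -- list of (summary, label) pairs
    paraphrase -- callable returning a reworded summary (e.g. an LLM call)
    target     -- per-class size to reach; defaults to the majority class size
    """
    counts = Counter(label for _, label in tasks)
    target = target or max(counts.values())
    augmented = list(tasks)
    for label, n in counts.items():
        originals = [s for s, l in tasks if l == label]
        for i in range(target - n):               # cycle through originals
            augmented.append((paraphrase(originals[i % n]), label))
    return augmented
```

Only the training folds would be augmented this way; evaluation data must stay untouched, or the metrics would be measured on synthetic text.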
About the journal:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
•Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
•Agile, model-driven, service-oriented, open source and global software development
•Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
•Human factors and management concerns of software development
•Data management and big data issues of software systems
•Metrics and evaluation, data mining of software development resources
•Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.