{"title":"Leveraging LLM-based data augmentation for automatic classification of recurring tasks in software development projects","authors":"Włodzimierz Wysocki , Mirosław Ochodek","doi":"10.1016/j.jss.2025.112641","DOIUrl":null,"url":null,"abstract":"<div><h3>Background:</h3><div>Issue tracking systems (ITS) store project task data that is valuable for analytics and simulation. Projects typically include two types of tasks: stateful and recurring. While stateful tasks can be automatically categorized with relative ease, categorizing recurring tasks remains challenging. Prior research indicates that a key difficulty may lie in the underrepresentation of certain task types, which leads to severely imbalanced training datasets and hampers the accuracy of machine-learning models for task categorization.</div></div><div><h3>Aims:</h3><div>The goal of this study is to evaluate whether leveraging large language models (LLM) for data augmentation can enhance the machine-learning-based categorization of recurring tasks in software projects.</div></div><div><h3>Method:</h3><div>We conduct our study on a dataset from six industrial projects comprising 9,589 tasks. To address class imbalance, we up-sample minority classes during training via data augmentation using LLMs and several prompting strategies, assessing their impact on prediction quality. For each project, we perform time-series 5-fold cross-validation and evaluate the classifiers using state-of-the-art metrics — Accuracy, Precision, Recall, F1-score, and MCC — as well as practice-inspired metric called Monthly Classification Error (MCE) that assess the impact of task misclassification on project planning and resource allocation. Our machine-learning pipeline employs Transformer-based sentence embeddings and XGBoost classifiers.</div></div><div><h3>Results:</h3><div>The model automatically classifies software process tasks into 14 classes, achieving MCC values between 0.71 and 0.76. 
We observed higher prediction quality for the largest projects in the dataset and for those managed using “traditional” project management methodologies. Moreover, employing intra-project data augmentation strategies reduced the MCE error by up to 43%.</div></div><div><h3>Conclusions:</h3><div>Our findings indicate that large language models (LLMs) can be used to mitigate the impact of imbalanced task categories, thereby enhancing the performance of classification models even with limited training data.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112641"},"PeriodicalIF":4.1000,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121225003103","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
Background:
Issue tracking systems (ITS) store project task data that is valuable for analytics and simulation. Projects typically include two types of tasks: stateful and recurring. While stateful tasks can be automatically categorized with relative ease, categorizing recurring tasks remains challenging. Prior research indicates that a key difficulty may lie in the underrepresentation of certain task types, which leads to severely imbalanced training datasets and hampers the accuracy of machine-learning models for task categorization.
Aims:
The goal of this study is to evaluate whether leveraging large language models (LLMs) for data augmentation can enhance the machine-learning-based categorization of recurring tasks in software projects.
Method:
We conduct our study on a dataset of 9,589 tasks from six industrial projects. To address class imbalance, we up-sample minority classes during training via LLM-based data augmentation, comparing several prompting strategies and assessing their impact on prediction quality. For each project, we perform time-series 5-fold cross-validation and evaluate the classifiers using standard metrics (Accuracy, Precision, Recall, F1-score, and MCC) as well as a practice-inspired metric, Monthly Classification Error (MCE), which assesses the impact of task misclassification on project planning and resource allocation. Our machine-learning pipeline employs Transformer-based sentence embeddings and XGBoost classifiers.
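The time-series cross-validation described above keeps the temporal order of tasks: each fold trains only on tasks that chronologically precede the held-out window, so the classifier never sees the future. A minimal expanding-window sketch in pure Python (the function name and the even fold sizing are illustrative assumptions, not the authors' exact protocol):

```python
def time_series_folds(n_samples, n_folds=5):
    """Expanding-window splits over chronologically ordered indices.

    Fold i trains on all samples before its test window, so training
    data always precedes test data in time.
    """
    fold_size = n_samples // (n_folds + 1)
    folds = []
    for i in range(1, n_folds + 1):
        train_end = i * fold_size
        test_end = train_end + fold_size if i < n_folds else n_samples
        folds.append((list(range(train_end)),
                      list(range(train_end, test_end))))
    return folds
```

With 12 tasks and 5 folds, the first fold trains on tasks 0-1 and tests on 2-3; the last trains on 0-9 and tests on 10-11. Libraries such as scikit-learn offer an equivalent `TimeSeriesSplit` utility.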
Results:
The model automatically classifies software process tasks into 14 classes, achieving MCC values between 0.71 and 0.76. We observed higher prediction quality for the largest projects in the dataset and for those managed using “traditional” project management methodologies. Moreover, employing intra-project data augmentation strategies reduced MCE by up to 43%.
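MCC summarizes the full confusion matrix in a single chance-corrected score; for a 14-class problem like the one above, the generalized multiclass form (Gorodkin, 2004) applies. A minimal pure-Python sketch of that formula (the abstract does not detail how the authors computed it, so this is the standard definition, not their code):

```python
from collections import Counter
from math import sqrt

def multiclass_mcc(y_true, y_pred):
    """Generalized (multiclass) Matthews correlation coefficient.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_counts = Counter(y_true)                        # true count per class
    p_counts = Counter(y_pred)                        # predicted count per class
    classes = set(t_counts) | set(p_counts)
    sum_tp = sum(t_counts[k] * p_counts[k] for k in classes)
    sum_t2 = sum(t_counts[k] ** 2 for k in classes)
    sum_p2 = sum(p_counts[k] ** 2 for k in classes)
    denom = sqrt((s * s - sum_p2) * (s * s - sum_t2))
    return (c * s - sum_tp) / denom if denom else 0.0
```

This matches scikit-learn's `matthews_corrcoef` on multiclass inputs; unlike Accuracy, it is not inflated by majority-class performance, which is why it suits the imbalanced setting studied here.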
Conclusions:
Our findings indicate that large language models (LLMs) can be used to mitigate the impact of imbalanced task categories, thereby enhancing the performance of classification models even with limited training data.
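The augmentation idea in the conclusions can be pictured as a simple up-sampling loop: minority-class task summaries are paraphrased until every class reaches the size of the largest one. The sketch below is a hypothetical illustration, not the authors' pipeline; `paraphrase` stands in for any callable, e.g. a wrapper around an LLM chat API with a rewording prompt:

```python
from collections import Counter

def augment_minority_classes(tasks, paraphrase, target=None):
    """Up-sample minority classes by paraphrasing existing task summaries.

    tasks      -- list of (summary, label) pairs
    paraphrase -- callable returning a reworded summary (e.g. an LLM call)
    target     -- per-class size to reach; defaults to the majority class size
    """
    counts = Counter(label for _, label in tasks)
    target = target or max(counts.values())
    augmented = list(tasks)
    for label, n in counts.items():
        originals = [s for s, l in tasks if l == label]
        for i in range(target - n):               # cycle through originals
            augmented.append((paraphrase(originals[i % n]), label))
    return augmented
```

Only the training folds would be augmented this way; evaluation data must stay untouched, or the metrics would be measured on synthetic text.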
About the journal:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
•Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
•Agile, model-driven, service-oriented, open source and global software development
•Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
•Human factors and management concerns of software development
•Data management and big data issues of software systems
•Metrics and evaluation, data mining of software development resources
•Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.