MLAPW: A framework to assess the impact of feature selection and sampling techniques on anti-pattern prediction using WSDL metrics

IF 1.8 | Zone 3, Computer Science | Q3 Computer Science, Software Engineering

Journal of Computer Languages | Pub Date: 2025-02-01 | DOI: 10.1016/j.cola.2025.101322

Lov Kumar, Vikram Singh, Lalita Bhanu Murthy, Aneesh Krishna, Sanjay Misra

Volume 83, Article 101322. Full text: https://www.sciencedirect.com/science/article/pii/S2590118425000085

Citations: 0

Abstract

Context:

Frequent changes can degrade the quality and design of Service-Based Systems and introduce design flaws, known as anti-patterns, that harm software design quality. These anti-patterns strongly affect the overall maintainability of Service-Based Systems. Early detection of anti-patterns, together with co-located modifications, therefore becomes essential. However, finding them manually is not easy.

Objective:

The objective of this work is to explore the role of WSDL (Web Services Description Language) metrics in anti-pattern prediction using a Machine Learning (ML) based framework, MLAPW. The framework encompasses different variants of feature selection techniques, data sampling techniques, and a wide range of ML algorithms. The work empirically investigates the predictive ability of anti-pattern prediction models developed using different sets of WSDL metrics. The major focus is to investigate "how accurately these metrics predict the different types of anti-patterns present in a WSDL file".

Methods:

To achieve this objective, different sets of WSDL metrics (Structural Quality Metrics, Procedural Quality Metrics, Data Quality Metrics, Quality Metrics, and Complexity Metrics) are used as input to the anti-pattern prediction models. Because the models take WSDL metrics as input, feature selection methods are also applied to find the best subsets of these metrics. The models are trained with a variety of machine-learning techniques, and the study also reports their performance when trained on data balanced with data sampling techniques. Finally, the techniques are compared empirically using accuracy and the area under the receiver operating characteristic (ROC) curve (AUC), together with hypothesis testing.
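The AUC used as the main evaluation measure can be illustrated with a minimal sketch. The abstract does not say how the authors computed it; the version below uses the standard pairwise (Mann-Whitney) identity, and the function name `auc_score` and the 0/1 label encoding are assumptions made for illustration only.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: sequence of 0/1 values (1 = anti-pattern present; encoding assumed here)
    scores: sequence of predicted scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Count positive/negative pairs that are ordered correctly; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example with invented scores (not data from the study).
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to random ordering of positives and negatives, and 1.0 to a perfect ordering, which is why it is less sensitive to class imbalance than plain accuracy.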

Results:

The empirical observations are based on 226 WSDL files from domains such as finance, tourism, health, and education. Models trained using the WSDL metrics have a mean AUC of 0.79 and a median AUC of 0.90. Models trained on features selected with classifier feature subset selection (CFS) perform better, with a mean AUC of 0.80 and a median AUC of 0.97. The results also confirm that models trained on up-sampled data (UPSAM) perform better, with a mean AUC of 0.79 and a median AUC of 0.91, and a low Friedman rank of 2.40. Finally, models trained using the least squares support vector machine (LSSVM) achieved a median AUC of 1, a mean AUC of 0.99, and a low Friedman rank of 1.30.
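The Friedman ranks reported above can be read as mean ranks: each technique is ranked within every dataset (rank 1 for the highest AUC), and the ranks are then averaged, so a lower value indicates a technique that is more consistently near the top. The sketch below shows that calculation under these assumptions; the helper names and the AUC values are invented for illustration and are not taken from the paper.

```python
from statistics import mean


def rank_row(values):
    """Rank one row of scores; rank 1 = highest, ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman_mean_ranks(auc_table):
    """auc_table maps technique name -> list of AUCs, one per dataset (same order)."""
    names = list(auc_table)
    n_datasets = len(auc_table[names[0]])
    per_dataset = [
        rank_row([auc_table[t][d] for t in names]) for d in range(n_datasets)
    ]
    return {t: mean(row[k] for row in per_dataset) for k, t in enumerate(names)}


# Invented AUC values for three hypothetical techniques on three datasets.
table = {
    "technique_a": [0.9, 0.8, 0.7],
    "technique_b": [0.8, 0.8, 0.9],
    "technique_c": [0.5, 0.6, 0.6],
}
print(friedman_mean_ranks(table))
```

The mean ranks alone do not establish significance; a Friedman test statistic computed from them would be needed for the hypothesis testing the abstract mentions.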

Conclusion:

The experimental results show that models trained using Data and Procedural Quality Metrics have higher AUC values than models trained on the other metric sets. Prediction performance also improved significantly after feature selection was applied. Models trained using advanced classifiers and ensemble learning have higher AUC values than those trained with other techniques. Based on this research, it is reasonable to claim that data sampling techniques help improve the models' prediction capability: models trained on data sampled with UPSAM (up-sampling) achieved a median AUC of 0.91 and a mean AUC of 0.79.
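As an illustration of what up-sampling does, the sketch below balances a dataset by randomly duplicating rows of the smaller classes until every class matches the largest one. The abstract does not describe the exact UPSAM procedure, so random duplication is an assumption here, and the function name `upsample` and the toy rows are hypothetical.

```python
import random


def upsample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are the same size.

    This is plain random up-sampling, assumed for illustration; it is not
    necessarily the UPSAM variant used in the study.
    """
    rng = random.Random(seed)

    # Group rows by class label.
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)

    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for y, members in by_class.items():
        # Draw extra rows with replacement to reach the target class size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


# Toy metric vectors (invented): four negatives and one positive.
rows = [[1], [2], [3], [4], [5]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = upsample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

When a sketch like this is used for evaluation, the sampling is normally applied to the training split only, so that duplicated rows do not leak into the test data.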

Source journal

Journal of Computer Languages Computer Science-Computer Networks and Communications

CiteScore

5.00

Self-citation rate

13.60%

Number of articles published