A preordonance-based decision tree method and its parallel implementation in the framework of Map-Reduce

IF 7.2 · CAS Region 1 (Computer Science) · JCR Q1, Computer Science, Artificial Intelligence
Hasna Chamlal, Fadwa Aaboub, Tayeb Ouaderhman
Journal: Applied Soft Computing
DOI: 10.1016/j.asoc.2024.112261
Published: 2024-09-23 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S1568494624010354
Citations: 0

Abstract

In supervised classification, decision trees are among the most popular learning algorithms, employed in many practical applications owing to their simplicity, adaptability, and other advantages. The development of effective and efficient decision trees remains a major focus in machine learning, and the literature provides various node-splitting measures for producing different decision trees, including Information Gain, Gain Ratio, Average Gain, and the Gini Index. This paper presents a new node-splitting metric based on preordonance theory. The primary benefit of the new split criterion is its ability to handle categorical or numerical attributes directly, without discretization. Building on it, the “Preordonance-based decision tree” (P-Tree) approach, a technique that grows decision trees using the proposed node-splitting measure, is developed. The P-Tree strategy handles both multiclass classification problems and imbalanced data sets. Moreover, it addresses the over-partitioning problem by introducing a threshold ϵ as a stopping condition: if the percentage of instances in a node falls below this threshold, expansion of the tree is halted. The performance of the P-Tree procedure is evaluated on fourteen benchmark data sets of different sizes and contrasted with five existing decision tree methods using a variety of evaluation metrics. The experiments show that the P-Tree model performs well across all tested data sets and is comparable to the other five decision tree algorithms overall. In addition, an ensemble technique called “ensemble P-Tree” offers a remedy for the instability frequently associated with tree-based algorithms.
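The preordonance-based criterion itself is not specified in this abstract, so the sketch below only illustrates two things the abstract does state: the classical Gini Index it compares against, and the ϵ stopping rule (halt expansion when a node holds too small a fraction of the training instances). Function names and the default ϵ are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label multiset: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index_of_split(parent_labels, child_label_lists):
    """Weighted impurity of the children; lower means a better split."""
    n = len(parent_labels)
    return sum(len(ch) / n * gini(ch) for ch in child_label_lists)

def should_stop(node_size, total_size, eps=0.05):
    """P-Tree-style stopping rule from the abstract: halt when the node
    holds less than a fraction eps of all training instances.
    (eps = 0.05 is an assumed example value.)"""
    return node_size / total_size < eps
```

A pure split such as `gini_index_of_split(["a","a","b","b"], [["a","a"], ["b","b"]])` scores 0.0, while a node with 3 of 100 instances triggers `should_stop` at ϵ = 0.05.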
This ensemble method leverages the strengths of the P-Tree approach to enhance predictive performance through collective decision-making. The ensemble P-Tree strategy is evaluated by comparing its performance to that of two top-performing ensemble decision tree methodologies, and the experimental findings highlight its strong performance and competitiveness against other decision tree procedures. Despite the good performance of the P-Tree approach, obstacles such as memory restrictions, time complexity, and data complexity still prevent it from handling larger data sets; parallel computing is an effective way to resolve this kind of problem. Hence, the MR-P-Tree technique, a parallel implementation of the P-Tree strategy in the Map-Reduce framework, is further designed. Its primary basis is three parallel procedures, MR-SA-S, MR-SP-S, and MR-S-DS, which respectively choose the optimal splitting attributes, choose the optimal splitting points, and divide the training data set in parallel. Finally, experimental studies on ten additional data sets illustrate the viability of the MR-P-Tree technique and its strong parallel performance.
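The abstract only names the three parallel procedures, so the following is a minimal single-process sketch of the general Map-Reduce pattern behind attribute selection (the MR-SA-S idea): mappers emit (attribute, value, class) counts from each data partition, a reducer merges the partial counts, and attributes are then scored from the aggregated counts. Since the preordonance criterion is not given here, the Gini index stands in as the scoring measure; all names are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(partition):
    """Mapper: emit (attribute, value, class) -> count from one data split.
    Each row is (dict of attribute -> value, class label)."""
    out = defaultdict(int)
    for row, label in partition:
        for attr, value in row.items():
            out[(attr, value, label)] += 1
    return out

def reduce_phase(mapped_outputs):
    """Reducer: merge the partial counts produced by all mappers."""
    totals = defaultdict(int)
    for partial in mapped_outputs:
        for key, count in partial.items():
            totals[key] += count
    return totals

def best_attribute(totals):
    """Score each attribute by the weighted Gini impurity of its value
    groups (computed from aggregated counts only) and return the best,
    i.e. lowest-scoring, attribute."""
    by_attr = {}
    for (attr, value, label), c in totals.items():
        group = by_attr.setdefault(attr, {}).setdefault(value, {})
        group[label] = group.get(label, 0) + c
    best, best_score = None, float("inf")
    for attr, groups in by_attr.items():
        n = sum(c for g in groups.values() for c in g.values())
        score = 0.0
        for g in groups.values():
            m = sum(g.values())
            score += m / n * (1.0 - sum((c / m) ** 2 for c in g.values()))
        if score < best_score:
            best, best_score = attr, score
    return best
```

Because mappers see only their own partition and the reducer sees only counts, this decomposition is what lets a real Map-Reduce framework distribute the work across machines.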
Source journal: Applied Soft Computing (Engineering & Technology – Computer Science: Interdisciplinary Applications)
CiteScore: 15.80
Self-citation rate: 6.90%
Articles published per year: 874
Average review time: 10.9 months

About the journal: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the web site will continuously be updated with new articles and the publication time will be short.