Making data classification more effective: An automated deep forest model

IF 10.4 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Industrial Information Integration | Pub Date: 2024-11-01 | DOI: 10.1016/j.jii.2024.100738

Jingwei Guo, Xiang Guo, Yihui Tian, Hao Zhan, Zhen-Song Chen, Muhammet Deveci

Volume 42, Article 100738 | https://www.sciencedirect.com/science/article/pii/S2452414X2400181X

Citations: 0

Abstract

Although the deep forest model and its variants carry only a small risk of overfitting, they cannot automatically adapt to data features; they rely on manual experience and comparative experiments to select forest learners. This study proposes an automated deep forest model (ATDF) that enhances deep forest automation by automatically determining the types and numbers of forest learners from the training data. The model introduces a forest learner variability measure based on normalized mutual information, which serves as the theoretical foundation for the automated process in deep forests. A novel hierarchical clustering algorithm based on normalized mutual information then groups forest learners at different granularities to determine the optimal forest learner types. This method can determine the model structure of stacking models in general, including deep forests. Finally, with the goal of maximizing cross-validation scores, a Bayesian optimization algorithm based on the tree-structured Parzen estimator (TPE) determines the ideal number of forest learners of each type. Additionally, a standardized method for identifying forest learners is developed to guarantee the consistency of model outcomes. A series of comparative experiments on seven datasets from the UCI Machine Learning Repository confirmed the effectiveness and superiority of the proposed model. The results demonstrate that the proposed model is highly automated, adapts well to new data and tasks, and performs excellently on the classification task.
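To make the idea of grouping forest learners by normalized mutual information more concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that the similarity of two learners is the NMI between their predicted labels on the same samples, that the distance is 1 - NMI, and that clusters are merged with average linkage until a distance threshold is exceeded. The names `nmi`, `cluster_learners`, and `threshold`, the arithmetic-mean normalization, and the toy predictions are all illustrative assumptions that do not come from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information with arithmetic-mean normalization (assumed)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    # MI = sum over joint labels of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    mi = sum((c / n) * math.log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    denom = (entropy(a) + entropy(b)) / 2
    if denom == 0:  # both label sequences are constant, treat as identical
        return 1.0
    return min(1.0, max(0.0, mi / denom))  # clamp floating-point noise


def cluster_learners(predictions, threshold):
    """Average-linkage agglomerative clustering on the distance 1 - NMI.

    predictions: dict mapping a (hypothetical) learner name to its predicted labels.
    Merging stops once the closest pair of clusters is farther apart than threshold.
    """
    names = list(predictions)
    dist = {
        frozenset((p, q)): 1.0 - nmi(predictions[p], predictions[q])
        for p, q in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        total = sum(dist[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Toy predictions from three hypothetical learners on eight samples.
    preds = {
        "rf_a": [0, 0, 1, 1, 0, 1, 0, 1],
        "rf_b": [0, 0, 1, 1, 0, 1, 0, 1],
        "et_a": [0, 1, 0, 1, 0, 1, 0, 1],
    }
    print(cluster_learners(preds, threshold=0.5))
```

Varying `threshold` yields coarser or finer groupings, which is one simple way to read "different granularities". Choosing how many learners of each group to keep would be a separate step; in the abstract that step is handled by TPE-based Bayesian optimization, which this sketch does not attempt.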


Source journal

Journal of Industrial Information Integration (Decision Sciences: Information Systems and Management)

CiteScore: 22.30

Self-citation rate: 13.40%

Articles published: 100

About the journal: The Journal of Industrial Information Integration focuses on the industry's transition towards industrial integration and informatization, covering not only hardware and software but also information integration. It serves as a platform for promoting advances in industrial information integration, addressing challenges, issues, and solutions in an interdisciplinary forum for researchers, practitioners, and policy makers. The journal welcomes papers on foundational, technical, and practical aspects of industrial information integration, emphasizing the complex and cross-disciplinary topics that arise in industrial integration. Techniques from mathematical science, computer science, computer engineering, electrical and electronic engineering, manufacturing engineering, and engineering management are crucial in this context.