在不完整数据的时间限制下进行对成本敏感的分类

IF 2.1 4区数学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Statistical Analysis and Data Mining Pub Date : 2024-06-25 DOI:10.1002/sam.11702

Yong‐Shiuan Lee, Chia‐Chi Wu

{"title":"在不完整数据的时间限制下进行对成本敏感的分类","authors":"Yong‐Shiuan Lee, Chia‐Chi Wu","doi":"10.1002/sam.11702","DOIUrl":null,"url":null,"abstract":"Missing values are common, but dealing with them by inappropriate method may lead to large classification errors. Empirical evidences show that the tree‐based classification algorithms such as classification and regression tree (CART) can benefit from imputation, especially multiple imputation. Nevertheless, less attention has been paid to incorporating multiple imputation into cost‐sensitive decision tree induction. This study focuses on the treatment of missing data based on a time‐constrained minimal‐cost tree algorithm. We introduce various approaches to handle incomplete data into the algorithm including complete‐case analysis, missing‐value branch, single imputation, feature acquisition, and multiple imputation. A simulation study under different scenarios examines the predictive performances of the proposed strategies. The simulation results show that the combination of the algorithm with multiple imputation can assure classification accuracy under the budget. A real medical data example provides insights into the problem of missing values in cost‐sensitive learning and the advantages of the proposed methods.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"110 1","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cost‐sensitive classification with time constraint on incomplete data\",\"authors\":\"Yong‐Shiuan Lee, Chia‐Chi Wu\",\"doi\":\"10.1002/sam.11702\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Missing values are common, but dealing with them by inappropriate method may lead to large classification errors. Empirical evidences show that the tree‐based classification algorithms such as classification and regression tree (CART) can benefit from imputation, especially multiple imputation. Nevertheless, less attention has been paid to incorporating multiple imputation into cost‐sensitive decision tree induction. This study focuses on the treatment of missing data based on a time‐constrained minimal‐cost tree algorithm. We introduce various approaches to handle incomplete data into the algorithm including complete‐case analysis, missing‐value branch, single imputation, feature acquisition, and multiple imputation. A simulation study under different scenarios examines the predictive performances of the proposed strategies. The simulation results show that the combination of the algorithm with multiple imputation can assure classification accuracy under the budget. A real medical data example provides insights into the problem of missing values in cost‐sensitive learning and the advantages of the proposed methods.\",\"PeriodicalId\":48684,\"journal\":{\"name\":\"Statistical Analysis and Data Mining\",\"volume\":\"110 1\",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Statistical Analysis and Data Mining\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1002/sam.11702\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Analysis and Data Mining","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1002/sam.11702","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

缺失值很常见，但用不恰当的方法处理缺失值可能会导致很大的分类误差。经验证据表明，基于树的分类算法，如分类和回归树（CART），可以从估算中获益，尤其是多重估算。然而，将多重归因纳入成本敏感决策树归纳法的研究却较少受到关注。本研究的重点是基于时间受限的最小成本树算法处理缺失数据。我们在算法中引入了多种处理不完整数据的方法，包括完整案例分析、缺失值分支、单一归因、特征获取和多重归因。在不同场景下进行的模拟研究检验了所提策略的预测性能。仿真结果表明，算法与多重归因的结合可以在预算范围内确保分类准确性。通过一个真实的医疗数据实例，我们可以深入了解成本敏感学习中的缺失值问题以及所提方法的优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Cost‐sensitive classification with time constraint on incomplete data

Missing values are common, but dealing with them by inappropriate method may lead to large classification errors. Empirical evidences show that the tree‐based classification algorithms such as classification and regression tree (CART) can benefit from imputation, especially multiple imputation. Nevertheless, less attention has been paid to incorporating multiple imputation into cost‐sensitive decision tree induction. This study focuses on the treatment of missing data based on a time‐constrained minimal‐cost tree algorithm. We introduce various approaches to handle incomplete data into the algorithm including complete‐case analysis, missing‐value branch, single imputation, feature acquisition, and multiple imputation. A simulation study under different scenarios examines the predictive performances of the proposed strategies. The simulation results show that the combination of the algorithm with multiple imputation can assure classification accuracy under the budget. A real medical data example provides insights into the problem of missing values in cost‐sensitive learning and the advantages of the proposed methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Statistical Analysis and Data Mining COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCEC-COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

CiteScore

3.20

自引率

7.70%

发文量

期刊介绍： Statistical Analysis and Data Mining addresses the broad area of data analysis, including statistical approaches, machine learning, data mining, and applications. Topics include statistical and computational approaches for analyzing massive and complex datasets, novel statistical and/or machine learning methods and theory, and state-of-the-art applications with high impact. Of special interest are articles that describe innovative analytical techniques, and discuss their application to real problems, in such a way that they are accessible and beneficial to domain experts across science, engineering, and commerce. The focus of the journal is on papers which satisfy one or more of the following criteria: Solve data analysis problems associated with massive, complex datasets Develop innovative statistical approaches, machine learning algorithms, or methods integrating ideas across disciplines, e.g., statistics, computer science, electrical engineering, operation research. Formulate and solve high-impact real-world problems which challenge existing paradigms via new statistical and/or computational models Provide survey to prominent research topics.