Title: Learning Balanced Bayesian Classifiers From Labeled and Unlabeled Data
Authors: Lu Guo; Limin Wang; Qilong Li; Kuo Li
Journal: IEEE Transactions on Big Data, vol. 10, no. 4, pp. 330-342 (JCR Q1, Computer Science, Information Systems; impact factor 7.5)
Published: 2023-11-30
DOI: 10.1109/TBDATA.2023.3338019
URL: https://ieeexplore.ieee.org/document/10336381/
Citation count: 0
Abstract
Training learners on unbalanced data with asymmetric costs is recognized as one of the most significant challenges in data mining. The Bayesian network classifier (BNC) is a powerful probabilistic tool for encoding the dependencies among random variables in a directed acyclic graph (DAG), but unbalanced data produce an unbalanced network topology. This leads to a biased estimate of the conditional or joint probability distribution and, ultimately, to reduced classification accuracy. To address this issue, we propose redefining the information-theoretic metrics so that they uniformly represent the balanced dependencies between attributes or between attribute values. A heuristic search strategy and a thresholding operation are then introduced to learn refined DAGs from labeled and unlabeled data, respectively. Experimental results on 32 benchmark datasets show that the proposed highly scalable algorithm is competitive with or superior to a number of state-of-the-art single and ensemble learners.
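For context, a minimal sketch of the kind of information-theoretic metric that BNC structure learners such as tree-augmented naive Bayes conventionally use: the empirical conditional mutual information I(Xi; Xj | C) between two attributes given the class. The abstract does not give the paper's redefined, balanced metrics, so this shows only the standard, unmodified quantity; the function name and the toy data are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(xi, xj, c):
    """Empirical I(Xi; Xj | C) in bits for three equal-length discrete sequences.

    Standard (unbalanced) definition; NOT the redefined metric from the paper.
    """
    n = len(c)
    if n == 0 or len(xi) != n or len(xj) != n:
        raise ValueError("xi, xj and c must be non-empty and of equal length")

    # Joint and marginal counts, all conditioned on the class value.
    n_abc = Counter(zip(xi, xj, c))
    n_ac = Counter(zip(xi, c))
    n_bc = Counter(zip(xj, c))
    n_c = Counter(c)

    cmi = 0.0
    for (a, b, k), count in n_abc.items():
        # p(a,b,c) * log2[ p(a,b|c) / (p(a|c) * p(b|c)) ], written with raw counts.
        ratio = count * n_c[k] / (n_ac[(a, k)] * n_bc[(b, k)])
        cmi += (count / n) * log2(ratio)
    return cmi


if __name__ == "__main__":
    # Toy data (hypothetical): class labels and two binary attributes.
    labels = [0, 0, 1, 1]
    attr_a = [0, 1, 0, 1]
    print(conditional_mutual_information(attr_a, attr_a, labels))
```

Because every term is weighted by the joint frequency p(a, b, c), a majority class contributes most of the sum, which is one way to see why such metrics can favor dependencies present in the dominant class when the data are unbalanced.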
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.