{"title":"Optimal Subdata Selection for Prediction Based on the Distribution of the Covariates","authors":"Alvaro Cia-Mina;Jesus Lopez-Fidalgo;Weng Kee Wong","doi":"10.1109/TBDATA.2025.3552343","DOIUrl":null,"url":null,"abstract":"Huge data sets are widely available now and there is growing interest in selecting an optimal subsample from the full data set to improve inference efficiency and reduce labeling costs. We propose a new criterion called J–optimality, that builds upon a popular optimal selection criterion that minimizes the Random–X prediction error by additionally incorporating the joint distribution of the covariates. A key advantage of our approach is that we can relate the subsampling selection problem to that of finding an optimal approximate design under a convex criterion, where analytical tools for finding and studying them are already available. Consequently, the J–optimal subsampling method comes with theoretical results and theory-based algorithms for finding them. Simulation results and real data analysis show our proposed methods outperform current subsampling methods and the proposed algorithms can also adapt efficiently to select an optimal subsample from streaming data.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 5","pages":"2601-2614"},"PeriodicalIF":5.7000,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10930599","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Big Data","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10930599/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Huge data sets are widely available now and there is growing interest in selecting an optimal subsample from the full data set to improve inference efficiency and reduce labeling costs. We propose a new criterion called J–optimality, that builds upon a popular optimal selection criterion that minimizes the Random–X prediction error by additionally incorporating the joint distribution of the covariates. A key advantage of our approach is that we can relate the subsampling selection problem to that of finding an optimal approximate design under a convex criterion, where analytical tools for finding and studying them are already available. Consequently, the J–optimal subsampling method comes with theoretical results and theory-based algorithms for finding them. Simulation results and real data analysis show our proposed methods outperform current subsampling methods and the proposed algorithms can also adapt efficiently to select an optimal subsample from streaming data.
期刊介绍:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.