Mastering rare event analysis: subsample-size determination in Cox and logistic regressions
Tal Agassi, Nir Keret, Malka Gorfine
Biometrics 81(3), published 2025-07-03. DOI: 10.1093/biomtc/ujaf110 (https://doi.org/10.1093/biomtc/ujaf110)
In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the subsample size. To bridge this gap, our work introduces tools designed for choosing the subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. Additionally, we present a new optimal subsampling procedure tailored to logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets: survival analysis of UK Biobank colorectal cancer data with about 350 million rows and logistic regression of linked birth and infant death data with about 28 million observations.
About the journal:
The International Biometric Society is an international society promoting the development and application of statistical and mathematical theory and methods in the biosciences, including agriculture, biomedical science and public health, ecology, environmental sciences, forestry, and allied disciplines. The Society welcomes as members statisticians, mathematicians, biological scientists, and others devoted to interdisciplinary efforts in advancing the collection and interpretation of information in the biosciences. The Society sponsors the biennial International Biometric Conference, held at sites throughout the world; through its National Groups and Regions, it also sponsors regional and local meetings.