与hdPS框架中的Bross公式相比，使用多元统计或机器学习方法进行偏差和方差估计是否有竞争优势？

IF 2.6 3区综合性期刊 Q1 MULTIDISCIPLINARY SCIENCES

PLoS ONE Pub Date : 2025-05-28 eCollection Date: 2025-01-01 DOI:10.1371/journal.pone.0324639

Mohammad Ehsanul Karim, Yang Lei

{"title":"与hdPS框架中的Bross公式相比，使用多元统计或机器学习方法进行偏差和方差估计是否有竞争优势？","authors":"Mohammad Ehsanul Karim, Yang Lei","doi":"10.1371/journal.pone.0324639","DOIUrl":null,"url":null,"abstract":"Purpose: We aim to evaluate various proxy selection methods within the context of high-dimensional propensity score (hdPS) analysis. This study aimed to systematically evaluate and compare the performance of traditional statistical methods and machine learning approaches within the hdPS framework, focusing on key metrics such as bias, standard error (SE), and coverage, under various exposure and outcome prevalence scenarios.Methods: We conducted a plasmode simulation study using data from the National Health and Nutrition Examination Survey (NHANES) cycles from 2013 to 2018. We compared methods including the kitchen sink model, Bross-based hdPS, Hybrid hdPS, LASSO, Elastic Net, Random Forest, XGBoost, and Genetic Algorithm (GA). The performance of each inverse probability weighted method was assessed based on bias, MSE, coverage probability, and SE estimation across three epidemiological scenarios: frequent exposure and outcome, rare exposure and frequent outcome, and frequent exposure and rare outcome.Results: XGBoost consistently demonstrated strong performance in terms of MSE and coverage, making it effective for scenarios prioritizing precision. However, it exhibited higher bias, particularly in rare exposure scenarios, suggesting it is less suited when minimizing bias is critical. In contrast, GA showed significant limitations, with consistently high bias and MSE, making it the least reliable method. Bross-based hdPS, and Hybrid hdPS methods provided a balanced approach, with low bias and moderate MSE, though coverage varied depending on the scenario. Rare outcome scenarios generally resulted in lower MSE and better precision, while rare exposure scenarios were associated with higher bias and MSE. Notably, traditional statistical approaches such as forward selection and backward elimination performed comparably to more sophisticated machine learning methods in terms of bias and coverage, suggesting that these simpler approaches may be viable alternatives due to their computational efficiency.Conclusion: The results highlight the importance of selecting hdPS methods based on the specific characteristics of the data, such as exposure and outcome prevalence. While advanced machine learning methods such as XGBoost can enhance precision, simpler methods such as forward selection or backward elimination may offer similar performance in terms of bias and coverage with fewer computational demands. Tailoring the choice of method to the epidemiological scenario is essential for optimizing the balance between bias reduction and precision.","PeriodicalId":20189,"journal":{"name":"PLoS ONE","volume":"20 5","pages":"e0324639"},"PeriodicalIF":2.6000,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12118903/pdf/","citationCount":"0","resultStr":"{\"title\":\"Is there a competitive advantage to using multivariate statistical or machine learning methods over the Bross formula in the hdPS framework for bias and variance estimation?\",\"authors\":\"Mohammad Ehsanul Karim, Yang Lei\",\"doi\":\"10.1371/journal.pone.0324639\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Purpose: We aim to evaluate various proxy selection methods within the context of high-dimensional propensity score (hdPS) analysis. This study aimed to systematically evaluate and compare the performance of traditional statistical methods and machine learning approaches within the hdPS framework, focusing on key metrics such as bias, standard error (SE), and coverage, under various exposure and outcome prevalence scenarios.Methods: We conducted a plasmode simulation study using data from the National Health and Nutrition Examination Survey (NHANES) cycles from 2013 to 2018. We compared methods including the kitchen sink model, Bross-based hdPS, Hybrid hdPS, LASSO, Elastic Net, Random Forest, XGBoost, and Genetic Algorithm (GA). The performance of each inverse probability weighted method was assessed based on bias, MSE, coverage probability, and SE estimation across three epidemiological scenarios: frequent exposure and outcome, rare exposure and frequent outcome, and frequent exposure and rare outcome.Results: XGBoost consistently demonstrated strong performance in terms of MSE and coverage, making it effective for scenarios prioritizing precision. However, it exhibited higher bias, particularly in rare exposure scenarios, suggesting it is less suited when minimizing bias is critical. In contrast, GA showed significant limitations, with consistently high bias and MSE, making it the least reliable method. Bross-based hdPS, and Hybrid hdPS methods provided a balanced approach, with low bias and moderate MSE, though coverage varied depending on the scenario. Rare outcome scenarios generally resulted in lower MSE and better precision, while rare exposure scenarios were associated with higher bias and MSE. Notably, traditional statistical approaches such as forward selection and backward elimination performed comparably to more sophisticated machine learning methods in terms of bias and coverage, suggesting that these simpler approaches may be viable alternatives due to their computational efficiency.Conclusion: The results highlight the importance of selecting hdPS methods based on the specific characteristics of the data, such as exposure and outcome prevalence. While advanced machine learning methods such as XGBoost can enhance precision, simpler methods such as forward selection or backward elimination may offer similar performance in terms of bias and coverage with fewer computational demands. Tailoring the choice of method to the epidemiological scenario is essential for optimizing the balance between bias reduction and precision.\",\"PeriodicalId\":20189,\"journal\":{\"name\":\"PLoS ONE\",\"volume\":\"20 5\",\"pages\":\"e0324639\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2025-05-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12118903/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"PLoS ONE\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://doi.org/10.1371/journal.pone.0324639\",\"RegionNum\":3,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"MULTIDISCIPLINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLoS ONE","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1371/journal.pone.0324639","RegionNum":3,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}

引用次数: 0

摘要

目的：我们的目的是评估各种代理选择方法在高维倾向评分（hdPS）分析的背景下。本研究旨在系统地评估和比较hdPS框架内传统统计方法和机器学习方法的性能，重点关注不同暴露和结果流行情景下的偏差、标准误差（SE）和覆盖率等关键指标。方法：利用2013年至2018年国家健康与营养检查调查（NHANES）周期的数据进行plasmode模拟研究。我们比较了厨房水槽模型、基于bross的hdPS、Hybrid hdPS、LASSO、Elastic Net、Random Forest、XGBoost和遗传算法（GA）等方法。根据三种流行病学情景（频繁暴露和结果、罕见暴露和频繁结果、频繁暴露和罕见结果）的偏倚、MSE、覆盖概率和SE估计，对每种逆概率加权方法的性能进行评估。结果：XGBoost在MSE和覆盖率方面始终表现出强大的性能，使其在优先考虑精度的场景中有效。然而，它表现出更高的偏差，特别是在罕见的暴露情况下，这表明当最小化偏差至关重要时，它不太适合。相比之下，遗传算法显示出显著的局限性，具有一贯的高偏倚和MSE，使其成为最不可靠的方法。基于兄弟的hdPS和混合hdPS方法提供了一种平衡的方法，具有低偏差和中等MSE，尽管覆盖范围因场景而异。罕见的结局情景通常导致较低的MSE和较好的精度，而罕见的暴露情景与较高的偏倚和MSE相关。值得注意的是，传统的统计方法，如前向选择和后向消除，在偏差和覆盖方面的表现与更复杂的机器学习方法相当，这表明这些更简单的方法可能是可行的替代方案，因为它们的计算效率更高。结论：该结果强调了根据数据的具体特征（如暴露和结局患病率）选择hdPS方法的重要性。虽然先进的机器学习方法（如XGBoost）可以提高精度，但更简单的方法（如正向选择或向后消除）可能在偏差和覆盖范围方面提供类似的性能，并且计算需求更少。根据流行病学情景调整方法选择对于优化减少偏倚和精确之间的平衡至关重要。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Is there a competitive advantage to using multivariate statistical or machine learning methods over the Bross formula in the hdPS framework for bias and variance estimation?

查看原文本刊更多论文

Is there a competitive advantage to using multivariate statistical or machine learning methods over the Bross formula in the hdPS framework for bias and variance estimation?

Purpose: We aim to evaluate various proxy selection methods within the context of high-dimensional propensity score (hdPS) analysis. This study aimed to systematically evaluate and compare the performance of traditional statistical methods and machine learning approaches within the hdPS framework, focusing on key metrics such as bias, standard error (SE), and coverage, under various exposure and outcome prevalence scenarios.

Methods: We conducted a plasmode simulation study using data from the National Health and Nutrition Examination Survey (NHANES) cycles from 2013 to 2018. We compared methods including the kitchen sink model, Bross-based hdPS, Hybrid hdPS, LASSO, Elastic Net, Random Forest, XGBoost, and Genetic Algorithm (GA). The performance of each inverse probability weighted method was assessed based on bias, MSE, coverage probability, and SE estimation across three epidemiological scenarios: frequent exposure and outcome, rare exposure and frequent outcome, and frequent exposure and rare outcome.

Results: XGBoost consistently demonstrated strong performance in terms of MSE and coverage, making it effective for scenarios prioritizing precision. However, it exhibited higher bias, particularly in rare exposure scenarios, suggesting it is less suited when minimizing bias is critical. In contrast, GA showed significant limitations, with consistently high bias and MSE, making it the least reliable method. Bross-based hdPS, and Hybrid hdPS methods provided a balanced approach, with low bias and moderate MSE, though coverage varied depending on the scenario. Rare outcome scenarios generally resulted in lower MSE and better precision, while rare exposure scenarios were associated with higher bias and MSE. Notably, traditional statistical approaches such as forward selection and backward elimination performed comparably to more sophisticated machine learning methods in terms of bias and coverage, suggesting that these simpler approaches may be viable alternatives due to their computational efficiency.

Conclusion: The results highlight the importance of selecting hdPS methods based on the specific characteristics of the data, such as exposure and outcome prevalence. While advanced machine learning methods such as XGBoost can enhance precision, simpler methods such as forward selection or backward elimination may offer similar performance in terms of bias and coverage with fewer computational demands. Tailoring the choice of method to the epidemiological scenario is essential for optimizing the balance between bias reduction and precision.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

PLoS ONE 生物-生物学

CiteScore

6.20

自引率

5.40%

发文量

14242

审稿时长

3.7 months

期刊介绍： PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides: * Open-access—freely accessible online, authors retain copyright * Fast publication times * Peer review by expert, practicing researchers * Post-publication tools to indicate quality and impact * Community-based dialogue on articles * Worldwide media coverage