Robust Transfer Learning for High-Dimensional GLM Using γ-Divergence With Applications to Cancer Genomics.

IF 1.8 · Zone 4 (Medicine) · JCR Q3, Mathematical & Computational Biology

Statistics in Medicine Pub Date : 2025-07-01 DOI:10.1002/sim.70170

Fuzhi Xu, Shuangge Ma, Qingzhao Zhang, Yaqing Xu

Statistics in Medicine, vol. 44, no. 15-17, e70170 (journal article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12313224/pdf/

Citations: 0

Abstract

In the analysis of complex diseases, high-dimensional profiling data are important for assessing risk and detecting biomarkers. Although cancer genomic data are increasingly accessible, sample sizes remain limited in most studies, so borrowing information from additional data sources is desirable to improve estimation and prediction. Transfer learning has proven flexible and effective in boosting modeling performance, with a track record in biomedical applications. However, existing transfer learning methods often lack robustness to outliers and data contamination, both of which are common in real-world biomedical data. In this study, we propose a robust transfer learning approach based on the minimum $\gamma$-divergence under a generalized linear model (GLM) framework for high-dimensional data. Our method incorporates a data-driven source detection scheme that automatically identifies informative sources while mitigating the risk of negative transfer. We establish rigorous theoretical results, including consistency and high-dimensional estimation error bounds, that support its robustness and reliable performance. A computationally efficient algorithm based on proximal gradient descent is developed for both the transfer and debiasing steps. Simulations demonstrate the superior or competitive performance of the proposed approach in selection and prediction/classification. We further validate its practical utility by analyzing data on breast cancer and glioblastoma, showcasing the method's effectiveness in real-world high-dimensional settings.
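The abstract names two building blocks, a minimum γ-divergence loss and proximal gradient descent, without giving formulas. The sketch below is an illustration only, not the authors' algorithm: it assumes a logistic GLM, uses one commonly cited form of the γ-divergence loss for regression, adds a lasso penalty, and omits the transfer, source detection, and debiasing steps entirely. All function names (`gamma_loss_and_grad`, `soft_threshold`, `fit_gamma_lasso`) and default values are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gamma_loss_and_grad(beta, X, y, gamma):
    """Assumed gamma-divergence loss for logistic regression, with its gradient.

    loss = -(1/gamma) * log(mean_i f_i^gamma)
           + (1/(1+gamma)) * log(mean_i [p_i^(1+gamma) + (1-p_i)^(1+gamma)])
    where f_i is the model probability of the observed label y_i.
    """
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    f = np.where(y == 1, p, 1.0 - p)
    fg = f ** gamma
    s = p ** (1 + gamma) + (1 - p) ** (1 + gamma)
    loss = -np.log(fg.mean()) / gamma + np.log(s.mean()) / (1 + gamma)
    # First term: score contributions reweighted by f_i^gamma, so observations
    # the model finds implausible (possible outliers) get small weight.
    w = fg / fg.sum()
    grad_fit = -X.T @ (w * (y - p))
    # Second term: gradient of the normalising part of the loss.
    ds = p ** (1 + gamma) * (1 - p) - (1 - p) ** (1 + gamma) * p
    grad_norm = X.T @ ds / s.sum()
    return loss, grad_fit + grad_norm


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_gamma_lasso(X, y, gamma=0.5, lam=0.05, step=1.0, n_iter=200):
    """Proximal gradient descent with step halving (hypothetical defaults)."""
    def objective(b):
        return gamma_loss_and_grad(b, X, y, gamma)[0] + lam * np.abs(b).sum()

    beta = np.zeros(X.shape[1])
    obj = objective(beta)
    for _ in range(n_iter):
        _, g = gamma_loss_and_grad(beta, X, y, gamma)
        t = step
        while t > 1e-8:
            cand = soft_threshold(beta - t * g, t * lam)
            cand_obj = objective(cand)
            if cand_obj <= obj:  # accept only steps that do not increase the objective
                break
            t *= 0.5
        else:
            break  # no acceptable step size found; stop
        converged = np.max(np.abs(cand - beta)) < 1e-8
        beta, obj = cand, cand_obj
        if converged:
            break
    return beta
```

Because this loss is non-convex, the sketch halves the step until the penalised objective does not increase, which keeps the iterates monotone but does not guarantee a global minimum. In a transfer setting one would presumably fit a pooled estimate on target plus selected source data and then correct it on the target data, but those details are in the paper, not in this abstract.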


Source journal: Statistics in Medicine (Medicine: Public, Environmental & Occupational Health)

CiteScore: 3.40

Self-citation rate: 10.00%

Articles published: 334

Review time: 2-4 weeks

About the journal: The journal aims to influence practice in medicine and its associated sciences through the publication of papers on statistical and other quantitative methods. Papers will explain new methods and demonstrate their application, preferably through a substantive, real, motivating example or a comprehensive evaluation based on an illustrative example. Alternatively, papers will report on case studies where creative use or technical generalization of established methodology is directed towards a substantive application. Reviews of, and tutorials on, general topics relevant to the application of statistics to medicine will also be published. The main criteria for publication are appropriateness of the statistical methods to a particular medical problem and clarity of exposition. Papers with primarily mathematical content will be excluded. The journal aims to enhance communication between statisticians, clinicians and medical researchers.