Distributed non-disclosive validation of predictive models by a modified ROC-GLM
Daniel Schalk, Raphael Rehms, Verena S Hoffmann, Bernd Bischl, Ulrich Mansmann
BMC Medical Research Methodology 24(1):190, published 2024-08-29
DOI: 10.1186/s12874-024-02312-4
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11363434/pdf/
Citations: 0
Abstract
Background: Distributed statistical analyses provide a promising approach for privacy protection when analyzing data distributed over several databases. Instead of operating directly on the data, the analyst receives anonymous summary statistics, which are combined into an aggregated result. Further, in the development of discrimination models (prognosis, diagnosis, etc.), it is key to evaluate a trained model with respect to its prognostic or predictive performance on new independent data. For binary classification, discrimination is quantified by the receiver operating characteristic (ROC) curve and its area under the curve (AUC) as an aggregate measure. We aim to calculate both, as well as basic indicators of calibration-in-the-large, for a binary classification task using a distributed and privacy-preserving approach.
Methods: We employ DataSHIELD as the technology to carry out distributed analyses, and we use a newly developed algorithm to validate the prediction score by conducting a distributed and privacy-preserving ROC analysis. Calibration curves are constructed from mean values over sites. The determination of the ROC curve and its AUC is based on a generalized linear model (GLM) approximation of the true ROC curve, the ROC-GLM, as well as on ideas of differential privacy (DP). DP adds noise (quantified by the ℓ₂ sensitivity Δ₂(f̂)) to the data and enables a global handling of placement numbers. The impact of the DP parameters was studied by simulations.
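The binormal model underlying the ROC-GLM can be sketched in a few lines. The snippet below is only an illustration: it uses a simple moment estimator for the binormal parameters in place of the paper's distributed probit-GLM fit, and all function and variable names are assumptions.

```python
import numpy as np
from statistics import NormalDist

_nd = NormalDist()  # standard normal: cdf = Phi, inv_cdf = Phi^{-1}

def binormal_roc(y, s, grid=None):
    """Binormal ROC model, ROC(t) = Phi(a + b * Phi^{-1}(t)), which the
    ROC-GLM approximates with a probit-link GLM.  Here a and b come from
    a moment estimator, not from the distributed GLM fit."""
    pos, neg = s[y == 1], s[y == 0]
    a = (pos.mean() - neg.mean()) / pos.std(ddof=1)
    b = neg.std(ddof=1) / pos.std(ddof=1)
    t = np.linspace(1e-3, 1 - 1e-3, 99) if grid is None else grid
    roc = np.array([_nd.cdf(a + b * _nd.inv_cdf(ti)) for ti in t])
    auc = _nd.cdf(a / np.sqrt(1.0 + b ** 2))  # closed-form binormal AUC
    return t, roc, auc

# Scores drawn from N(1, 1) for cases and N(0, 1) for controls have a
# true binormal AUC of Phi(1 / sqrt(2)), roughly 0.76.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 5000)
s = np.concatenate([rng.normal(0, 1, 5000), rng.normal(1, 1, 5000)])
t, roc, auc = binormal_roc(y, s)
```

The closed-form AUC is what makes a GLM approximation of the ROC curve attractive in a distributed setting: only the two fitted parameters need to leave each site, not individual-level data.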
Results: In our simulation scenario, the true and distributed AUC measures differ by ΔAUC < 0.01, depending heavily on the choice of the differential privacy parameters. It is recommended to check the accuracy of the distributed AUC estimator in specific simulation scenarios along with a reasonable choice of DP parameters. Here, the accuracy of the distributed AUC estimator may be impaired by too much artificial noise added by DP.
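The kind of accuracy check recommended above can be run in a few lines. The snippet below is a pooled-data illustration only: the sensitivity value, the (ε, δ) parameters, and the Gaussian-mechanism noise are assumptions, not the paper's settings.

```python
import numpy as np

def auc_mann_whitney(y, s):
    """Nonparametric AUC estimate (Mann-Whitney / c-statistic)."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

rng = np.random.default_rng(42)
n = 1000
y = np.repeat([0, 1], n)
s = np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, n)])

# Gaussian mechanism: the noise scale grows with the l2 sensitivity.
sensitivity, epsilon, delta = 0.01, 0.5, 1e-5  # assumed DP parameters
sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
s_noisy = s + rng.normal(0, sigma, s.shape)

# Discrepancy between the exact AUC and the AUC on DP-noised scores;
# increasing `sensitivity` quickly inflates it.
delta_auc = abs(auc_mann_whitney(y, s) - auc_mann_whitney(y, s_noisy))
```

Rerunning the comparison over a grid of sensitivity and (ε, δ) values gives exactly the kind of scenario-specific accuracy check the abstract recommends.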
Conclusions: The applicability of our algorithms depends on the ℓ₂ sensitivity Δ₂(f̂) of the underlying statistical/predictive model. The simulations carried out have shown that the approximation error is acceptable for the majority of simulated cases. For models with a high Δ₂(f̂), the privacy parameters must be set accordingly higher to ensure sufficient privacy protection, which affects the approximation error. This work shows that complex measures, such as the AUC, are applicable for validation in distributed setups while preserving an individual's privacy.
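The linear coupling between Δ₂(f̂) and the required noise can be made explicit with the classic Gaussian mechanism; the formula below is the standard textbook calibration, not necessarily the exact variant used in the paper.

```python
import math

def gaussian_noise_scale(l2_sensitivity, epsilon, delta):
    """Noise standard deviation of the classic Gaussian mechanism:
    sigma = Delta2(f) * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

# Doubling the sensitivity doubles the noise needed for the same
# (epsilon, delta) guarantee; recovering the original noise level then
# requires doubling epsilon, i.e. weakening the privacy guarantee.
```

This is why a model with a high Δ₂(f̂) forces a trade-off: either more noise (a worse AUC approximation) or weaker privacy parameters.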
Journal description
BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.