Accuracy-fairness trade-off in ML for healthcare: A quantitative evaluation of bias mitigation strategies

IF 4.3 · CAS Zone 2 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Farzaneh Dehghani, Pedro Paiva, Nikita Malik, Joanna Lin, Sayeh Bayat, Mariana Bento
Citations: 0

Abstract


Context:

Although machine learning (ML) has significant potential to improve healthcare decision-making, embedded biases in algorithms and datasets risk exacerbating health disparities across demographic groups. To address this challenge, it is essential to rigorously evaluate bias mitigation strategies to ensure fairness and reliability across patient populations.

Objective:

The aim of this research is to propose a comprehensive evaluation framework that systematically assesses a wide range of bias mitigation techniques at pre-processing, in-processing, and post-processing stages, using both single- and multi-stage intervention approaches.
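Of the three stages, post-processing methods operate only on a trained model's outputs. As a rough illustration of this stage, the core idea behind Reject Option Classification (one of the post-processing techniques evaluated in this study) can be sketched as follows; the threshold, margin, and 0/1 group encoding are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def reject_option_classification(scores, group, threshold=0.5, margin=0.1):
    """Post-processing sketch: predictions inside the low-confidence band
    [threshold - margin, threshold + margin] are reassigned, giving the
    favorable label (1) to the unprivileged group (coded 0) and the
    unfavorable label (0) to the privileged group (coded 1).
    Outside the band, ordinary thresholding applies."""
    scores = np.asarray(scores, dtype=float)
    group = np.asarray(group)
    y_pred = (scores >= threshold).astype(int)
    in_band = np.abs(scores - threshold) <= margin
    y_pred[in_band & (group == 0)] = 1
    y_pred[in_band & (group == 1)] = 0
    return y_pred

scores = np.array([0.45, 0.55, 0.90, 0.10])  # model confidence scores
group = np.array([0, 1, 1, 0])               # 0 = unprivileged, 1 = privileged
print(reject_option_classification(scores, group))  # -> [1 0 1 0]
```

Only the two borderline scores (0.45 and 0.55) are flipped; confident predictions are left untouched, which is why post-processing methods of this kind can buy fairness at a small accuracy cost.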

Methods:

This study evaluates bias mitigation strategies across three clinical prediction tasks: breast cancer diagnosis, stroke prediction, and Alzheimer’s disease detection. Our evaluation employs group- and individual-level fairness metrics, contextualized for specific sensitive attributes relevant to each dataset. Beyond fairness-accuracy trade-offs, we demonstrate how metric selection must align with clinical goals (e.g., parity metrics for equitable access, confusion-matrix metrics for diagnostics).
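Group-level metrics of both kinds mentioned above can be computed directly from model predictions. The sketch below (plain NumPy on toy data; taking group 0 as the unprivileged group is an illustrative convention, not a detail from the paper) shows one parity metric and one confusion-matrix metric:

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Parity metric: P(y_hat = 1 | unprivileged) - P(y_hat = 1 | privileged)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Confusion-matrix metric: true-positive-rate gap between groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def tpr(g):
        return y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(0) - tpr(1)

y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(statistical_parity_difference(y_pred, group))         # -> -0.5
print(equal_opportunity_difference(y_true, y_pred, group))  # -> -0.5
```

A value of 0 indicates parity between groups; the sign shows which group is disadvantaged.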

Results:

Our results reinforce that no single classifier or mitigation strategy is universally optimal, underscoring the value of our proposed framework for evaluating fairness and accuracy throughout the bias mitigation process. According to the results, Adversarial Debiasing improved fairness by 95% in breast cancer diagnosis without compromising accuracy. Reweighing was most effective in stroke prediction, boosting fairness by 41%, and Reject Option Classification yielded nearly 50% fairness improvement in Alzheimer’s detection. Multi-stage bias mitigation did not consistently lead to better outcomes, and in many cases, fairness gains came at the expense of accuracy.
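Reweighing, the pre-processing method that worked best here for stroke prediction, assigns each training sample a weight so that the sensitive attribute and the label become statistically independent in the weighted data. A minimal sketch of the standard Kamiran-Calders weighting scheme (toy data, not the authors' exact configuration; it assumes every (group, label) combination occurs at least once):

```python
import numpy as np

def reweighing_weights(y, group):
    """Kamiran-Calders reweighing: w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y).
    Under-represented (group, label) combinations get weights above 1,
    over-represented ones below 1."""
    y, group = np.asarray(y), np.asarray(group)
    w = np.empty(len(y), dtype=float)
    for a in np.unique(group):
        for label in np.unique(y):
            mask = (group == a) & (y == label)
            p_joint = mask.mean()  # assumed nonzero for every combination
            w[mask] = (group == a).mean() * (y == label).mean() / p_joint
    return w

y = np.array([1, 1, 1, 0, 1, 0, 0, 0])      # outcome labels
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # sensitive attribute
w = reweighing_weights(y, group)
# Weighted positive rates are now equal across groups:
for g in (0, 1):
    m = group == g
    print((w[m] * y[m]).sum() / w[m].sum())  # -> 0.5 for both groups
```

The weights are then passed to any classifier that accepts per-sample weights (e.g. a `sample_weight` argument), leaving the data itself unchanged.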

Conclusion:

These findings provide practical guidance for selecting fairness-aware machine learning strategies in healthcare, aiding both model development and benchmarking across diverse clinical applications.
Source journal
Information and Software Technology (Engineering/Technology - Computer Science, Software Engineering)
CiteScore: 9.10
Self-citation rate: 7.70%
Articles per year: 164
Review time: 9.6 weeks
Journal description: Information and Software Technology is the international archival journal focusing on research and experience that contribute to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and more; see the Guide for Authors for details. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within its scope, and is the premier outlet for systematic literature studies in software engineering.