{"title":"Towards Fairer Health Recommendations: finding informative unbiased samples via Word Sense Disambiguation","authors":"Gavin Butts, Pegah Emdad, Jethro Lee, Shannon Song, Chiman Salavati, Willmar Sosa Diaz, Shiri Dori-Hacohen, Fabricio Murai","doi":"arxiv-2409.07424","DOIUrl":null,"url":null,"abstract":"There have been growing concerns around high-stake applications that rely on\nmodels trained with biased data, which consequently produce biased predictions,\noften harming the most vulnerable. In particular, biased medical data could\ncause health-related applications and recommender systems to create outputs\nthat jeopardize patient care and widen disparities in health outcomes. A recent\nframework titled Fairness via AI posits that, instead of attempting to correct\nmodel biases, researchers must focus on their root causes by using AI to debias\ndata. Inspired by this framework, we tackle bias detection in medical curricula\nusing NLP models, including LLMs, and evaluate them on a gold standard dataset\ncontaining 4,105 excerpts annotated by medical experts for bias from a large\ncorpus. We build on previous work by coauthors which augments the set of\nnegative samples with non-annotated text containing social identifier terms.\nHowever, some of these terms, especially those related to race and ethnicity,\ncan carry different meanings (e.g., \"white matter of spinal cord\"). To address\nthis issue, we propose the use of Word Sense Disambiguation models to refine\ndataset quality by removing irrelevant sentences. We then evaluate fine-tuned\nvariations of BERT models as well as GPT models with zero- and few-shot\nprompting. We found LLMs, considered SOTA on many NLP tasks, unsuitable for\nbias detection, while fine-tuned BERT models generally perform well across all\nevaluated metrics.","PeriodicalId":501112,"journal":{"name":"arXiv - CS - Computers and Society","volume":"157 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computers and Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07424","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
There have been growing concerns around high-stakes applications that rely on models trained with biased data and that consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold-standard dataset of 4,105 excerpts drawn from a large corpus and annotated for bias by medical experts. We build on previous work by coauthors, which augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry meanings unrelated to social identity (e.g., "white matter of spinal cord"). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variations of BERT models as well as GPT models with zero- and few-shot prompting. We find that LLMs, considered SOTA on many NLP tasks, are unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
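
To make the Word Sense Disambiguation filtering idea concrete, the sketch below shows one way such a filter could look. It is not the authors' pipeline: it uses NLTK's classic Lesk algorithm over WordNet, and the ambiguous-term list and sense-cue keywords are hypothetical placeholders chosen for illustration.

```python
# Illustrative sketch (not the paper's actual pipeline): drop candidate negative
# samples in which a race/ethnicity term is used in a non-social sense
# (e.g., "white" in "white matter of the spinal cord").
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

# Hypothetical subset of identifier terms with ambiguous clinical meanings.
AMBIGUOUS_TERMS = {"white", "black"}
# Illustrative heuristic: cues suggesting the disambiguated sense refers to people.
SOCIAL_SENSE_CUES = {"race", "person", "people", "ethnic", "descent"}

def uses_social_sense(sentence: str, term: str) -> bool:
    """Return True if Lesk assigns `term` a people-related WordNet sense."""
    tokens = sentence.lower().split()
    synset = lesk(tokens, term)  # picks the sense whose gloss best overlaps the context
    if synset is None:
        return False
    definition = synset.definition().lower()
    return any(cue in definition for cue in SOCIAL_SENSE_CUES)

def filter_negative_samples(sentences: list[str]) -> list[str]:
    """Keep sentences with no ambiguous term, or where the term is a social use."""
    kept = []
    for sent in sentences:
        terms = [t for t in AMBIGUOUS_TERMS if t in sent.lower().split()]
        if not terms or any(uses_social_sense(sent, t) for t in terms):
            kept.append(sent)
    return kept

# Ideally, the anatomical usage is dropped and the social usage is kept.
print(filter_negative_samples([
    "white matter of the spinal cord transmits motor signals",
    "white patients were more likely to receive a referral",
]))
```

In practice, a Lesk-style heuristic is noisy; the point of the sketch is only the shape of the filtering step, in which a WSD model decides whether an identifier term actually carries its social-identity sense before the sentence is admitted as a negative sample.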
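The zero-shot prompting baseline can likewise be sketched in a few lines. The prompt wording, label set, and model name below are assumptions for illustration rather than the paper's exact setup; the call uses the openai>=1.0 Python SDK with an OPENAI_API_KEY in the environment.

```python
# Minimal zero-shot bias-classification sketch with an OpenAI chat model.
# Prompt, labels, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are reviewing medical curriculum text for potential bias against "
    "social groups. Reply with exactly one word: 'biased' or 'unbiased'.\n\n"
    "Excerpt: {excerpt}"
)

def classify_excerpt(excerpt: str, model: str = "gpt-4o-mini") -> str:
    """Ask the chat model for a one-word bias label on a curriculum excerpt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(excerpt=excerpt)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```

A few-shot variant would prepend a handful of expert-labeled excerpts to the same prompt before the excerpt to be classified.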