{"title":"Towards Fairer Health Recommendations: finding informative unbiased samples via Word Sense Disambiguation","authors":"Gavin Butts, Pegah Emdad, Jethro Lee, Shannon Song, Chiman Salavati, Willmar Sosa Diaz, Shiri Dori-Hacohen, Fabricio Murai","doi":"arxiv-2409.07424","DOIUrl":null,"url":null,"abstract":"There have been growing concerns around high-stake applications that rely on\nmodels trained with biased data, which consequently produce biased predictions,\noften harming the most vulnerable. In particular, biased medical data could\ncause health-related applications and recommender systems to create outputs\nthat jeopardize patient care and widen disparities in health outcomes. A recent\nframework titled Fairness via AI posits that, instead of attempting to correct\nmodel biases, researchers must focus on their root causes by using AI to debias\ndata. Inspired by this framework, we tackle bias detection in medical curricula\nusing NLP models, including LLMs, and evaluate them on a gold standard dataset\ncontaining 4,105 excerpts annotated by medical experts for bias from a large\ncorpus. We build on previous work by coauthors which augments the set of\nnegative samples with non-annotated text containing social identifier terms.\nHowever, some of these terms, especially those related to race and ethnicity,\ncan carry different meanings (e.g., \"white matter of spinal cord\"). To address\nthis issue, we propose the use of Word Sense Disambiguation models to refine\ndataset quality by removing irrelevant sentences. We then evaluate fine-tuned\nvariations of BERT models as well as GPT models with zero- and few-shot\nprompting. We found LLMs, considered SOTA on many NLP tasks, unsuitable for\nbias detection, while fine-tuned BERT models generally perform well across all\nevaluated metrics.","PeriodicalId":501112,"journal":{"name":"arXiv - CS - Computers and Society","volume":"157 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computers and Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07424","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
There have been growing concerns around high-stakes applications that rely on models trained with biased data and that consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold-standard dataset of 4,105 excerpts drawn from a large corpus and annotated for bias by medical experts. We build on previous work by coauthors, which augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry meanings unrelated to social identity (e.g., "white matter of spinal cord"). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variations of BERT models as well as GPT models with zero- and few-shot prompting. We find that LLMs, considered SOTA on many NLP tasks, are unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
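
To make the Word Sense Disambiguation filtering idea concrete, the sketch below shows one way such a filter could look. It is not the authors' pipeline: it uses NLTK's classic Lesk algorithm over WordNet, and the ambiguous-term list and sense-cue keywords are hypothetical placeholders chosen for illustration.

```python
# Illustrative sketch (not the paper's actual pipeline): drop candidate negative
# samples in which a race/ethnicity term is used in a non-social sense
# (e.g., "white" in "white matter of the spinal cord").
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

# Hypothetical subset of identifier terms with ambiguous clinical meanings.
AMBIGUOUS_TERMS = {"white", "black"}
# Illustrative heuristic: cues suggesting the disambiguated sense refers to people.
SOCIAL_SENSE_CUES = {"race", "person", "people", "ethnic", "descent"}

def uses_social_sense(sentence: str, term: str) -> bool:
    """Return True if Lesk assigns `term` a people-related WordNet sense."""
    tokens = sentence.lower().split()
    synset = lesk(tokens, term)  # picks the sense whose gloss best overlaps the context
    if synset is None:
        return False
    definition = synset.definition().lower()
    return any(cue in definition for cue in SOCIAL_SENSE_CUES)

def filter_negative_samples(sentences: list[str]) -> list[str]:
    """Keep sentences with no ambiguous term, or where the term is a social use."""
    kept = []
    for sent in sentences:
        terms = [t for t in AMBIGUOUS_TERMS if t in sent.lower().split()]
        if not terms or any(uses_social_sense(sent, t) for t in terms):
            kept.append(sent)
    return kept

# Ideally, the anatomical usage is dropped and the social usage is kept.
print(filter_negative_samples([
    "white matter of the spinal cord transmits motor signals",
    "white patients were more likely to receive a referral",
]))
```

In practice, a Lesk-style heuristic is noisy; the point of the sketch is only the shape of the filtering step, in which a WSD model decides whether an identifier term actually carries its social-identity sense before the sentence is admitted as a negative sample.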
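The zero-shot prompting baseline can likewise be sketched in a few lines. The prompt wording, label set, and model name below are assumptions for illustration rather than the paper's exact setup; the call uses the openai>=1.0 Python SDK with an OPENAI_API_KEY in the environment.

```python
# Minimal zero-shot bias-classification sketch with an OpenAI chat model.
# Prompt, labels, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are reviewing medical curriculum text for potential bias against "
    "social groups. Reply with exactly one word: 'biased' or 'unbiased'.\n\n"
    "Excerpt: {excerpt}"
)

def classify_excerpt(excerpt: str, model: str = "gpt-4o-mini") -> str:
    """Ask the chat model for a one-word bias label on a curriculum excerpt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(excerpt=excerpt)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```

A few-shot variant would prepend a handful of expert-labeled excerpts to the same prompt before the excerpt to be classified.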