Auditing Learned Associations in Deep Learning Approaches to Extract Race and Ethnicity from Clinical Text.
Oliver J Bear Don't Walk IV, Adrienne Pichon, Harry Reyes Nieva, Tony Sun, Jaan Altosaar, Karthik Natarajan, Adler Perotte, Peter Tarczy-Hornoch, Dina Demner-Fushman, Noémie Elhadad
{"title":"审核深度学习方法中的学习关联,从临床文本中提取种族和民族。","authors":"Oliver J Bear Don't Walk Iv, Adrienne Pichon, Harry Reyes Nieva, Tony Sun, Jaan Altosaar, Karthik Natarajan, Adler Perotte, Peter Tarczy-Hornoch, Dina Demner-Fushman, Noémie Elhadad","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Complete and accurate race and ethnicity (RE) patient information is important for many areas of biomedical informatics research, such as defining and characterizing cohorts, performing quality assessments, and identifying health inequities. Patient-level RE data is often inaccurate or missing in structured sources, but can be supplemented through clinical notes and natural language processing (NLP). While NLP has made many improvements in recent years with large language models, bias remains an often-unaddressed concern, with research showing that harmful and negative language is more often used for certain racial/ethnic groups than others. We present an approach to audit the learned associations of models trained to identify RE information in clinical text by measuring the concordance between model-derived salient features and manually identified RE-related spans of text. We show that while models perform well on the surface, there exist concerning learned associations and potential for future harms from RE-identification models if left unaddressed.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. AMIA Symposium","volume":"2023 ","pages":"289-298"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10785932/pdf/","citationCount":"0","resultStr":"{\"title\":\"Auditing Learned Associations in Deep Learning Approaches to Extract Race and Ethnicity from Clinical Text.\",\"authors\":\"Oliver J Bear Don't Walk Iv, Adrienne Pichon, Harry Reyes Nieva, Tony Sun, Jaan Altosaar, Karthik Natarajan, Adler Perotte, Peter Tarczy-Hornoch, Dina Demner-Fushman, Noémie Elhadad\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Complete and accurate race and ethnicity (RE) patient information is important for many areas of biomedical informatics research, such as defining and characterizing cohorts, performing quality assessments, and identifying health inequities. Patient-level RE data is often inaccurate or missing in structured sources, but can be supplemented through clinical notes and natural language processing (NLP). While NLP has made many improvements in recent years with large language models, bias remains an often-unaddressed concern, with research showing that harmful and negative language is more often used for certain racial/ethnic groups than others. We present an approach to audit the learned associations of models trained to identify RE information in clinical text by measuring the concordance between model-derived salient features and manually identified RE-related spans of text. We show that while models perform well on the surface, there exist concerning learned associations and potential for future harms from RE-identification models if left unaddressed.</p>\",\"PeriodicalId\":72180,\"journal\":{\"name\":\"AMIA ... Annual Symposium proceedings. 
AMIA Symposium\",\"volume\":\"2023 \",\"pages\":\"289-298\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10785932/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AMIA ... Annual Symposium proceedings. AMIA Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2023/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AMIA ... Annual Symposium proceedings. AMIA Symposium","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2023/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Complete and accurate race and ethnicity (RE) patient information is important for many areas of biomedical informatics research, such as defining and characterizing cohorts, performing quality assessments, and identifying health inequities. Patient-level RE data is often inaccurate or missing in structured sources, but can be supplemented through clinical notes and natural language processing (NLP). While NLP has made many improvements in recent years with large language models, bias remains an often-unaddressed concern, with research showing that harmful and negative language is more often used for certain racial/ethnic groups than others. We present an approach to audit the learned associations of models trained to identify RE information in clinical text by measuring the concordance between model-derived salient features and manually identified RE-related spans of text. We show that while models perform well on the surface, there exist concerning learned associations and potential for future harms from RE-identification models if left unaddressed.
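As a rough illustration of the audit the abstract describes, the sketch below measures concordance between the text a model's saliency method highlights and manually annotated RE-related spans, as a precision/recall-style character overlap. This is a minimal sketch of the general idea, not the authors' implementation; the function names, span representation, and toy example are all hypothetical.

# Hypothetical sketch: compare model-salient spans against manually
# annotated race/ethnicity (RE) spans. Spans are (start, end) character
# offsets with an exclusive end, as in Python slicing.

def spans_to_char_set(spans):
    """Expand (start, end) character spans into the set of covered offsets."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered

def concordance(salient_spans, annotated_spans):
    """Return (precision, recall) of salient text against annotated RE text.

    Precision: fraction of model-salient characters inside annotated spans.
    Recall: fraction of annotated RE characters the model marks salient.
    """
    salient = spans_to_char_set(salient_spans)
    annotated = spans_to_char_set(annotated_spans)
    if not salient or not annotated:
        return 0.0, 0.0
    overlap = len(salient & annotated)
    return overlap / len(salient), overlap / len(annotated)

# Toy note: "Patient is a 54 yo Black woman with hx of htn."
# Suppose the saliency method highlights "Black woman" (offsets 19-30)
# while the annotator marked only "Black" (offsets 19-24) as RE-related.
salient_spans = [(19, 30)]
annotated_spans = [(19, 24)]
precision, recall = concordance(salient_spans, annotated_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")  # 0.45, 1.00

Under this toy setup, perfect recall with low precision would indicate the model is leaning on text outside the annotated RE span (here, a gendered word), which is exactly the kind of concerning learned association an audit like this is meant to surface.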