The Data Artifacts Glossary: a community-based repository for bias on health datasets.

IF 12.1, Zone 2 (Medicine), JCR Q1 (Cell Biology)

Journal of Biomedical Science, Pub Date: 2025-02-04, DOI: 10.1186/s12929-024-01106-6

Rodrigo R Gameiro, Naira Link Woite, Christopher M Sauer, Sicheng Hao, Chrystinne Oliveira Fernandes, Anna E Premo, Alice Rangel Teixeira, Isabelle Resli, An-Kwok Ian Wong, Leo Anthony Celi

Journal of Biomedical Science 32(1):14. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792693/pdf/

Abstract

Background: The deployment of Artificial Intelligence (AI) in healthcare has the potential to transform patient care through improved diagnostics, personalized treatment plans, and more efficient resource management. However, the effectiveness and fairness of AI are critically dependent on the data it learns from. Biased datasets can lead to AI outputs that perpetuate disparities, particularly affecting social minorities and marginalized groups.

Objective: This paper introduces the "Data Artifacts Glossary", a dynamic, open-source framework designed to systematically document and update potential biases in healthcare datasets. The aim is to provide a comprehensive tool that enhances the transparency and accuracy of AI applications in healthcare and contributes to understanding and addressing health inequities.

Methods: Utilizing a methodology inspired by the Delphi method, a diverse team of experts conducted iterative rounds of discussions and literature reviews. The team synthesized insights to develop a comprehensive list of bias categories and designed the glossary's structure. The Data Artifacts Glossary was piloted using the MIMIC-IV dataset to validate its utility and structure.

Results: The Data Artifacts Glossary adopts a collaborative approach modeled on successful open-source projects like Linux and Python. Hosted on GitHub, it utilizes robust version control and collaborative features, allowing stakeholders from diverse backgrounds to contribute. Through a rigorous peer review process managed by community members, the glossary ensures the continual refinement and accuracy of its contents. The implementation of the Data Artifacts Glossary with the MIMIC-IV dataset illustrates its utility: it categorizes biases and facilitates their identification and understanding.

Conclusion: The Data Artifacts Glossary serves as a vital resource for enhancing the integrity of AI applications in healthcare by providing a mechanism to recognize and mitigate dataset biases before they impact AI outputs. It not only aids in avoiding bias in model development but also contributes to understanding and addressing the root causes of health disparities.

Source journal

Journal of Biomedical Science (Medicine: Research & Experimental)

CiteScore: 18.50

Self-citation rate: 0.90%

Review time: 1 month

About the journal: The Journal of Biomedical Science is an open access, peer-reviewed journal that focuses on fundamental and molecular aspects of basic medical sciences. It emphasizes molecular studies of biomedical problems and mechanisms. The National Science and Technology Council (NSTC), Taiwan supports the journal and covers the publication costs for accepted articles. The journal aims to provide an international platform for interdisciplinary discussions and contribute to the advancement of medicine. It benefits both readers and authors by accelerating the dissemination of research information and providing maximum access to scholarly communication. All articles published in the Journal of Biomedical Science are included in various databases such as Biological Abstracts, BIOSIS, CABI, CAS, Citebase, Current Contents, DOAJ, Embase, EmBiology, and Global Health, among others.