On responsible machine learning datasets emphasizing fairness, privacy and regulatory norms with examples in biometrics and healthcare

IF 18.8 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Nature Machine Intelligence Pub Date : 2024-08-12 DOI:10.1038/s42256-024-00874-y

Surbhi Mittal, Kartik Thakral, Richa Singh, Mayank Vatsa, Tamar Glaser, Cristian Canton Ferrer, Tal Hassner

Nature Machine Intelligence, volume 6 (8), pages 936-949. Journal Article. https://www.nature.com/articles/s42256-024-00874-y

Citations: 0

Abstract

Artificial Intelligence (AI) has seamlessly integrated into numerous scientific domains, catalysing unparalleled enhancements across a broad spectrum of tasks; however, its integrity and trustworthiness have emerged as notable concerns. The scientific community has focused on the development of trustworthy AI algorithms; however, machine learning and deep learning algorithms, popular in the AI community today, intrinsically rely on the quality of their training data. These algorithms are designed to detect patterns within the data, thereby learning the intended behavioural objectives. Any inadequacy in the data has the potential to translate directly into algorithms. In this study we discuss the importance of responsible machine learning datasets through the lens of fairness, privacy and regulatory compliance, and present a large audit of computer vision datasets. Despite the ubiquity of fairness and privacy challenges across diverse data domains, current regulatory frameworks primarily address human-centric data concerns. We therefore focus our discussion on biometric and healthcare datasets, although the principles we outline are broadly applicable across various domains. The audit is conducted through evaluation of the proposed responsible rubric. After surveying over 100 datasets, our detailed analysis of 60 distinct datasets highlights a universal susceptibility to fairness, privacy and regulatory compliance issues. This finding emphasizes the urgent need for revising dataset creation methodologies within the scientific community, especially in light of global advancements in data protection legislation. We assert that our study is critically relevant in the contemporary AI context, offering insights and recommendations that are both timely and essential for the ongoing evolution of AI technologies. 
There are pervasive concerns related to fairness, privacy and regulatory compliance in machine learning applications in healthcare, necessitating a reevaluation of dataset creation practices. Mittal et al. examine various computer vision datasets, providing insights to foster responsible AI development.

Source journal

Nature Machine Intelligence Multiple-

CiteScore

36.90

Self-citation rate

2.10%

Articles published

127

About the journal: Nature Machine Intelligence is a distinguished publication that presents original research and reviews on various topics in machine learning, robotics, and AI. Our focus extends beyond these fields, exploring their profound impact on other scientific disciplines, as well as societal and industrial aspects. We recognize limitless possibilities wherein machine intelligence can augment human capabilities and knowledge in domains like scientific exploration, healthcare, medical diagnostics, and the creation of safe and sustainable cities, transportation, and agriculture. Simultaneously, we acknowledge the emergence of ethical, social, and legal concerns due to the rapid pace of advancements. To foster interdisciplinary discussions on these far-reaching implications, Nature Machine Intelligence serves as a platform for dialogue facilitated through Comments, News Features, News & Views articles, and Correspondence. Our goal is to encourage a comprehensive examination of these subjects. Similar to all Nature-branded journals, Nature Machine Intelligence operates under the guidance of a team of skilled editors. We adhere to a fair and rigorous peer-review process, ensuring high standards of copy-editing and production, swift publication, and editorial independence.