Automated Data Cleaning can Hurt Fairness in Machine Learning-Based Decision Making

IF 8.9 | Zone 2, Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering | Pub Date: 2024-02-13 | DOI: 10.1109/TKDE.2024.3365524

Shubha Guha;Falaah Arif Khan;Julia Stoyanovich;Sebastian Schelter

Vol. 36, no. 12, pp. 7368-7379. Available at https://ieeexplore.ieee.org/document/10433778/

Citations: 0

Abstract

In this paper, we interrogate whether data quality issues track demographic group membership (based on sex, race and age) and whether automated data cleaning, of the kind commonly used in production ML systems, impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature. We first analyse the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations. We observe that, while automated data cleaning is unlikely to worsen accuracy, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. Furthermore, we find that the positive or negative impact of a particular cleaning technique often depends on the choice of fairness metric and group definition (single-attribute or intersectional). We make our code and experimental results publicly available. The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported by data engineering tools, and requires substantial data engineering research. Towards this goal, we discuss open research questions and envision the development of fairness-aware data cleaning methods and their integration into complex pipelines for ML-based decision making.
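The two questions in the abstract can be illustrated with a minimal sketch. It is not the authors' released code. The function names, the toy records, the hypothetical predictions, and the choice of demographic parity and equal opportunity as metrics are all assumptions made here for illustration. The first helper checks whether missing values track group membership. The other two compare group gaps for model predictions before and after a cleaning step.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_key, feature):
    """Fraction of rows in each group whose `feature` value is None."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        g = row[group_key]
        total[g] += 1
        if row.get(feature) is None:
            missing[g] += 1
    return {g: missing[g] / total[g] for g in total}

def positive_rate_gap(preds, groups):
    """Demographic parity difference: max minus min positive-prediction rate."""
    pos = defaultdict(int)
    n = defaultdict(int)
    for p, g in zip(preds, groups):
        n[g] += 1
        pos[g] += p
    rates = [pos[g] / n[g] for g in n]
    return max(rates) - min(rates)

def true_positive_rate_gap(preds, labels, groups):
    """Equal opportunity difference: max minus min true positive rate."""
    tp = defaultdict(int)
    positives = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        if y == 1:
            positives[g] += 1
            tp[g] += p
    rates = [tp[g] / positives[g] for g in positives]
    return max(rates) - min(rates)

# Toy records (invented for illustration): group "b" has more missing incomes.
rows = [
    {"group": "a", "income": 50}, {"group": "a", "income": None},
    {"group": "a", "income": 60}, {"group": "a", "income": 70},
    {"group": "b", "income": None}, {"group": "b", "income": None},
    {"group": "b", "income": 40}, {"group": "b", "income": 45},
]
groups = [r["group"] for r in rows]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

# Hypothetical predictions from a model trained without and with cleaning.
preds_before = [1, 1, 0, 0, 1, 0, 0, 0]
preds_after = [1, 1, 1, 1, 1, 1, 0, 0]

print(missing_rate_by_group(rows, "group", "income"))
print(positive_rate_gap(preds_before, groups), positive_rate_gap(preds_after, groups))
print(true_positive_rate_gap(preds_before, labels, groups),
      true_positive_rate_gap(preds_after, labels, groups))
```

On this invented data the equal opportunity gap shrinks after cleaning while the demographic parity gap grows. This is a toy instance of the abstract's observation that the measured effect of a cleaning technique can depend on the fairness metric chosen.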


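Source journal

IEEE Transactions on Knowledge and Data Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 11.70

Self-citation rate: 3.40%

Articles published: 515

Review time: 6 months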

About the journal: The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.