Empirical analysis of fairness-aware data segmentation
Seiji Okura, T. Mohri
2022 IEEE International Conference on Data Mining Workshops (ICDMW), 2022-11-01
DOI: 10.1109/ICDMW58026.2022.00029 (https://doi.org/10.1109/ICDMW58026.2022.00029)
Citation count: 0
Abstract
Fairness in machine learning is a recently established research area that aims to mitigate the bias of unfair models, which treat unprivileged people unfavorably based on protected attributes. We take an approach to mitigating such bias based on data segmentation, that is, dividing data into segments within which people should be treated similarly. Such an approach should be useful because the mitigation process itself is explainable in cases where similar people should be treated similarly. Although research on such cases exists, the effectiveness of data segmentation itself remains an open question. In this paper, we address this question by empirically analyzing the results of data segmentation experiments on two datasets: the UCI Adult dataset and the Kaggle ‘Give me some credit’ (gmsc) dataset. We empirically show that (1) fairness can be controlled during model training through the way data is divided into segments, more specifically by selecting the attributes and setting the number of segments so as to adjust statistics such as the statistical parity of the segments and the mutual information between the attributes; (2) the effect of data segmentation depends on the classifier; and (3) there are weak trade-offs between fairness and accuracy with regard to data segmentation.
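The two statistics named in the abstract, statistical parity within a segment and mutual information between attributes, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the row keys (`segment`, `protected`, `label`), the encoding of the protected attribute (0 = unprivileged, 1 = privileged), and the toy rows are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def statistical_parity_difference(labels, protected):
    """P(label=1 | protected=0) - P(label=1 | protected=1).

    Assumed encoding: protected=0 is the unprivileged group,
    protected=1 the privileged group. A value of 0 means parity.
    """
    positives = Counter()
    totals = Counter()
    for y, a in zip(labels, protected):
        totals[a] += 1
        positives[a] += y
    if totals[0] == 0 or totals[1] == 0:
        return None  # undefined when one group is absent
    return positives[0] / totals[0] - positives[1] / totals[1]


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # Sum of p(x,y) * log(p(x,y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def parity_by_segment(rows, segment_key="segment"):
    """Group rows by a segment attribute and compute parity per segment."""
    segments = defaultdict(list)
    for row in rows:
        segments[row[segment_key]].append(row)
    return {
        seg: statistical_parity_difference(
            [r["label"] for r in members],
            [r["protected"] for r in members],
        )
        for seg, members in segments.items()
    }


# Hypothetical toy data, not taken from the Adult or gmsc datasets.
rows = [
    {"segment": "A", "protected": 0, "label": 1},
    {"segment": "A", "protected": 0, "label": 0},
    {"segment": "A", "protected": 1, "label": 1},
    {"segment": "A", "protected": 1, "label": 1},
    {"segment": "B", "protected": 0, "label": 0},
    {"segment": "B", "protected": 0, "label": 1},
    {"segment": "B", "protected": 1, "label": 0},
    {"segment": "B", "protected": 1, "label": 1},
]

per_segment = parity_by_segment(rows)
mi_segment_protected = mutual_information(
    [r["segment"] for r in rows], [r["protected"] for r in rows]
)
```

In this toy example segment A shows a parity gap while segment B does not, and the segment attribute carries no information about the protected attribute. Changing which attribute defines the segments, or how many segments there are, changes both statistics, which is the kind of adjustment the abstract describes.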