Fair-SSL: Building fair ML Software with less data

Joymallya Chakraborty, Suvodeep Majumder, Huy Tu
{"title":"Fair-SSL: Building fair ML Software with less data","authors":"Joymallya Chakraborty, Suvodeep Majumder, Huy Tu","doi":"10.1145/3524491.3527305","DOIUrl":null,"url":null,"abstract":"Ethical bias in machine learning models has become a matter of concern in the software engineering community. Most of the prior software engineering works concentrated on finding ethical bias in models rather than fixing it. After finding bias, the next step is mitigation. Prior researchers mainly tried to use supervised approaches to achieve fairness. However, in the real world, getting data with trustworthy ground truth is challenging and also ground truth can contain human bias. Semi-supervised learning is a technique where, incrementally, labeled data is used to generate pseudo-labels for the rest of data (and then all that data is used for model training). In this work, we apply four popular semi-supervised techniques as pseudo-labelers to create fair classification models. Our framework, Fair-SSL, takes a very small amount (10%) of labeled data as input and generates pseudo-labels for the unlabeled data. We then synthetically generate new data points to balance the training data based on class and protected attribute as proposed by Chakraborty et al. in FSE 2021. Finally, classification model is trained on the balanced pseudo-labeled data and validated on test data. After experimenting on ten datasets and three learners, we find that Fair-SSL achieves similar performance as three state-of-the-art bias mitigation algorithms. That said, the clear advantage of Fair-SSL is that it requires only 10% of the labeled training data. To the best of our knowledge, this is the first SE work where semi-supervised techniques are used to fight against ethical bias in SE ML models. To facilitate open science and replication, all our source code and datasets are publicly available at https://github.com/joymallyac/FairSSL. CCS CONCEPTS • Software and its engineering → Software creation and management; • Computing methodologies → Machine learning. ACM Reference Format: Joymallya Chakraborty, Suvodeep Majumder, and Huy Tu. 2022. Fair-SSL: Building fair ML Software with less data. In International Workshop on Equitable Data and Technology (FairWare ‘22), May 9, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3524491.3527305","PeriodicalId":287874,"journal":{"name":"2022 IEEE/ACM International Workshop on Equitable Data & Technology (FairWare)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM International Workshop on Equitable Data & Technology (FairWare)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3524491.3527305","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Ethical bias in machine learning models has become a matter of concern in the software engineering community. Most prior software engineering work concentrated on finding ethical bias in models rather than fixing it. After finding bias, the next step is mitigation. Prior researchers mainly tried to use supervised approaches to achieve fairness. However, in the real world, getting data with trustworthy ground truth is challenging, and the ground truth itself can contain human bias. Semi-supervised learning is a technique in which a small amount of labeled data is used, incrementally, to generate pseudo-labels for the rest of the data (and then all of that data is used for model training). In this work, we apply four popular semi-supervised techniques as pseudo-labelers to create fair classification models. Our framework, Fair-SSL, takes a very small amount (10%) of labeled data as input and generates pseudo-labels for the unlabeled data. We then synthetically generate new data points to balance the training data based on class and protected attribute, as proposed by Chakraborty et al. in FSE 2021. Finally, a classification model is trained on the balanced pseudo-labeled data and validated on test data. After experimenting on ten datasets and three learners, we find that Fair-SSL achieves performance similar to that of three state-of-the-art bias mitigation algorithms. That said, the clear advantage of Fair-SSL is that it requires only 10% of the labeled training data. To the best of our knowledge, this is the first SE work in which semi-supervised techniques are used to fight ethical bias in SE ML models. To facilitate open science and replication, all our source code and datasets are publicly available at https://github.com/joymallyac/FairSSL.

CCS Concepts

• Software and its engineering → Software creation and management; • Computing methodologies → Machine learning.

ACM Reference Format

Joymallya Chakraborty, Suvodeep Majumder, and Huy Tu. 2022. Fair-SSL: Building fair ML Software with less data. In International Workshop on Equitable Data and Technology (FairWare '22), May 9, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3524491.3527305
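To make the pipeline concrete, the sketch below walks through the three steps the abstract describes: pseudo-labeling from a 10% labeled sample, balancing the training data by class and protected attribute, and training the final classifier. It is a minimal illustration under stated assumptions, not the authors' implementation: scikit-learn's SelfTrainingClassifier stands in for one of the four pseudo-labelers, plain random oversampling of the (class, protected attribute) subgroups stands in for the synthetic data generation of Chakraborty et al. (FSE 2021), and the names X_train, y_train, X_test, y_test, and sex_col are placeholders for a concrete dataset.

```python
# Minimal Fair-SSL-style sketch (assumptions noted above, not the paper's code).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier


def pseudo_label(X_train, y_train, labeled_frac=0.10, seed=0):
    """Keep labels for a small fraction of rows, mark the rest as
    unlabeled (-1), and let self-training fill in pseudo-labels."""
    rng = np.random.default_rng(seed)
    y_ssl = np.full(len(y_train), -1)
    keep = rng.choice(len(y_train), int(labeled_frac * len(y_train)),
                      replace=False)
    y_ssl[keep] = np.asarray(y_train)[keep]
    ssl = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
    ssl.fit(X_train, y_ssl)
    return ssl.predict(X_train)  # pseudo-labels for every training row


def balance_subgroups(X, y, protected, seed=0):
    """Oversample every (label, protected attribute) subgroup to the size
    of the largest one -- a crude stand-in for the synthetic generation
    step that Fair-SSL borrows from Chakraborty et al. (FSE 2021)."""
    df = pd.DataFrame(np.asarray(X))
    df["_y"], df["_p"] = np.asarray(y), np.asarray(protected)
    target = df.groupby(["_y", "_p"]).size().max()
    parts = [grp.sample(target, replace=True, random_state=seed)
             for _, grp in df.groupby(["_y", "_p"])]
    out = pd.concat(parts).sample(frac=1, random_state=seed)
    return out.drop(columns=["_y", "_p"]).to_numpy(), out["_y"].to_numpy()


# Hypothetical usage on a tabular dataset whose column `sex_col` holds the
# protected attribute:
# y_pseudo = pseudo_label(X_train, y_train)
# X_bal, y_bal = balance_subgroups(X_train, y_pseudo, X_train[:, sex_col])
# clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
# print(clf.score(X_test, y_test))
```

Any of the learners or pseudo-labelers studied in the paper could be dropped into the same skeleton; the essential contract is only that the pseudo-labeler accepts partially labeled data and that the balancing step equalizes the (label, protected attribute) subgroups before the final fit.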