‘Emerging proxies’ in information-rich machine learning: a threat to fairness?

A. McLoughney, J. Paterson, M. Cheong, Anthony Wirth
{"title":"信息丰富的机器学习中的“新兴代理”:对公平的威胁?","authors":"A. McLoughney, J. Paterson, M. Cheong, Anthony Wirth","doi":"10.1109/ETHICS57328.2023.10155045","DOIUrl":null,"url":null,"abstract":"Anti-discrimination law in many jurisdictions effectively bans the use of race and gender in automated decision-making. For example, this law means that insurance companies should not explicitly ask about legally protected attributes, e.g., race, in order to tailor their premiums to particular customers. In legal terms, indirect discrimination occurs when a generally neutral rule or variable is used, but significantly negatively affects one demographic group. An emerging example of this concern is inclusion of proxy variables in Machine Learning (ML) models, where neutral variables are predictive of protected attributes. For example, postcodes or zip codes are representative of communities, and therefore racial demographics and social-economic class; i.e., a traditional example of ‘redlining’ pre-dating modern automated techniques [1]. The law struggles with proxy variables in machine learning: indirect discrimination cases are difficult to bring to court, particularly because finding substantial evidence that shows the indirect discrimination to be unlawful is difficult [2]. With more complex machine-learning models being developed for automated decision making, e.g., random forests or state-of-the-art deep neural networks, more data points on customers are accumulated [1], from a wide variety of sources. With such rich data, ML models can produce multiple interconnected correlations - such as that found in single neurons in a neural network, or single decision trees in a random forest - which are predictive of protected attributes, akin to traditional uses of discrete proxy variables. In this poster, we introduce the concept of \"emerging proxies\", that are a combination of several variables, from which the ML model could infer the protected attribute(s) of the individuals in the dataset. This concept differs from the traditional concept of proxies because rather than addressing a single proxy variable, a distribution of interconnected proxies would have to be addressed. Our contribution is to provide evidence for the capacity of complex ML models to identify protected attributes through the correlation of other variables. This correlation is not made explicitly through a discrete one to one relationship between variables, but through a many-to-one relationship. This contribution complements concerns raised in legal analyses of automated decision-making about proxies in ML models leading to indirect discrimination [3]. Our contribution shows that if an ML model contains “emerging proxies” for a protected attribute, the distribution of proxies will be a roadblock when attempting to de-bias the model, limiting the pathways available for addressing potential discrimination caused by the ML model.","PeriodicalId":203527,"journal":{"name":"2023 IEEE International Symposium on Ethics in Engineering, Science, and Technology (ETHICS)","volume":"170 1-2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"‘Emerging proxies’ in information-rich machine learning: a threat to fairness?\",\"authors\":\"A. McLoughney, J. Paterson, M. 
Cheong, Anthony Wirth\",\"doi\":\"10.1109/ETHICS57328.2023.10155045\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Anti-discrimination law in many jurisdictions effectively bans the use of race and gender in automated decision-making. For example, this law means that insurance companies should not explicitly ask about legally protected attributes, e.g., race, in order to tailor their premiums to particular customers. In legal terms, indirect discrimination occurs when a generally neutral rule or variable is used, but significantly negatively affects one demographic group. An emerging example of this concern is inclusion of proxy variables in Machine Learning (ML) models, where neutral variables are predictive of protected attributes. For example, postcodes or zip codes are representative of communities, and therefore racial demographics and social-economic class; i.e., a traditional example of ‘redlining’ pre-dating modern automated techniques [1]. The law struggles with proxy variables in machine learning: indirect discrimination cases are difficult to bring to court, particularly because finding substantial evidence that shows the indirect discrimination to be unlawful is difficult [2]. With more complex machine-learning models being developed for automated decision making, e.g., random forests or state-of-the-art deep neural networks, more data points on customers are accumulated [1], from a wide variety of sources. With such rich data, ML models can produce multiple interconnected correlations - such as that found in single neurons in a neural network, or single decision trees in a random forest - which are predictive of protected attributes, akin to traditional uses of discrete proxy variables. In this poster, we introduce the concept of \\\"emerging proxies\\\", that are a combination of several variables, from which the ML model could infer the protected attribute(s) of the individuals in the dataset. This concept differs from the traditional concept of proxies because rather than addressing a single proxy variable, a distribution of interconnected proxies would have to be addressed. Our contribution is to provide evidence for the capacity of complex ML models to identify protected attributes through the correlation of other variables. This correlation is not made explicitly through a discrete one to one relationship between variables, but through a many-to-one relationship. This contribution complements concerns raised in legal analyses of automated decision-making about proxies in ML models leading to indirect discrimination [3]. 
Our contribution shows that if an ML model contains “emerging proxies” for a protected attribute, the distribution of proxies will be a roadblock when attempting to de-bias the model, limiting the pathways available for addressing potential discrimination caused by the ML model.\",\"PeriodicalId\":203527,\"journal\":{\"name\":\"2023 IEEE International Symposium on Ethics in Engineering, Science, and Technology (ETHICS)\",\"volume\":\"170 1-2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE International Symposium on Ethics in Engineering, Science, and Technology (ETHICS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ETHICS57328.2023.10155045\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Symposium on Ethics in Engineering, Science, and Technology (ETHICS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ETHICS57328.2023.10155045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Anti-discrimination law in many jurisdictions effectively bans the use of race and gender in automated decision-making. For example, this law means that insurance companies should not explicitly ask about legally protected attributes, e.g., race, in order to tailor their premiums to particular customers. In legal terms, indirect discrimination occurs when a generally neutral rule or variable is used but significantly negatively affects one demographic group. An emerging example of this concern is the inclusion of proxy variables in Machine Learning (ML) models, where neutral variables are predictive of protected attributes. For example, postcodes or zip codes are representative of communities, and therefore of racial demographics and socio-economic class: a traditional example of ‘redlining’ pre-dating modern automated techniques [1]. The law struggles with proxy variables in machine learning: indirect discrimination cases are difficult to bring to court, particularly because it is difficult to find substantial evidence showing the indirect discrimination to be unlawful [2]. As more complex machine-learning models are developed for automated decision-making, e.g., random forests or state-of-the-art deep neural networks, more data points on customers are accumulated [1], from a wide variety of sources. With such rich data, ML models can produce multiple interconnected correlations, such as those found in single neurons in a neural network or single decision trees in a random forest, which are predictive of protected attributes, akin to traditional uses of discrete proxy variables. In this poster, we introduce the concept of "emerging proxies": combinations of several variables from which an ML model could infer the protected attribute(s) of the individuals in the dataset. This concept differs from the traditional notion of a proxy because, rather than a single proxy variable, a whole distribution of interconnected proxies would have to be addressed. Our contribution is to provide evidence of the capacity of complex ML models to identify protected attributes through the correlation of other variables. This correlation is made not through a discrete one-to-one relationship between variables, but through a many-to-one relationship. This complements concerns raised in legal analyses of automated decision-making about proxies in ML models leading to indirect discrimination [3]. We show that if an ML model contains "emerging proxies" for a protected attribute, the distribution of proxies becomes a roadblock when attempting to de-bias the model, limiting the pathways available for addressing potential discrimination caused by the model.
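
To make the "emerging proxy" idea concrete, here is a minimal sketch (not the authors' method; all data synthetic and names illustrative): five "neutral" features are each only weakly correlated with a hidden protected attribute, yet a random forest trained on all of them together recovers that attribute noticeably better than any single feature allows, a many-to-one relationship. Dropping any one feature barely reduces this accuracy, illustrating why de-biasing by removing a single suspect column fails.

```python
# Minimal sketch of an "emerging proxy" on hypothetical synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 5_000

# Hidden protected attribute (e.g., a binary demographic label).
protected = rng.integers(0, 2, size=n)

# Five "neutral" features; each is only weakly shifted by the protected
# attribute, so no single feature is a strong proxy on its own.
X = rng.normal(size=(n, 5)) + 0.5 * protected[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(X, protected, random_state=0)

# One feature alone: a weak, traditional-style discrete proxy.
weak = RandomForestClassifier(random_state=0).fit(X_tr[:, :1], y_tr)
print("one feature:   ", accuracy_score(y_te, weak.predict(X_te[:, :1])))

# All features together: the protected attribute "emerges" from the
# many-to-one combination of individually weak signals.
full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("all features:  ", accuracy_score(y_te, full.predict(X_te)))

# Dropping any single feature barely dents the combined signal, which is
# why removing one suspect column does not de-bias the model.
drop = RandomForestClassifier(random_state=0).fit(X_tr[:, 1:], y_tr)
print("drop feature 0:", accuracy_score(y_te, drop.predict(X_te[:, 1:])))
```

This probe-style audit, training a model to predict the protected attribute from the remaining features, is one straightforward way to surface distributed proxies before deployment.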