An Empirical Study on the Fairness of Pre-trained Word Embeddings

E. Sesari, Max Hort, Federica Sarro
{"title":"An Empirical Study on the Fairness of Pre-trained Word Embeddings","authors":"E. Sesari, Max Hort, Federica Sarro","doi":"10.18653/v1/2022.gebnlp-1.15","DOIUrl":null,"url":null,"abstract":"Pre-trained word embedding models are easily distributed and applied, as they alleviate users from the effort to train models themselves. With widely distributed models, it is important to ensure that they do not exhibit undesired behaviour, such as biases against population groups. For this purpose, we carry out an empirical study on evaluating the bias of 15 publicly available, pre-trained word embeddings model based on three training algorithms (GloVe, word2vec, and fastText) with regard to four bias metrics (WEAT, SEMBIAS,DIRECT BIAS, and ECT). The choice of word embedding models and bias metrics is motivated by a literature survey over 37 publications which quantified bias on pre-trained word embeddings. Our results indicate that fastText is the least biased model (in 8 out of 12 cases) and small vector lengths lead to a higher bias.","PeriodicalId":161909,"journal":{"name":"Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.gebnlp-1.15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Pre-trained word embedding models are easily distributed and applied, as they spare users the effort of training models themselves. With widely distributed models, it is important to ensure that they do not exhibit undesired behaviour, such as biases against population groups. For this purpose, we carry out an empirical study evaluating the bias of 15 publicly available, pre-trained word embedding models based on three training algorithms (GloVe, word2vec, and fastText) with regard to four bias metrics (WEAT, SEMBIAS, DIRECT BIAS, and ECT). The choice of word embedding models and bias metrics is motivated by a literature survey of 37 publications which quantified bias in pre-trained word embeddings. Our results indicate that fastText is the least biased model (in 8 out of 12 cases) and that smaller vector lengths lead to higher bias.
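To make the evaluation concrete, the sketch below illustrates how one of the four bias metrics, the WEAT effect size (Caliskan et al., 2017), can be computed over a pre-trained embedding. This is a minimal illustration, not the paper's evaluation code: the word lists and the model path are placeholders, and gensim's KeyedVectors is assumed as the loading interface.

```python
# Minimal sketch of the WEAT effect size over a pre-trained embedding.
# Assumptions: gensim is available, and the target/attribute word lists
# below are illustrative placeholders, not the sets used in the paper.
import numpy as np
from gensim.models import KeyedVectors

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, emb):
    # s(w, A, B): mean similarity of w to attribute set A minus set B.
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    # d: difference in mean association between target sets X and Y,
    # normalised by the standard deviation over all targets (X union Y).
    s_X = [association(x, A, B, emb) for x in X]
    s_Y = [association(y, A, B, emb) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y)

# Hypothetical usage with a word2vec-format model (path is a placeholder):
# emb = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
# X, Y = ["programmer", "engineer"], ["nurse", "teacher"]  # target words
# A, B = ["he", "man"], ["she", "woman"]                   # attribute words
# print(weat_effect_size(X, Y, A, B, emb))
```

An effect size closer to zero indicates a weaker differential association between the target and attribute sets; the sign indicates the direction of that association.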