An Empirical Study on the Fairness of Pre-trained Word Embeddings

E. Sesari, Max Hort, Federica Sarro
{"title":"An Empirical Study on the Fairness of Pre-trained Word Embeddings","authors":"E. Sesari, Max Hort, Federica Sarro","doi":"10.18653/v1/2022.gebnlp-1.15","DOIUrl":null,"url":null,"abstract":"Pre-trained word embedding models are easily distributed and applied, as they alleviate users from the effort to train models themselves. With widely distributed models, it is important to ensure that they do not exhibit undesired behaviour, such as biases against population groups. For this purpose, we carry out an empirical study on evaluating the bias of 15 publicly available, pre-trained word embeddings model based on three training algorithms (GloVe, word2vec, and fastText) with regard to four bias metrics (WEAT, SEMBIAS,DIRECT BIAS, and ECT). The choice of word embedding models and bias metrics is motivated by a literature survey over 37 publications which quantified bias on pre-trained word embeddings. Our results indicate that fastText is the least biased model (in 8 out of 12 cases) and small vector lengths lead to a higher bias.","PeriodicalId":161909,"journal":{"name":"Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.gebnlp-1.15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Pre-trained word embedding models are easily distributed and applied, as they spare users the effort of training models themselves. With widely distributed models, it is important to ensure that they do not exhibit undesired behaviour, such as biases against population groups. For this purpose, we carry out an empirical study evaluating the bias of 15 publicly available, pre-trained word embedding models based on three training algorithms (GloVe, word2vec, and fastText) with regard to four bias metrics (WEAT, SEMBIAS, DIRECT BIAS, and ECT). The choice of word embedding models and bias metrics is motivated by a literature survey of 37 publications which quantified bias in pre-trained word embeddings. Our results indicate that fastText is the least biased model (in 8 out of 12 cases) and that smaller vector lengths lead to higher bias.
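To make the evaluation concrete, the sketch below illustrates how one of the four bias metrics, the WEAT effect size (Caliskan et al., 2017), can be computed over a pre-trained embedding. This is a minimal illustration, not the paper's evaluation code: the word lists and the model path are placeholders, and gensim's KeyedVectors is assumed as the loading interface.

```python
# Minimal sketch of the WEAT effect size over a pre-trained embedding.
# Assumptions: gensim is available, and the target/attribute word lists
# below are illustrative placeholders, not the sets used in the paper.
import numpy as np
from gensim.models import KeyedVectors

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, emb):
    # s(w, A, B): mean similarity of w to attribute set A minus set B.
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    # d: difference in mean association between target sets X and Y,
    # normalised by the standard deviation over all targets (X union Y).
    s_X = [association(x, A, B, emb) for x in X]
    s_Y = [association(y, A, B, emb) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y)

# Hypothetical usage with a word2vec-format model (path is a placeholder):
# emb = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
# X, Y = ["programmer", "engineer"], ["nurse", "teacher"]  # target words
# A, B = ["he", "man"], ["she", "woman"]                   # attribute words
# print(weat_effect_size(X, Y, A, B, emb))
```

An effect size closer to zero indicates a weaker differential association between the target and attribute sets; the sign indicates the direction of that association.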