BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla

arXiv - CS - Computation and Language. Pub Date: 2024-09-18. DOI: arxiv-2409.11638

Mahammed Kamruzzaman, Abdullah Al Monsur, Shrabon Das, Enamul Hassan, Gene Louis Kim



Abstract

This study presents BanStereoSet, a dataset designed to evaluate stereotypical social biases in multilingual LLMs for the Bangla language. To extend the focus of bias research beyond English-centric datasets, we have localized content from StereoSet, IndiBias, and the datasets of Kamruzzaman et al., producing a resource tailored to capture biases prevalent within the Bangla-speaking community. BanStereoSet consists of 1,194 sentences spanning 9 categories of bias: race, profession, gender, ageism, beauty, beauty in profession, region, caste, and religion. The dataset serves as a tool for measuring bias in multilingual LLMs and also facilitates the exploration of stereotypical bias across different social categories, potentially guiding the development of more equitable language technologies in Bangladeshi contexts. Our analysis of several language models using this dataset indicates significant biases, reinforcing the need for culturally and linguistically adapted datasets.


