BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla

Mahammed Kamruzzaman, Abdullah Al Monsur, Shrabon Das, Enamul Hassan, Gene Louis Kim
{"title":"BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla","authors":"Mahammed Kamruzzaman, Abdullah Al Monsur, Shrabon Das, Enamul Hassan, Gene Louis Kim","doi":"arxiv-2409.11638","DOIUrl":null,"url":null,"abstract":"This study presents BanStereoSet, a dataset designed to evaluate\nstereotypical social biases in multilingual LLMs for the Bangla language. In an\neffort to extend the focus of bias research beyond English-centric datasets, we\nhave localized the content from the StereoSet, IndiBias, and Kamruzzaman et.\nal.'s datasets, producing a resource tailored to capture biases prevalent\nwithin the Bangla-speaking community. Our BanStereoSet dataset consists of\n1,194 sentences spanning 9 categories of bias: race, profession, gender,\nageism, beauty, beauty in profession, region, caste, and religion. This dataset\nnot only serves as a crucial tool for measuring bias in multilingual LLMs but\nalso facilitates the exploration of stereotypical bias across different social\ncategories, potentially guiding the development of more equitable language\ntechnologies in Bangladeshi contexts. Our analysis of several language models\nusing this dataset indicates significant biases, reinforcing the necessity for\nculturally and linguistically adapted datasets to develop more equitable\nlanguage technologies.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"54 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11638","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This study presents BanStereoSet, a dataset designed to evaluate stereotypical social biases in multilingual LLMs for the Bangla language. In an effort to extend the focus of bias research beyond English-centric datasets, we have localized the content from the StereoSet, IndiBias, and Kamruzzaman et. al.'s datasets, producing a resource tailored to capture biases prevalent within the Bangla-speaking community. Our BanStereoSet dataset consists of 1,194 sentences spanning 9 categories of bias: race, profession, gender, ageism, beauty, beauty in profession, region, caste, and religion. This dataset not only serves as a crucial tool for measuring bias in multilingual LLMs but also facilitates the exploration of stereotypical bias across different social categories, potentially guiding the development of more equitable language technologies in Bangladeshi contexts. Our analysis of several language models using this dataset indicates significant biases, reinforcing the necessity for culturally and linguistically adapted datasets to develop more equitable language technologies.
BanStereoSet:测量孟加拉语词典中陈规定型社会偏见的数据集
本研究介绍的 BanStereoSet 是一个数据集,旨在评估孟加拉语多语种 LLM 中的社会偏见。为了将偏见研究的重点扩展到以英语为中心的数据集之外,我们对 StereoSet、IndiBias 和 Kamruzzaman 等人的数据集中的内容进行了本地化,生成了一个专门用于捕捉孟加拉语社区中普遍存在的偏见的资源。我们的 BanStereoSet 数据集包含 1194 个句子,涵盖 9 个偏见类别:种族、职业、性别、年龄歧视、美貌、职业中的美貌、地区、种姓和宗教。该数据集不仅是测量多语言 LLM 中偏见的重要工具,还有助于探索不同社会类别中的刻板偏见,从而为在孟加拉国环境中开发更公平的语言技术提供潜在指导。我们对使用该数据集的几个语言模型进行的分析表明,这些模型存在明显的偏差,这就更加说明,要开发更加公平的语言技术,就必须建立适应文化和语言的数据集。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信