Mahammed Kamruzzaman, Abdullah Al Monsur, Shrabon Das, Enamul Hassan, Gene Louis Kim
{"title":"BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla","authors":"Mahammed Kamruzzaman, Abdullah Al Monsur, Shrabon Das, Enamul Hassan, Gene Louis Kim","doi":"arxiv-2409.11638","DOIUrl":null,"url":null,"abstract":"This study presents BanStereoSet, a dataset designed to evaluate\nstereotypical social biases in multilingual LLMs for the Bangla language. In an\neffort to extend the focus of bias research beyond English-centric datasets, we\nhave localized the content from the StereoSet, IndiBias, and Kamruzzaman et.\nal.'s datasets, producing a resource tailored to capture biases prevalent\nwithin the Bangla-speaking community. Our BanStereoSet dataset consists of\n1,194 sentences spanning 9 categories of bias: race, profession, gender,\nageism, beauty, beauty in profession, region, caste, and religion. This dataset\nnot only serves as a crucial tool for measuring bias in multilingual LLMs but\nalso facilitates the exploration of stereotypical bias across different social\ncategories, potentially guiding the development of more equitable language\ntechnologies in Bangladeshi contexts. Our analysis of several language models\nusing this dataset indicates significant biases, reinforcing the necessity for\nculturally and linguistically adapted datasets to develop more equitable\nlanguage technologies.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"54 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11638","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
This study presents BanStereoSet, a dataset designed to evaluate
stereotypical social biases in multilingual LLMs for the Bangla language. In an
effort to extend the focus of bias research beyond English-centric datasets, we
have localized the content from the StereoSet, IndiBias, and Kamruzzaman et.
al.'s datasets, producing a resource tailored to capture biases prevalent
within the Bangla-speaking community. Our BanStereoSet dataset consists of
1,194 sentences spanning 9 categories of bias: race, profession, gender,
ageism, beauty, beauty in profession, region, caste, and religion. This dataset
not only serves as a crucial tool for measuring bias in multilingual LLMs but
also facilitates the exploration of stereotypical bias across different social
categories, potentially guiding the development of more equitable language
technologies in Bangladeshi contexts. Our analysis of several language models
using this dataset indicates significant biases, reinforcing the necessity for
culturally and linguistically adapted datasets to develop more equitable
language technologies.