Enriching Datasets with Demographics through Large Language Models: What's in a Name?

Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat

arXiv:2409.11491 (arXiv - CS - Computation and Language), 2024-09-17
Enriching datasets with demographic information, such as gender, race, and age from names, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong, and critically assess the inherent demographic biases in these models. Our work not only advances the state-of-the-art in demographic enrichment but also opens avenues for future research in mitigating biases in LLMs.
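
The abstract describes zero-shot LLM prediction of demographic attributes from names but does not specify a prompt, model, or output format. The sketch below illustrates what such an enrichment step can look like in practice; the OpenAI v1-style Python client, the "gpt-4o-mini" model name, the prompt wording, and the JSON output schema are all illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch of zero-shot demographic enrichment from a name with an LLM.
# Assumptions (not from the paper): OpenAI v1-style client, placeholder model
# name, and an ad hoc prompt asking for a small JSON object.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Given only the personal name below, infer the most likely gender "
    "(male/female/unknown) and a broad ethnicity group. Respond with a JSON "
    'object of the form {{"gender": ..., "ethnicity": ...}}.\n\nName: {name}'
)


def enrich_name(name: str, model: str = "gpt-4o-mini") -> dict:
    """Zero-shot prediction of demographic attributes for a single name."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(name=name)}],
        temperature=0,  # reduce sampling variance for more reproducible labels
    )
    # The model is asked to return JSON; parsing can still fail, so guard it.
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return {"gender": "unknown", "ethnicity": "unknown"}


if __name__ == "__main__":
    print(enrich_name("Gautier Marti"))
```

Enriching a whole dataset, such as the unlabelled Hong Kong registry mentioned in the abstract, would amount to applying this function row by row, while bias assessment would compare the predicted attributes against any available ground truth across demographic groups.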