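How accurate are gender detection tools in predicting the gender for Chinese names? A study with 20,000 given names in Pinyin format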

P. Sebo
Journal: Journal of the Medical Library Association : JMLA
DOI: 10.5195/jmla.2022.1289 (https://doi.org/10.5195/jmla.2022.1289)
Publication date: 2021-10-17
Publication type: Journal Article
Citations: 6

Abstract

Objective: We recently showed that the gender detection tools NamSor, Gender API, and Wiki-Gendersort accurately predicted the gender of individuals with Western given names. Here, we aimed to evaluate the performance of these tools with Chinese given names in Pinyin format. Methods: We constructed two datasets for the purpose of the study. File #1 was created by randomly drawing 20,000 names from a gender-labeled database of 52,414 Chinese given names in Pinyin format. File #2, which contained 9,077 names, was created by removing from File #1 all unisex names that we were able to identify (i.e., those that were listed in the database as both male and female names). We recorded for both files the number of correct classifications (correct gender assigned to a name), misclassifications (wrong gender assigned to a name), and nonclassifications (no gender assigned). We then calculated the proportion of misclassifications and nonclassifications (errorCoded). Results: For File #1, errorCoded was 53% for NamSor, 65% for Gender API, and 90% for Wiki-Gendersort. For File #2, errorCoded was 43% for NamSor, 66% for Gender API, and 94% for Wiki-Gendersort. Conclusion: We found that all three gender detection tools inaccurately predicted the gender of individuals with Chinese given names in Pinyin format and therefore should not be used in this population.
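The two computations described in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names `error_coded` and `drop_unisex`, the use of `None` for a nonclassification, the `(name, gender)` record layout, and the sample names are all assumptions made for the example. It also assumes that errorCoded is computed over all names in a file, which is how the abstract's wording reads.

```python
from collections import defaultdict


def error_coded(true_genders, predictions):
    """Proportion of names that were misclassified or left unclassified.

    `predictions` holds the tool's output per name; None stands for a
    nonclassification (assumed encoding, not taken from the paper).
    """
    if len(true_genders) != len(predictions):
        raise ValueError("inputs must have the same length")
    if not true_genders:
        raise ValueError("inputs must not be empty")

    misclassified = 0
    nonclassified = 0
    for truth, predicted in zip(true_genders, predictions):
        if predicted is None:
            nonclassified += 1
        elif predicted != truth:
            misclassified += 1

    return (misclassified + nonclassified) / len(true_genders)


def drop_unisex(records):
    """Keep only names that appear with a single gender label.

    `records` is a list of (name, gender) pairs; a name listed as both
    male and female is treated as unisex and removed entirely.
    """
    genders_by_name = defaultdict(set)
    for name, gender in records:
        genders_by_name[name].add(gender)
    return [(n, g) for n, g in records if len(genders_by_name[n]) == 1]


# Illustrative data only (not from the study).
sample = [("wei", "male"), ("wei", "female"), ("fang", "female")]
filtered = drop_unisex(sample)
rate = error_coded(["male", "female", "female", "male"],
                   ["male", "male", None, "male"])
```

In this toy example, one wrong prediction and one missing prediction out of four names give an errorCoded of 0.5, and the name listed under both genders is dropped before evaluation, mirroring how File #2 was derived from File #1.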