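How accurate are gender detection tools in predicting the gender for Chinese names? A study with 20,000 given names in Pinyin format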

Journal of the Medical Library Association : JMLA. Pub Date: 2021-10-17. DOI: 10.5195/jmla.2022.1289

P. Sebo

Citations: 6

Abstract

Objective: We recently showed that the gender detection tools NamSor, Gender API, and Wiki-Gendersort accurately predicted the gender of individuals with Western given names. Here, we aimed to evaluate the performance of these tools with Chinese given names in Pinyin format.

Methods: We constructed two datasets for the purpose of the study. File #1 was created by randomly drawing 20,000 names from a gender-labeled database of 52,414 Chinese given names in Pinyin format. File #2, which contained 9,077 names, was created by removing from File #1 all unisex names that we were able to identify (i.e., those that were listed in the database as both male and female names). We recorded for both files the number of correct classifications (correct gender assigned to a name), misclassifications (wrong gender assigned to a name), and nonclassifications (no gender assigned). We then calculated the proportion of misclassifications and nonclassifications (errorCoded).

Results: For File #1, errorCoded was 53% for NamSor, 65% for Gender API, and 90% for Wiki-Gendersort. For File #2, errorCoded was 43% for NamSor, 66% for Gender API, and 94% for Wiki-Gendersort.

Conclusion: We found that all three gender detection tools inaccurately predicted the gender of individuals with Chinese given names in Pinyin format and therefore should not be used in this population.
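The dataset construction and the errorCoded metric described in the Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's code: the function names `build_files` and `error_coded`, the `(name, gender)` record layout, and the use of `None` to mark a nonclassification are all hypothetical, and the denominator is assumed to be every name in the file.

```python
import random
from collections import Counter


def build_files(records, sample_size, seed=0):
    """Build File #1 and File #2 from a gender-labeled name database.

    Assumption: `records` is a list of (name, gender) pairs, and a unisex
    name appears once with each gender label.
    """
    rng = random.Random(seed)
    # File #1: a random draw without replacement from the database.
    file1 = rng.sample(records, sample_size)

    # Collect every gender label attached to each name in the database.
    genders_by_name = {}
    for name, gender in records:
        genders_by_name.setdefault(name, set()).add(gender)

    # File #2: File #1 minus the names listed as both male and female.
    file2 = [(n, g) for n, g in file1 if len(genders_by_name[n]) == 1]
    return file1, file2


def error_coded(true_genders, predicted_genders):
    """Proportion of misclassifications plus nonclassifications.

    Assumption: a prediction of None means the tool assigned no gender.
    """
    counts = Counter()
    for truth, pred in zip(true_genders, predicted_genders):
        if pred is None:
            counts["nonclassified"] += 1
        elif pred == truth:
            counts["correct"] += 1
        else:
            counts["misclassified"] += 1

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no names to evaluate")
    return (counts["misclassified"] + counts["nonclassified"]) / total
```

With four names, one of them misclassified and one left unclassified, `error_coded` returns 0.5. Counting nonclassifications as errors is why a tool that often returns no gender can score poorly even when its actual guesses are mostly right.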
