Gender Classification with Data Independent Features in Multiple Languages
T. Isbister, Lisa Kaati, Katie Cohen
2017 European Intelligence and Security Informatics Conference (EISIC), 2017-09-01
DOI: 10.1109/EISIC.2017.16 (https://doi.org/10.1109/EISIC.2017.16)
Gender classification is a well-researched problem, and state-of-the-art implementations achieve an accuracy of over 85%. However, most previous work has focused on texts written in English, and in many cases the results cannot be transferred to other datasets because the features used to train the machine learning models depend on the data. In this work, we investigate the possibility of classifying the gender of an author in five different languages: English, Swedish, French, Spanish, and Russian. We use features from the word counting program Linguistic Inquiry and Word Count (LIWC), which have the benefit of being independent of the dataset. Our results show that by using machine learning with features from LIWC, we can obtain an accuracy between 73% and 79%, depending on the language. We also show some interesting differences between the genders in the use of certain categories across languages.
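
To make the idea of dataset-independent features concrete, the following is a minimal sketch of dictionary-based category counting followed by a simple classifier. It is not the authors' implementation: the lexicon `TOY_LEXICON`, its category names, the helper functions, and the labels `class_a` and `class_b` are all hypothetical, the real LIWC dictionaries are proprietary and much larger, and the abstract does not say which learning algorithm was used, so a nearest-centroid rule stands in here purely for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (assumption); not the actual LIWC categories or word lists.
TOY_LEXICON = {
    "pronoun": {"i", "you", "we", "she", "he", "they"},
    "posemo": {"happy", "love", "great"},
    "negemo": {"sad", "hate", "angry"},
}


def category_features(text, lexicon=TOY_LEXICON):
    """Return the share of tokens in each category, ordered by sorted category name."""
    # Match runs of Unicode letters so non-Latin scripts are tokenized as well.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return [counts[category] / total for category in sorted(lexicon)]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(texts, labels):
    """Build one centroid per label from the category features of its texts."""
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(category_features(text))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model, text):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = category_features(text)
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])),
    )


if __name__ == "__main__":
    # Made-up toy texts with placeholder labels; not data from the paper.
    model = train(["I love you", "we hate sad"], ["class_a", "class_b"])
    print(predict(model, "they love great things"))
```

Because the feature vector depends only on the fixed lexicon and not on the vocabulary of the training corpus, the same feature definition can be applied unchanged to a different dataset; supporting another language would require a lexicon for that language.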