{"title":"Gender Classification using Twitter Text Data","authors":"Pradeep Vashisth, Kevin Meehan","doi":"10.1109/ISSC49989.2020.9180161","DOIUrl":null,"url":null,"abstract":"Increasingly content sharing websites such as social media have become very popular in many countries across the world. Classifying the gender of a person based on these short messages is an interesting research area that could benefit legal investigation, forensics, marketing analysis, advertising and recommendation. This research will explore the use of Natural Language Processing (NLP) techniques and tweets in a gender classification system. This investigation will compare multiple techniques such as Bag of Words (Term Frequency - Inverse Document Frequency), Word Embedding (W2Vec, GloVe) and traditional Machine Learning techniques (Logistic Regression, Support Vector Machine and Naïve Bayes) in this context. A new dataset has been generated to be used as part of this study comprising of the user gender and associated tweets. This dataset was developed due to the unavailability of any public standard dataset with the volume required to perform this investigation. The results have determined that the traditional Bag of Words model did not provide any significant results in classification. However, word embedding models have significantly performed better using multiple machine learning techniques. Therefore, the word embedding models have been proven to be the most effective technique in classifying gender based on twitter text data.","PeriodicalId":351013,"journal":{"name":"2020 31st Irish Signals and Systems Conference (ISSC)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 31st Irish Signals and Systems Conference (ISSC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSC49989.2020.9180161","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15
Abstract
Increasingly content sharing websites such as social media have become very popular in many countries across the world. Classifying the gender of a person based on these short messages is an interesting research area that could benefit legal investigation, forensics, marketing analysis, advertising and recommendation. This research will explore the use of Natural Language Processing (NLP) techniques and tweets in a gender classification system. This investigation will compare multiple techniques such as Bag of Words (Term Frequency - Inverse Document Frequency), Word Embedding (W2Vec, GloVe) and traditional Machine Learning techniques (Logistic Regression, Support Vector Machine and Naïve Bayes) in this context. A new dataset has been generated to be used as part of this study comprising of the user gender and associated tweets. This dataset was developed due to the unavailability of any public standard dataset with the volume required to perform this investigation. The results have determined that the traditional Bag of Words model did not provide any significant results in classification. However, word embedding models have significantly performed better using multiple machine learning techniques. Therefore, the word embedding models have been proven to be the most effective technique in classifying gender based on twitter text data.