Gender Classification using Twitter Text Data

2020 31st Irish Signals and Systems Conference (ISSC) Pub Date : 2020-06-01 DOI:10.1109/ISSC49989.2020.9180161

Pradeep Vashisth, Kevin Meehan


Citations: 15

Abstract

Content-sharing websites such as social media platforms have become increasingly popular in many countries across the world. Classifying a person's gender from the short messages posted on these platforms is an interesting research area that could benefit legal investigation, forensics, marketing analysis, advertising and recommendation. This research explores the use of Natural Language Processing (NLP) techniques on tweets in a gender classification system. The investigation compares multiple techniques in this context: Bag of Words (Term Frequency - Inverse Document Frequency), Word Embedding (W2Vec, GloVe) and traditional Machine Learning techniques (Logistic Regression, Support Vector Machine and Naïve Bayes). A new dataset comprising user gender and associated tweets was generated for this study, because no public standard dataset of the required volume was available. The results show that the traditional Bag of Words model did not provide any significant results in classification, whereas the word embedding models performed significantly better across multiple machine learning techniques. The word embedding models were therefore the most effective technique for classifying gender based on Twitter text data.
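To make the Bag of Words (Term Frequency - Inverse Document Frequency) representation named in the abstract concrete, the following is a minimal sketch of how TF-IDF vectors could be computed from short texts. It is not the authors' implementation: the tokenizer, the exact TF and IDF formulas, the function names `tokenize` and `tfidf_vectors`, and the example tweets are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumed weighting: TF is the term count divided by the document length,
    and IDF is log(N / df), where df is the number of documents that
    contain the term. Other TF-IDF variants exist.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = len(toks) or 1  # avoid division by zero for empty texts
        vectors.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, vectors


# Made-up example tweets, for illustration only (not from the paper's dataset).
tweets = [
    "great match tonight",
    "great coffee this morning",
    "match day coffee",
]
vocab, vectors = tfidf_vectors(tweets)
```

Each tweet becomes a fixed-length numeric vector over the vocabulary, which is the kind of input a classifier such as Logistic Regression, a Support Vector Machine or Naïve Bayes would then be trained on together with the gender labels.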
