SWSR: A Chinese dataset and lexicon for online sexism detection

Q1 Social Sciences

Online Social Networks and Media, Pub Date: 2022-01-01, DOI: 10.1016/j.osnem.2021.100182

Aiqi Jiang, Xiaohan Yang, Yang Liu, Arkaitz Zubiaga

Source: https://www.sciencedirect.com/science/article/pii/S2468696421000604

Citations: 31

Abstract

Online sexism is a growing concern on social media platforms: it undermines the healthy development of the Internet and can have negative effects on society. While research on sexism detection is growing, most of it focuses on English as the language and Twitter as the platform. Our objective is to broaden the scope of this research by considering the Chinese language on Sina Weibo. We propose the first Chinese sexism dataset, the Sina Weibo Sexism Review (SWSR) dataset, as well as SexHateLex, a large Chinese lexicon of abusive and gender-related terms. We describe our data collection and annotation process, and provide an exploratory analysis of the dataset's characteristics to validate its quality and to show how sexism manifests in Chinese. The SWSR dataset provides labels at different levels of granularity, including (i) sexism or non-sexism, (ii) sexism category and (iii) target type, which can be exploited, among other uses, to build computational methods that identify and investigate finer-grained gender-related abusive language. We conduct experiments on the three sexism classification tasks using state-of-the-art machine learning models. Our results show competitive performance, providing a benchmark for sexism detection in Chinese, along with an error analysis that highlights open challenges needing more research in Chinese NLP. The SWSR dataset and SexHateLex lexicon are publicly available.
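The three label levels named in the abstract can be pictured as a small nested record. The following is only a minimal sketch under stated assumptions: the class name `SexismLabel`, the helper `task_labels`, the field names, and the rule that category and target apply only to sexist items are all hypothetical and are not taken from the released dataset. The abstract also does not list the category or target values, so the example uses placeholder strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SexismLabel:
    """Hypothetical three-level annotation for one Weibo comment."""
    is_sexist: bool                 # level (i): sexism or non-sexism
    category: Optional[str] = None  # level (ii): sexism category
    target: Optional[str] = None    # level (iii): target type

    def __post_init__(self):
        # Assumption: finer-grained labels only apply to sexist items.
        if not self.is_sexist and (self.category or self.target):
            raise ValueError("non-sexist items carry no category or target")


def task_labels(labels, task):
    """Return the label list for one of the three classification tasks."""
    if task == "binary":
        return [int(label.is_sexist) for label in labels]
    if task in ("category", "target"):
        # Assumption: the finer-grained tasks cover sexist items only.
        return [getattr(label, task) for label in labels if label.is_sexist]
    raise ValueError(f"unknown task: {task}")


# Placeholder values; the real label names are not given in the abstract.
examples = [
    SexismLabel(True, category="category_a", target="target_x"),
    SexismLabel(False),
]
```

Calling `task_labels(examples, "binary")` yields one label per comment, while the `"category"` and `"target"` tasks return labels only for the comments marked sexist, which is one plausible way to feed the three classification tasks from a single annotation record.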




Source journal

Online Social Networks and Media Social Sciences-Communication

CiteScore

10.60

Self-citation rate

0.00%

Articles published

Review time

44 days