SWSR: A Chinese dataset and lexicon for online sexism detection

JCR quartile: Q1 (Social Sciences)
Aiqi Jiang, Xiaohan Yang, Yang Liu, Arkaitz Zubiaga
Journal: Online Social Networks and Media
Published: 2022-01-01 (Journal Article)
DOI: 10.1016/j.osnem.2021.100182
URL: https://www.sciencedirect.com/science/article/pii/S2468696421000604
Citations: 31

Abstract

Online sexism is a growing concern on social media platforms: it undermines the healthy development of the Internet and can have negative effects on society. While research on sexism detection is growing, most of it focuses on English as the language and Twitter as the platform. Our objective is to broaden the scope of this research by considering the Chinese language on Sina Weibo. We propose the first Chinese sexism dataset, the Sina Weibo Sexism Review (SWSR) dataset, as well as a large Chinese lexicon, SexHateLex, made of abusive and gender-related terms. We introduce our data collection and annotation process, and provide an exploratory analysis of the dataset's characteristics to validate its quality and to show how sexism is manifested in Chinese. The SWSR dataset provides labels at three levels of granularity: (i) sexism or non-sexism, (ii) sexism category and (iii) target type. These labels can be used, among other purposes, to build computational methods that identify and investigate finer-grained gender-related abusive language. We conduct experiments on the three sexism classification tasks using state-of-the-art machine learning models. Our results show competitive performance, providing a benchmark for sexism detection in Chinese, and our error analysis highlights open challenges that need further research in Chinese NLP. The SWSR dataset and SexHateLex lexicon are publicly available.
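As a rough illustration of the three label levels and of how a term lexicon could serve as a trivial baseline, here is a minimal Python sketch. It is not the authors' method or the dataset's actual schema: the record fields, the placeholder label values, and the placeholder lexicon terms are all hypothetical, and the abstract does not describe the released file format. Substring matching is used because Chinese text has no whitespace word boundaries.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class LabeledComment:
    """Hypothetical record mirroring the three label levels in the abstract."""
    text: str
    is_sexist: bool                 # level (i): sexism vs. non-sexism
    category: Optional[str] = None  # level (ii): sexism category (placeholder values)
    target: Optional[str] = None    # level (iii): target type (placeholder values)


def lexicon_hits(text: str, lexicon: Iterable[str]) -> List[str]:
    """Return the lexicon terms that occur in `text` as substrings, sorted."""
    return sorted(term for term in set(lexicon) if term and term in text)


def lexicon_baseline(text: str, lexicon: Iterable[str]) -> bool:
    """Flag a text as potentially sexist if it contains any lexicon term."""
    return bool(lexicon_hits(text, lexicon))


# Placeholder terms standing in for real lexicon entries (assumption).
toy_lexicon = {"termA", "termB"}

# Placeholder records; label values are illustrative only.
samples = [
    LabeledComment("xx termA yy", True, category="category_1", target="target_1"),
    LabeledComment("an unrelated comment", False),
]

# Compare the baseline's guess with the binary gold label for each record.
for sample in samples:
    predicted = lexicon_baseline(sample.text, toy_lexicon)
    print(sample.text, "| gold:", sample.is_sexist, "| predicted:", predicted)
```

A lexicon match of this kind ignores context, so it is best read as a sanity-check baseline next to the machine learning models the abstract mentions rather than as a substitute for them.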

Source journal
Online Social Networks and Media (Social Sciences, Communication)
CiteScore: 10.60
Self-citation rate: 0.00%
Articles published: 32
Review time: 44 days