A Data Viz Platform as a Support to Study, Analyze and Understand the Hate Speech Phenomenon

A. Capozzi, V. Patti, G. Ruffo, C. Bosco
Proceedings of the 2nd International Conference on Web Studies, published 2018-10-03
DOI: 10.1145/3240431.3240437 (https://doi.org/10.1145/3240431.3240437)
Citations: 4

Abstract

In this paper we present a data visualization platform designed to support Natural Language Processing (NLP) scholars in studying and analyzing corpora collected to understand the hate speech phenomenon in social media. The project started with the creation of a corpus of tweets addressed to specific ethnic minority groups that are considered very controversial in the Italian public debate. Each tweet has been manually tagged with a series of attributes in order to capture the different features used to characterize the hate speech phenomenon. This corpus is mainly built for training an automatic classifier and for supporting its testing and validation, before the classifier is adopted to detect hate speech tweets in larger-scale datasets. Unlike in many other traditional machine learning tasks, building a good classifier that achieves high accuracy is very challenging in this scenario, because of the intrinsic ambiguity of language, the lack of a proper and explicable context in social media, and the tendency of online users to be sarcastic and ironic. Therefore, in order to properly validate an effective feature selection process, correlations between the selected attributes must be studied and analyzed. This motivated us to build an interactive platform for exploring the data in our corpora across the dimensions used to characterize the collected tweets. In this paper, after a brief introduction to the hate speech dataset, we show how the dashboard fits into the NLP pipeline and how its architecture can be structured. Finally, we present some of the challenges we faced in visualizing data with spatial, temporal and numerical attributes.