Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia

Charu Rawat, Arnab Sarkar, Sameer Singh, Raf Alvarado, Lane Rasberry
Published in: 2019 Systems and Information Engineering Design Symposium (SIEDS)
Publication date: 2019-04-26
DOI: 10.1109/SIEDS.2019.8735592 (https://doi.org/10.1109/SIEDS.2019.8735592)
Citations: 8

Abstract

Today's digital landscape is characterized by the pervasive presence of online communities. One of the persistent challenges to the ideal of free-flowing discourse in these communities has been online abuse. Wikipedia is a case in point, as its large community of contributors has experienced the perils of online abuse ranging from hateful speech to personal attacks to spam. Currently, Wikipedia has a human-driven process in place to identify online abuse. In this paper, we propose a framework to understand and detect such abuse in the English Wikipedia community. We analyze the publicly available data sources provided by Wikipedia. We discover that Wikipedia's XML dumps require extensive computing power to be used for temporal textual analysis. As an alternative, we propose a web scraping methodology to extract user-level data, and we perform extensive exploratory data analysis to understand the characteristics of users who have been blocked for abusive behavior in the past. With these data, we develop an abuse detection model that leverages Natural Language Processing techniques, such as character and word n-grams, sentiment analysis, and topic modeling, to generate features that serve as inputs to a machine learning model that predicts abusive behavior. Our best abuse detection model, using an XGBoost classifier, achieves an AUC of ∼84%.
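
To make two terms from the abstract concrete, the following is a minimal sketch of character and word n-gram feature extraction and of how an AUC score can be computed from model scores. It is not the authors' pipeline: the function names, the n-gram sizes, and the whitespace tokenization are assumptions chosen for illustration, and the XGBoost classifier, sentiment analysis, and topic modeling steps are left out.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return overlapping word n-grams, splitting on whitespace (an assumption)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_features(text, char_n=3, word_n=2):
    """Count character and word n-grams in one sparse feature dictionary.

    The default sizes (3 and 2) are illustrative, not taken from the paper.
    """
    features = Counter()
    for gram in char_ngrams(text, char_n):
        features["char:" + gram] += 1
    for gram in word_ngrams(text, word_n):
        features["word:" + gram] += 1
    return features


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (abusive) or 0 (not abusive).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

In a full system, the feature dictionaries would be vectorized and passed to a classifier, and `roc_auc` would be applied to that classifier's predicted scores on held-out users. An AUC of 0.84 would mean that a randomly chosen abusive example is scored above a randomly chosen non-abusive one about 84% of the time.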