Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia

2019 Systems and Information Engineering Design Symposium (SIEDS). Pub Date: 2019-04-26. DOI: 10.1109/SIEDS.2019.8735592

Charu Rawat, Arnab Sarkar, Sameer Singh, Raf Alvarado, Lane Rasberry


Citations: 8

Abstract

Today's digital landscape is characterized by the pervasive presence of online communities. One of the persistent challenges to the ideal of free-flowing discourse in these communities has been online abuse. Wikipedia is a case in point: its large community of contributors has experienced online abuse ranging from hateful speech to personal attacks to spam. Currently, Wikipedia relies on a human-driven process to identify online abuse. In this paper, we propose a framework to understand and detect such abuse in the English Wikipedia community. We analyze the publicly available data sources provided by Wikipedia. We find that Wikipedia's XML dumps require extensive computing power to be used for temporal textual analysis. As an alternative, we propose a web scraping methodology to extract user-level data, and we perform extensive exploratory data analysis to understand the characteristics of users who have been blocked for abusive behavior in the past. With these data, we develop an abuse detection model that leverages Natural Language Processing techniques, such as character and word n-grams, sentiment analysis, and topic modeling, to generate features that serve as inputs to machine learning algorithms that predict abusive behavior. Our best abuse detection model, using an XGBoost classifier, achieves an AUC of ∼84%.
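As a rough illustration of two ingredients named in the abstract, the sketch below counts character and word n-grams and computes AUC from labels and scores. It is a minimal standard-library example and not the authors' pipeline: the function names `char_ngrams`, `word_ngrams`, and `auc` are hypothetical, the whitespace tokenization is an assumption, and the sentiment, topic-modeling, and XGBoost steps are not covered.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams (assumes simple whitespace tokenization)."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))
```

In a fuller pipeline, the n-gram counts would typically be turned into a feature matrix and passed to a classifier, and `auc` would be applied to the classifier's predicted scores on held-out data.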
