A Large Labeled Corpus for Online Harassment Research

Proceedings of the 2017 ACM on Web Science Conference. Pub Date: 2017-06-25. DOI: 10.1145/3091478.3091509

J. Golbeck, Zahra Ashktorab, Rashad O. Banjo, Alexandra Berlinger, Siddharth Bhagwan, C. Buntain, Paul Cheakalos, Alicia A. Geller, Quint Gergory, R. Gnanasekaran, Raja Rajan Gunasekaran, K. Hoffman, Jenny Hottle, Vichita Jienjitlert, Shivika Khare, Ryan Lau, Marianna J. Martindale, Shalmali Naik, Heather L. Nixon, P. Ramachandran, Kristine M. Rogers, Lisa Rogers, Meghna Sardana Sarin, Gaurav Shahane, Jayanee Thanki, Priyanka Vengataraman, Zijian Wan, D. Wu

Citations: 170

Abstract

A fundamental part of conducting cross-disciplinary web science research is having useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. Our resulting dataset has roughly 15% positive harassment examples and 85% negative examples. This data is useful for training machine learning models, for identifying textual and linguistic features of online harassment, and for studying the nature of harassing comments and the culture of trolling.
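As a rough illustration of what the stated class split means for model training, the sketch below turns the abstract's figures into approximate class counts and inverse-frequency class weights. The counts are estimates, since the abstract says only "roughly 15%". The label names and both helper functions are hypothetical, and the weighting scheme is a common heuristic, not something the paper describes.

```python
# Approximate class sizes implied by the abstract's figures.
TOTAL_TWEETS = 35_000   # stated in the abstract
POSITIVE_SHARE = 0.15   # "roughly 15%", so the counts below are estimates


def approximate_counts(total, positive_share):
    """Split a total into estimated positive and negative class counts."""
    positive = round(total * positive_share)
    # Label names are hypothetical, chosen for illustration only.
    return {"harassment": positive, "non_harassment": total - positive}


def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * n_in_class)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


counts = approximate_counts(TOTAL_TWEETS, POSITIVE_SHARE)
weights = balanced_class_weights(counts)
print(counts)
print(weights)
```

Under these assumptions the minority (harassment) class gets a weight several times larger than the majority class, which is one simple way to keep a classifier from defaulting to the negative label.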

