A multilingual, multimodal dataset of aggression and bias: the ComMA dataset

IF 1.8 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Language Resources and Evaluation Pub Date : 2023-11-16 DOI:10.1007/s10579-023-09696-7

Ritesh Kumar, Shyam Ratan, Siddharth Singh, Enakshi Nandi, Laishram Niranjana Devi, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Akanksha Bansal


Citations: 0

Abstract



In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context” in which they occur. Here, the context is defined by the conversational thread in which a specific comment occurs and by the “type” of discursive role that the comment performs with respect to the previous comment(s). The dataset was developed as part of the ComMA Project and consists of 57,363 annotated comments, 1,142 annotated memes, and around 70 hours of annotated audio (extracted from videos) in four languages: Meitei, Bangla, Hindi, and Indian English. The data was collected from social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media, a large number of these comments are multilingual, and many are code-mixed with English. This paper describes in detail the tagset developed during the project and elaborates on the process of developing and using a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We define and discuss the tags used to mark the different discursive roles performed through the comments, such as attack and defend. We also present a statistical analysis of the dataset and the results of our baseline experiments on developing an automatic aggression identification system with it. Based on these results, we argue that the dataset provides diverse and ‘hard’ sets of instances, which makes it a good resource for training and testing new techniques for aggressive and abusive language classification.
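As a rough illustration of what a multi-label, thread-aware annotation of this kind might look like in code, the sketch below models one annotated comment. It is not the ComMA release format: the class and field names (`AnnotatedComment`, `parent_id`, `aggressive`, `thread_context`) are hypothetical, the aggression level is reduced to a placeholder boolean because the abstract does not name the aggression tags, and only the bias types and discursive roles mentioned in the abstract are listed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set


class Bias(Enum):
    # The four bias types named in the abstract.
    GENDER = "gender"
    COMMUNAL = "communal"  # religious intolerance, per the abstract
    CLASS_CASTE = "class/caste"
    ETHNIC_RACIAL = "ethnic/racial"


class Role(Enum):
    # Only the two discursive roles the abstract names; the real tagset has more.
    ATTACK = "attack"
    DEFEND = "defend"


@dataclass
class AnnotatedComment:
    # Hypothetical record layout, chosen for illustration only.
    comment_id: str
    text: str
    parent_id: Optional[str] = None  # previous comment in the thread, if any
    aggressive: bool = False  # placeholder for the finer-grained aggression tags
    biases: Set[Bias] = field(default_factory=set)  # multi-label: zero or more
    role: Optional[Role] = None  # role relative to the previous comment(s)


def thread_context(
    comment: AnnotatedComment, by_id: Dict[str, AnnotatedComment]
) -> List[AnnotatedComment]:
    """Return the comment's ancestors in the thread, oldest first."""
    chain: List[AnnotatedComment] = []
    seen: Set[str] = set()  # guards against accidental cycles in parent links
    current = by_id.get(comment.parent_id) if comment.parent_id else None
    while current is not None and current.comment_id not in seen:
        seen.add(current.comment_id)
        chain.append(current)
        current = by_id.get(current.parent_id) if current.parent_id else None
    chain.reverse()
    return chain


if __name__ == "__main__":
    # Placeholder texts; these are not examples from the dataset.
    root = AnnotatedComment("c1", "<original post>")
    reply = AnnotatedComment(
        "c2", "<reply>", parent_id="c1", aggressive=True,
        biases={Bias.GENDER, Bias.COMMUNAL}, role=Role.ATTACK,
    )
    by_id = {c.comment_id: c for c in (root, reply)}
    print([c.comment_id for c in thread_context(reply, by_id)])
```

Keeping the bias labels in a set reflects the multi-label property the abstract describes, where a single comment can carry several bias tags at once, while the parent link lets the conversational context be recovered on demand.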


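Source journal

Language Resources and Evaluation (Engineering & Technology - Computer Science: Interdisciplinary Applications)

CiteScore: 6.50

Self-citation rate: 3.70%

Articles published: (not listed)

Review time: >12 weeks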

About the journal: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.