Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning

IF 1.7 · CAS Tier 3 (Computer Science) · Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Kiran Babu Nelatoori, Hima Bindu Kommanti
{"title":"Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning","authors":"Kiran Babu Nelatoori, Hima Bindu Kommanti","doi":"10.1007/s10579-023-09708-6","DOIUrl":null,"url":null,"abstract":"<p>Detecting toxic comments and rationale for the offensiveness of a social media post promotes moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model through transfer learning for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed using the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, one for Hinglish code-mixed text does not exist as of today. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with HingRoBERTa encoder for both tasks are significantly higher than the baseline models. <i>Caution:</i> This paper may contain words disturbing to some readers.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"27 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-023-09708-6","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Detecting toxic comments and the rationale for the offensiveness of a social media post supports the moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model, built through transfer learning, for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification form a strong multi-task learning objective. A task collaboration module is designed to exploit bi-directional attention between the classification and span prediction tasks, and the combined loss function of the model is constructed from the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, none exists for Hinglish code-mixed text as of today; hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. On both tasks, the F1 scores of the proposed CA-MTL model with the HingRoBERTa encoder are significantly higher than those of the baseline models. Caution: this paper may contain words disturbing to some readers.
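
The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the co-attentive multi-task idea: a shared encoder, one attention pass in each direction between the classification and span views, and a weighted sum of the two task losses. Every name here (the class, the use of nn.MultiheadAttention for the bi-directional attention, the weighting lam) is an illustrative assumption, not the authors' published implementation; the encoder argument stands in for a pretrained Hinglish BERT variant such as HingRoBERTa.

```python
import torch.nn as nn

class CoAttentiveMTL(nn.Module):
    """Hypothetical sketch: toxic-comment classification plus toxic-span
    tagging, with bi-directional (co-)attention between the two tasks."""

    def __init__(self, encoder, hidden=768, num_labels=2, num_tags=2):
        super().__init__()
        self.encoder = encoder  # pretrained transformer, e.g. a Hinglish BERT variant
        # One attention pass per direction between the task views.
        self.span_to_cls = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.cls_to_span = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(hidden, num_labels)  # toxic vs. non-toxic
        self.span_head = nn.Linear(hidden, num_tags)   # per-token span tags

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) token states; HuggingFace-style output assumed.
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_view = h[:, :1, :]  # sentence view: first ([CLS]-style) token
        # The classification view attends over the tokens, and vice versa.
        cls_ctx, _ = self.span_to_cls(cls_view, h, h)
        tok_ctx, _ = self.cls_to_span(h, cls_view, cls_view)
        return self.cls_head(cls_ctx.squeeze(1)), self.span_head(tok_ctx)

def combined_loss(cls_logits, span_logits, cls_labels, span_labels, lam=0.5):
    """Weighted sum of the per-task losses, as the abstract describes."""
    ce = nn.CrossEntropyLoss()
    num_tags = span_logits.size(-1)
    return lam * ce(cls_logits, cls_labels) + (1 - lam) * ce(
        span_logits.reshape(-1, num_tags), span_labels.reshape(-1)
    )
```

The combined objective mirrors the abstract's description of building one loss from the two individual task losses; the 0.5/0.5 weighting is an assumption, and in practice lam would be tuned on validation data.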

Source journal

Language Resources and Evaluation (Computer Science: Interdisciplinary Applications)

CiteScore: 6.50
Self-citation rate: 3.70%
Articles published: 55
Review time: >12 weeks
About the journal: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.