Leveraging machine translation for cross-lingual fine-grained cyberbullying classification amongst pre-adolescents

IF 1.9 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering Pub Date : 2022-09-07 DOI:10.1017/s1351324922000341

Kanishk Verma, Maja Popovic, Alexandros Poulis, Y. Cherkasova, Cathal Ó hÓbáin, A. Mazzone, Tijana Milosevic, Brian Davis


Citations: 1

Abstract

Cyberbullying is the wilful and repeated infliction of harm on an individual using the Internet and digital technologies. Similar to face-to-face bullying, cyberbullying can be captured formally using the Routine Activities Model (RAM), whereby the potential victim and bully are brought into proximity of one another via interaction on online social networking (OSN) platforms. Although the impact of the COVID-19 (SARS-CoV-2) restrictions on the online presence of minors has yet to be fully grasped, studies have reported that 44% of pre-adolescents encountered more cyberbullying incidents during the COVID-19 lockdown. Transparency reports shared by OSN companies indicate an increase in take-downs of cyberbullying-related comments, posts or content by artificially intelligent moderation tools. However, to detect efficiently and effectively whether a social media post or comment qualifies as cyberbullying, a number of RAM-based factors must be taken into account, including the identification of cyberbullying roles and forms. This demands the acquisition of large amounts of fine-grained annotated data, which is costly and ethically challenging to produce. In addition, where fine-grained datasets do exist, they may be unavailable in the target language. Manual translation is costly; however, state-of-the-art neural machine translation offers a workaround. This study presents a first-of-its-kind experiment in leveraging machine translation to automatically translate a unique pre-adolescent cyberbullying gold standard dataset in Italian with fine-grained annotations into English for training and testing a native binary classifier for pre-adolescent cyberbullying. In addition to contributing a high-quality English reference translation of the source gold standard, our experiments indicate that the performance of our target binary classifier when trained on machine-translated English output is on par with the source (Italian) classifier.
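The experimental design described in the abstract (translate the annotated source data, carry the labels over, train one binary classifier per language, compare scores) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract names neither the translation system nor the classifier, so `toy_mt` is a hypothetical phrase-lookup stand-in for a neural machine translation system, the naive Bayes model is a swapped-in stand-in for the authors' classifier, and the four short sentences are placeholder data, not examples from the gold standard dataset.

```python
from collections import Counter
from math import log
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]  # (text, label): 1 = cyberbullying, 0 = not

# Hypothetical stand-in for an NMT system: a fixed phrase lookup table.
TOY_LEXICON = {
    "sei stupido": "you are stupid",
    "nessuno ti vuole": "nobody wants you",
    "ciao come stai": "hi how are you",
    "bella foto": "nice photo",
}


def toy_mt(text: str) -> str:
    """Placeholder translator; a real setup would call an MT model here."""
    return TOY_LEXICON[text]


def translate_dataset(data: List[Example],
                      translate: Callable[[str], str]) -> List[Example]:
    """Translate each text and carry its annotation over unchanged."""
    return [(translate(text), label) for text, label in data]


def train_nb(data: List[Example]) -> Dict:
    """Fit a multinomial naive Bayes model on whitespace tokens."""
    words = {0: Counter(), 1: Counter()}
    classes = Counter()
    for text, label in data:
        classes[label] += 1
        words[label].update(text.lower().split())
    return {"words": words, "classes": classes,
            "vocab": set(words[0]) | set(words[1])}


def predict_nb(model: Dict, text: str) -> int:
    """Return the label with the higher add-one-smoothed log score."""
    total = sum(model["classes"].values())
    scores = {}
    for label in (0, 1):
        counts = model["words"][label]
        denom = sum(counts.values()) + len(model["vocab"])
        score = log((model["classes"][label] + 1) / (total + 2))
        for tok in text.lower().split():
            if tok in model["vocab"]:  # skip tokens never seen in training
                score += log((counts[tok] + 1) / denom)
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0


def binary_f1(gold: List[int], pred: List[int]) -> float:
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(train: List[Example], test: List[Example]) -> float:
    """Train on one split, predict the other, and score with F1."""
    model = train_nb(train)
    gold = [label for _, label in test]
    pred = [predict_nb(model, text) for text, _ in test]
    return binary_f1(gold, pred)


# Placeholder source-language data (not taken from the paper's dataset).
train_it = [("sei stupido", 1), ("nessuno ti vuole", 1),
            ("ciao come stai", 0), ("bella foto", 0)]
test_it = [("sei stupido", 1), ("bella foto", 0)]

# Same pipeline on the source text and on its machine-translated version.
f1_it = evaluate(train_it, test_it)
f1_en = evaluate(translate_dataset(train_it, toy_mt),
                 translate_dataset(test_it, toy_mt))
print(f"source F1: {f1_it:.2f}, translated F1: {f1_en:.2f}")
```

The point of the sketch is the layout rather than the numbers: labels are attached to the text before translation, so the fine-grained annotations survive without re-annotation, and the two classifiers are scored on parallel test sets so that their results are directly comparable. The toy test items also appear in the training split, so the scores printed here say nothing about real performance.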




Source journal

Natural Language Engineering (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore

5.90

Self-citation rate

12.00%

Publication volume

Review time

>12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.