Topic Words-Based Multilingual Hateful Linguistic Resources Construction for Developing Multilingual Hateful Content Detection Model Using Deep Learning Technique

IF 2.6 | Zone 4 (Computer Science) | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

IET Information Security Pub Date : 2025-04-10 DOI:10.1049/ise2/6068177

Naol Bakala Defersha, Kula Kekeba Tune, Solomon Teferra Abate


Citation count: 0

Abstract

Social media platforms now provide space for communicating and sharing resources in many natural languages across different cultural and multilingual settings. Although this interconnectedness offers numerous benefits, it also exposes users to the risk of encountering offensive (OFFN) and harmful content, including hateful speech. Models for detecting hateful content in resource-rich languages have been built using lexicons, word embeddings, topic modeling, and transformer language models. Low-resource languages, including Ethiopian languages, suffer from a lack of such linguistic resources. Multilingual hateful content detection also poses complex challenges because of cultural and linguistic variety. This paper proposes a multilingual hateful content identification model that uses a transformer language model and hybrid lexicon techniques to enhance hateful content recognition in low-resource Ethiopian languages. First, hateful content disseminated on Facebook in the target Ethiopian languages was categorized as insult, identity hate, antagonistic, or threat using topic modeling techniques. Then, we compiled hateful terms from sources such as guidelines and proclamations related to the Ethiopian context. We created Ethiopian context-based transformer language models. We used topic words-based datasets to construct pretrained transformer language models and multilingual lexicons of major Ethiopian languages. Finally, we compared their performance by integrating them into a deep learning-based hateful content detection framework for low-resource Ethiopian languages. Among the deep learning algorithms applied with Ethiopian language linguistic resources, word2vec-based multilingual lexicons with a convolutional neural network (CNN) outperformed the others. The results indicate that topic words-based multilingual word2vec lexicons outperformed the transformer language models based on topic modeling for low-resource Ethiopian languages, yielding a promising hate speech (HATE) detection approach for these languages.
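The lexicon side of the approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: the language keys (`lang_1`, `lang_2`), the placeholder terms, and the helper names `tag_post` and `is_flagged` are all hypothetical, and the word2vec and CNN stages are omitted. Only the four category names come from the abstract.

```python
# Minimal sketch of category tagging with a multilingual lexicon.
# Assumption: every name and entry below is an illustrative placeholder.

# The four hateful-content categories named in the abstract.
CATEGORIES = ("insult", "identity_hate", "antagonistic", "threat")

# Hypothetical lexicon: language key -> term -> category.
# Real entries would come from topic words and compiled term lists.
LEXICON = {
    "lang_1": {"term_a1": "insult", "term_a2": "threat"},
    "lang_2": {"term_o1": "identity_hate", "term_o2": "antagonistic"},
}


def tag_post(tokens, lexicon=LEXICON):
    """Count lexicon hits per category, checking every language's terms."""
    counts = {category: 0 for category in CATEGORIES}
    for token in tokens:
        key = token.lower()  # case-insensitive lookup
        for terms in lexicon.values():
            category = terms.get(key)
            if category is not None:
                counts[category] += 1
    return counts


def is_flagged(tokens, threshold=1, lexicon=LEXICON):
    """Flag a post once its total lexicon hits reach the threshold."""
    return sum(tag_post(tokens, lexicon).values()) >= threshold


if __name__ == "__main__":
    print(tag_post(["Term_A1", "hello", "term_o2"]))
    print(is_flagged(["hello"]))
```

In a fuller pipeline, counts like these would probably serve as features alongside word embeddings fed to a classifier such as a CNN; this sketch covers only the lookup step.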




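Source journal: IET Information Security (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 3.80

Self-citation rate: 7.10%

Number of articles published:

Review time: 8.6 months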

Journal description: IET Information Security publishes original research papers in the following areas of information security and cryptography. Submitting authors should specify clearly in their covering statement the area into which their paper falls. Scope: Access Control and Database Security; Ad-Hoc Network Aspects; Anonymity and E-Voting; Authentication; Block Ciphers and Hash Functions; Blockchain, Bitcoin (Technical aspects only); Broadcast Encryption and Traitor Tracing; Combinatorial Aspects; Covert Channels and Information Flow; Critical Infrastructures; Cryptanalysis; Dependability; Digital Rights Management; Digital Signature Schemes; Digital Steganography; Economic Aspects of Information Security; Elliptic Curve Cryptography and Number Theory; Embedded Systems Aspects; Embedded Systems Security and Forensics; Financial Cryptography; Firewall Security; Formal Methods and Security Verification; Human Aspects; Information Warfare and Survivability; Intrusion Detection; Java and XML Security; Key Distribution; Key Management; Malware; Multi-Party Computation and Threshold Cryptography; Peer-to-peer Security; PKIs; Public-Key and Hybrid Encryption; Quantum Cryptography; Risks of using Computers; Robust Networks; Secret Sharing; Secure Electronic Commerce; Software Obfuscation; Stream Ciphers; Trust Models; Watermarking and Fingerprinting; Special Issues. Current Call for Papers: Security on Mobile and IoT devices - https://digital-library.theiet.org/files/IET_IFS_SMID_CFP.pdf