A novel approach to cross-linguistic transfer learning for hope speech detection in Tamil and Malayalam

IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jothi Prakash V., Arul Antran Vijay S.
{"title":"泰米尔语和马拉雅拉姆语希望语音检测的跨语言迁移学习新方法","authors":"Jothi Prakash V.,&nbsp;Arul Antran Vijay S.","doi":"10.1016/j.csl.2025.101870","DOIUrl":null,"url":null,"abstract":"<div><div>In the field of Natural Language Processing (NLP), accurately identifying hope speech in low-resource languages such as Tamil and Malayalam poses significant challenges. This research introduces the Sentimix Transformer (SentT), a novel transformer-based model designed for detecting hope speech in YouTube comments composed in Tamil and Malayalam, two linguistically rich but computationally low-resource languages. The SentT model innovatively combines multilingual BERT (mBERT) embeddings with specialized cultural and code-mixing adaptations to effectively process the linguistic diversity and complexities inherent in code-mixed data. This approach allows SentT to capture nuanced expressions of hope by integrating domain-specific knowledge into the transformer framework. Our methodology extends traditional transformer architectures by incorporating a unique ensemble of embeddings that encapsulate linguistic, cultural, and code-mixing attributes, significantly enhancing the model’s sensitivity to context and cultural idioms. We validate our approach using the Hope Speech dataset for Equality, Diversity, and Inclusion (HopeEDI), which includes diverse comments from social media. The SentT model achieves an impressive accuracy of 93.4%, a precision of 92.7%, and a recall of 94.1% outperforming existing models and demonstrating its efficacy in handling the subtleties of hope speech in multilingual contexts. The model’s architecture and the results of extensive evaluations not only underscore its effectiveness but also its potential as a scalable solution for similar tasks in other low-resource languages. Through this research, we contribute to the broader field of sentiment analysis by demonstrating the potential of tailored, context-aware models in enhancing digital communication’s positivity and inclusiveness.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101870"},"PeriodicalIF":3.4000,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A novel approach to cross-linguistic transfer learning for hope speech detection in Tamil and Malayalam\",\"authors\":\"Jothi Prakash V.,&nbsp;Arul Antran Vijay S.\",\"doi\":\"10.1016/j.csl.2025.101870\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In the field of Natural Language Processing (NLP), accurately identifying hope speech in low-resource languages such as Tamil and Malayalam poses significant challenges. This research introduces the Sentimix Transformer (SentT), a novel transformer-based model designed for detecting hope speech in YouTube comments composed in Tamil and Malayalam, two linguistically rich but computationally low-resource languages. The SentT model innovatively combines multilingual BERT (mBERT) embeddings with specialized cultural and code-mixing adaptations to effectively process the linguistic diversity and complexities inherent in code-mixed data. This approach allows SentT to capture nuanced expressions of hope by integrating domain-specific knowledge into the transformer framework. 
Our methodology extends traditional transformer architectures by incorporating a unique ensemble of embeddings that encapsulate linguistic, cultural, and code-mixing attributes, significantly enhancing the model’s sensitivity to context and cultural idioms. We validate our approach using the Hope Speech dataset for Equality, Diversity, and Inclusion (HopeEDI), which includes diverse comments from social media. The SentT model achieves an impressive accuracy of 93.4%, a precision of 92.7%, and a recall of 94.1% outperforming existing models and demonstrating its efficacy in handling the subtleties of hope speech in multilingual contexts. The model’s architecture and the results of extensive evaluations not only underscore its effectiveness but also its potential as a scalable solution for similar tasks in other low-resource languages. Through this research, we contribute to the broader field of sentiment analysis by demonstrating the potential of tailored, context-aware models in enhancing digital communication’s positivity and inclusiveness.</div></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"96 \",\"pages\":\"Article 101870\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-08-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230825000956\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230825000956","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In the field of Natural Language Processing (NLP), accurately identifying hope speech in low-resource languages such as Tamil and Malayalam poses significant challenges. This research introduces the Sentimix Transformer (SentT), a novel transformer-based model designed for detecting hope speech in YouTube comments composed in Tamil and Malayalam, two linguistically rich but computationally low-resource languages. The SentT model innovatively combines multilingual BERT (mBERT) embeddings with specialized cultural and code-mixing adaptations to effectively process the linguistic diversity and complexities inherent in code-mixed data. This approach allows SentT to capture nuanced expressions of hope by integrating domain-specific knowledge into the transformer framework. Our methodology extends traditional transformer architectures by incorporating a unique ensemble of embeddings that encapsulates linguistic, cultural, and code-mixing attributes, significantly enhancing the model's sensitivity to context and cultural idioms. We validate our approach using the Hope Speech dataset for Equality, Diversity, and Inclusion (HopeEDI), which includes diverse comments from social media. The SentT model achieves an accuracy of 93.4%, a precision of 92.7%, and a recall of 94.1%, outperforming existing models and demonstrating its efficacy in handling the subtleties of hope speech in multilingual contexts. The model's architecture and the results of extensive evaluations underscore not only its effectiveness but also its potential as a scalable solution for similar tasks in other low-resource languages. Through this research, we contribute to the broader field of sentiment analysis by demonstrating the potential of tailored, context-aware models in enhancing the positivity and inclusiveness of digital communication.
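The abstract does not include an implementation, so the sketch below is a minimal, hypothetical illustration of the kind of architecture it describes: mBERT sentence embeddings concatenated with a small projection of auxiliary cultural/code-mixing features, feeding a classification head. The class name HopeSpeechClassifier, the auxiliary feature vector, and all layer sizes are assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch: combining mBERT embeddings with auxiliary
# cultural / code-mixing feature embeddings for hope-speech classification.
# Dimensions, feature inputs, and module names are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HopeSpeechClassifier(nn.Module):
    def __init__(self, num_aux_features: int = 8, num_labels: int = 2):
        super().__init__()
        # Multilingual BERT backbone for Tamil/Malayalam code-mixed text.
        self.encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        hidden = self.encoder.config.hidden_size  # 768 for mBERT
        # Small projection for hand-crafted cultural / code-mixing features
        # (e.g., script-switch counts, romanization ratio) -- assumed inputs.
        self.aux_proj = nn.Linear(num_aux_features, 64)
        self.classifier = nn.Sequential(
            nn.Linear(hidden + 64, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_labels),
        )

    def forward(self, input_ids, attention_mask, aux_features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # [CLS] representation
        aux = torch.relu(self.aux_proj(aux_features))
        return self.classifier(torch.cat([cls, aux], dim=-1))

# Usage on a single code-mixed comment (illustrative only).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = HopeSpeechClassifier()
batch = tokenizer(["nalla irukkum, hope for the best!"],
                  return_tensors="pt", padding=True, truncation=True)
aux = torch.zeros(1, 8)  # placeholder auxiliary feature vector
logits = model(batch["input_ids"], batch["attention_mask"], aux)
print(logits.shape)  # torch.Size([1, 2])
```

A model of this shape would typically be fine-tuned end to end with a cross-entropy loss on the HopeEDI labels; the reported accuracy, precision, and recall would then be computed on the held-out split.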
Source journal
Computer Speech and Language (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 11.30
Self-citation rate: 4.70%
Annual articles: 80
Review time: 22.9 weeks
Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.