{"title":"泰米尔语和马拉雅拉姆语希望语音检测的跨语言迁移学习新方法","authors":"Jothi Prakash V., Arul Antran Vijay S.","doi":"10.1016/j.csl.2025.101870","DOIUrl":null,"url":null,"abstract":"<div><div>In the field of Natural Language Processing (NLP), accurately identifying hope speech in low-resource languages such as Tamil and Malayalam poses significant challenges. This research introduces the Sentimix Transformer (SentT), a novel transformer-based model designed for detecting hope speech in YouTube comments composed in Tamil and Malayalam, two linguistically rich but computationally low-resource languages. The SentT model innovatively combines multilingual BERT (mBERT) embeddings with specialized cultural and code-mixing adaptations to effectively process the linguistic diversity and complexities inherent in code-mixed data. This approach allows SentT to capture nuanced expressions of hope by integrating domain-specific knowledge into the transformer framework. Our methodology extends traditional transformer architectures by incorporating a unique ensemble of embeddings that encapsulate linguistic, cultural, and code-mixing attributes, significantly enhancing the model’s sensitivity to context and cultural idioms. We validate our approach using the Hope Speech dataset for Equality, Diversity, and Inclusion (HopeEDI), which includes diverse comments from social media. The SentT model achieves an impressive accuracy of 93.4%, a precision of 92.7%, and a recall of 94.1% outperforming existing models and demonstrating its efficacy in handling the subtleties of hope speech in multilingual contexts. The model’s architecture and the results of extensive evaluations not only underscore its effectiveness but also its potential as a scalable solution for similar tasks in other low-resource languages. Through this research, we contribute to the broader field of sentiment analysis by demonstrating the potential of tailored, context-aware models in enhancing digital communication’s positivity and inclusiveness.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101870"},"PeriodicalIF":3.4000,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A novel approach to cross-linguistic transfer learning for hope speech detection in Tamil and Malayalam\",\"authors\":\"Jothi Prakash V., Arul Antran Vijay S.\",\"doi\":\"10.1016/j.csl.2025.101870\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In the field of Natural Language Processing (NLP), accurately identifying hope speech in low-resource languages such as Tamil and Malayalam poses significant challenges. This research introduces the Sentimix Transformer (SentT), a novel transformer-based model designed for detecting hope speech in YouTube comments composed in Tamil and Malayalam, two linguistically rich but computationally low-resource languages. The SentT model innovatively combines multilingual BERT (mBERT) embeddings with specialized cultural and code-mixing adaptations to effectively process the linguistic diversity and complexities inherent in code-mixed data. This approach allows SentT to capture nuanced expressions of hope by integrating domain-specific knowledge into the transformer framework. 
Our methodology extends traditional transformer architectures by incorporating a unique ensemble of embeddings that encapsulate linguistic, cultural, and code-mixing attributes, significantly enhancing the model’s sensitivity to context and cultural idioms. We validate our approach using the Hope Speech dataset for Equality, Diversity, and Inclusion (HopeEDI), which includes diverse comments from social media. The SentT model achieves an impressive accuracy of 93.4%, a precision of 92.7%, and a recall of 94.1% outperforming existing models and demonstrating its efficacy in handling the subtleties of hope speech in multilingual contexts. The model’s architecture and the results of extensive evaluations not only underscore its effectiveness but also its potential as a scalable solution for similar tasks in other low-resource languages. Through this research, we contribute to the broader field of sentiment analysis by demonstrating the potential of tailored, context-aware models in enhancing digital communication’s positivity and inclusiveness.</div></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"96 \",\"pages\":\"Article 101870\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-08-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230825000956\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230825000956","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A novel approach to cross-linguistic transfer learning for hope speech detection in Tamil and Malayalam
In the field of Natural Language Processing (NLP), accurately identifying hope speech in low-resource languages such as Tamil and Malayalam poses significant challenges. This research introduces the Sentimix Transformer (SentT), a novel transformer-based model designed for detecting hope speech in YouTube comments written in Tamil and Malayalam, two linguistically rich but computationally low-resource languages. The SentT model innovatively combines multilingual BERT (mBERT) embeddings with specialized cultural and code-mixing adaptations to effectively process the linguistic diversity and complexity inherent in code-mixed data. This approach allows SentT to capture nuanced expressions of hope by integrating domain-specific knowledge into the transformer framework. Our methodology extends traditional transformer architectures by incorporating a unique ensemble of embeddings that encapsulate linguistic, cultural, and code-mixing attributes, significantly enhancing the model's sensitivity to context and cultural idioms. We validate our approach using the Hope Speech dataset for Equality, Diversity, and Inclusion (HopeEDI), which includes diverse comments from social media. The SentT model achieves an accuracy of 93.4%, a precision of 92.7%, and a recall of 94.1%, outperforming existing models and demonstrating its efficacy in handling the subtleties of hope speech in multilingual contexts. The model's architecture and the results of extensive evaluations underscore not only its effectiveness but also its potential as a scalable solution for similar tasks in other low-resource languages. Through this research, we contribute to the broader field of sentiment analysis by demonstrating the potential of tailored, context-aware models in enhancing the positivity and inclusiveness of digital communication.
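The abstract does not disclose SentT's internals, but the core idea it describes, fusing mBERT sentence embeddings with auxiliary code-mixing signals before a hope-speech classifier, can be illustrated with a minimal sketch. Everything below (class name, feature choices, layer sizes, the sample comment) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch, assuming a simple fusion of the mBERT [CLS] embedding with
# hand-crafted code-mixing features (e.g., the share of Latin-script tokens in a
# Tamil/Malayalam comment). Not the published SentT architecture.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class HopeSpeechClassifier(nn.Module):
    def __init__(self, mbert_name="bert-base-multilingual-cased",
                 n_extra_features=2, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(mbert_name)
        hidden = self.encoder.config.hidden_size
        # Classifier over the concatenation of the sentence embedding and the
        # auxiliary code-mixing features.
        self.classifier = nn.Sequential(
            nn.Linear(hidden + n_extra_features, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, input_ids, attention_mask, extra_features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]               # [batch, hidden] sentence vector
        fused = torch.cat([cls, extra_features], dim=-1)
        return self.classifier(fused)

def code_mixing_features(text):
    """Toy features: fraction of Latin-script tokens and of native-script tokens."""
    tokens = text.split()
    latin = sum(t.isascii() for t in tokens)
    ratio = latin / max(len(tokens), 1)
    return torch.tensor([ratio, 1.0 - ratio])

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = HopeSpeechClassifier()
text = "nalla irukum bro, don't give up"                # hypothetical code-mixed comment
enc = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
logits = model(enc["input_ids"], enc["attention_mask"],
               code_mixing_features(text).unsqueeze(0))
```

In practice, the auxiliary features would be replaced by whatever cultural and code-mixing embeddings the full model ensembles; the sketch only shows how such signals can be concatenated with a multilingual transformer representation before classification.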
Journal description:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.