Cross-lingual embedding methods and applications: A systematic review for low-resourced scenarios

Thapelo Sindane, Vukosi Marivate, Abiodun Modupe
Natural Language Processing Journal, Volume 12, Article 100157. Published 2025-06-09.
DOI: 10.1016/j.nlp.2025.100157
URL: https://www.sciencedirect.com/science/article/pii/S2949719125000330

Abstract

The field of Natural Language Processing (NLP) has achieved significant success in areas such as large-scale dataset development, algorithmic complexity, optimized computing capabilities, and refined individual and community expertise, particularly for languages such as English, French, and Spanish. However, these advances, concentrated in the Global North, have inadvertently created a substantial representation bias against the many languages categorized as low-resourced, with the majority being African languages. As a result, rudimentary resources such as stopword lists, lemmatizers, stemmers, and word embeddings, as well as advanced multilingual transformer-based models, remain under-developed for these languages. Compounding the problem is a lack of insight into how to develop these resources in the low-resourced context (e.g., how to develop embeddings for morphologically rich languages). Because remedying these gaps directly is costly, research priorities shifted away from creating the resources themselves, leading to the rise of alternative methods such as cross-lingual transfer learning (CLTL). CLTL involves transferring knowledge gained from supervised training in one domain to a domain with limited supervision signals. This study conducts a systematic literature review of CLTL techniques in the context of cross-lingual models and embeddings, examining their mathematical foundations, application domains, evaluation metrics, languages covered, and latest developments. The findings offer insight into the current state of CLTL techniques and identify areas for future research and development to advance cross-lingual NLP applications, specifically in low-resourced settings.