Cross-lingual embedding methods and applications: A systematic review for low-resourced scenarios
Authors: Thapelo Sindane, Vukosi Marivate, Abiodun Modupe
DOI: 10.1016/j.nlp.2025.100157
Journal: Natural Language Processing Journal, Volume 12, Article 100157
Published: 2025-06-09 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S2949719125000330
Citations: 0
Abstract
The field of Natural Language Processing (NLP) has achieved significant success in areas such as large-scale dataset development, algorithmic advances, optimized computing capabilities, and refined individual and community expertise, particularly for languages such as English, French, and Spanish. However, these largely Global North strides have inadvertently created a substantial representation bias against the many languages categorized as low-resourced, the majority of which are African languages. As a result, rudimentary resources such as stopword lists, lemmatizers, stemmers, and word embeddings, as well as advanced multilingual transformer-based models, remain under-developed for these languages. Compounding these circumstances is the lack of insight into how such resources should be developed in low-resourced contexts (e.g., how to build embeddings for morphologically rich languages). Over time, research priorities shifted away from creating these resources directly, largely because of the high cost of remedying these gaps, giving rise to alternative methods such as cross-lingual transfer learning (CLTL). CLTL transfers knowledge gained from supervised training in one domain or language to another with limited supervision signals. This study conducts a systematic literature review of CLTL techniques in the context of cross-lingual models and embeddings, examining their mathematical foundations, application domains, evaluation metrics, language coverage, and latest developments. The findings offer insight into the present state of CLTL techniques and identify areas for future research to advance cross-lingual NLP, specifically in low-resourced settings.
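To make the idea of cross-lingual embedding alignment concrete, the following is a minimal illustrative sketch of one canonical technique in this literature: learning an orthogonal (Procrustes) mapping between two monolingual embedding spaces from a small bilingual seed dictionary. The toy data, dimensions, and variable names are assumptions for illustration only, not taken from the paper under review.

```python
import numpy as np

# Toy embedding matrices (rows = word vectors). Row i of X and row i of Y
# are assumed to be translation pairs from a small bilingual seed dictionary.
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 4))                         # target-language embeddings
W_true = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # hidden orthogonal map
X = Y @ W_true.T                                    # source embeddings (rotated copy)

# Orthogonal Procrustes: W* = argmin_{W orthogonal} ||X W - Y||_F.
# Closed form: W = U V^T, where U S V^T is the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

aligned = X @ W   # source vectors mapped into the target space
print(np.allclose(aligned, Y))  # True: the hidden rotation is recovered
```

With real embeddings the fit is only approximate, and mapped source vectors are matched to target vectors by nearest-neighbour search; restricting W to be orthogonal preserves distances within the source space, which is the main motivation for this family of methods.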