Yurong Wang, Min Lin, Qitu Hu, Shuangcheng Bai, Yanling Li, Longjie Bao
A domain-specific cross-lingual semantic alignment learning model for low-resource languages
Neural Networks, Volume 194, Article 108114
DOI: 10.1016/j.neunet.2025.108114
Published: 2025-09-16 · Impact Factor: 6.3 · JCR: Q1 (Computer Science, Artificial Intelligence)
https://www.sciencedirect.com/science/article/pii/S0893608025009943
Citations: 0
Abstract
Cross-lingual semantic alignment models facilitate the sharing and utilization of multilingual domain-specific data (e.g., medical, legal), offering cost-effective solutions for improving low-resource language tasks. However, existing methods are challenged by parallel data scarcity, semantic space heterogeneity, morphological complexity, and weak robustness, particularly for agglutinative languages. To address these issues, this paper proposes CLWKD, a cross-lingual mapping and knowledge distillation framework. CLWKD leverages domain-specific pretrained models from high-resource languages as teachers and integrates multi-granularity alignment matrices with limited parallel data to guide cross-lingual knowledge transfer. CLWKD jointly learns multi-granularity semantic alignment mapping matrices at the token, word, and sentence levels from general-domain data, which eases domain data scarcity and helps bridge structural gaps caused by morphological and syntactic differences. To alleviate data sparsity and out-of-vocabulary issues in agglutinative languages, multilingual embedding sharing and morphological segmentation strategies are introduced. To improve the stability of unsupervised mapping training, generator pretraining is introduced and further combined with high-confidence word and sentence pairs to optimize the mapping matrix. To preserve alignment with fewer parameters, a parameter recycling and embedding bottleneck design is adopted. Experiments across the medical, legal, and educational domains on Mongolian-Chinese and Korean-Chinese language pairs demonstrate the effectiveness of CLWKD in three cross-lingual tasks.
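The abstract describes learning semantic alignment mapping matrices between the embedding spaces of two languages from a limited set of aligned pairs. The paper's exact training procedure is not given here; as a hedged illustration, the sketch below uses the standard closed-form orthogonal Procrustes solution, a common baseline for fitting such a word-level mapping matrix from a small seed dictionary. All names and the toy data are assumptions for illustration only.

```python
import numpy as np

# Toy setup: n aligned word pairs, each with a d-dimensional embedding
# in the source and target language. In the ideal case the target
# embeddings are an orthogonal rotation of the source embeddings.
rng = np.random.default_rng(0)
d, n = 4, 10
X = rng.normal(size=(n, d))                    # source-language embeddings
Q_true = np.linalg.qr(rng.normal(size=(d, d)))[0]  # hidden "true" rotation
Y = X @ Q_true                                  # target-language embeddings

# Orthogonal Procrustes: the W minimizing ||XW - Y||_F subject to
# W being orthogonal is W = U V^T, where U S V^T = svd(X^T Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(X @ W, Y))   # → True: the mapping recovers the alignment
print(np.allclose(W @ W.T, np.eye(d)))  # → True: W stays orthogonal
```

An orthogonality constraint like this keeps the mapping an isometry, so distances in the source space are preserved after projection; a full framework such as the one described would additionally refine the matrix with high-confidence word and sentence pairs rather than rely on a single closed-form fit.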
Journal description:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.