Yurong Wang, Min Lin, Qitu Hu, Shuangcheng Bai, Yanling Li, Longjie Bao
A domain-specific cross-lingual semantic alignment learning model for low-resource languages
Neural Networks, Volume 194, Article 108114
DOI: 10.1016/j.neunet.2025.108114
Published: 2025-09-16 · Impact Factor: 6.3 · JCR: Q1 (Computer Science, Artificial Intelligence)
https://www.sciencedirect.com/science/article/pii/S0893608025009943
Citations: 0
Abstract
Cross-lingual semantic alignment models facilitate the sharing and utilization of multilingual domain-specific data (e.g., medical, legal), offering cost-effective solutions for improving low-resource language tasks. However, existing methods are challenged by parallel data scarcity, semantic space heterogeneity, morphological complexity, and weak robustness, particularly for agglutinative languages. To address these issues, this paper proposes CLWKD, a cross-lingual mapping and knowledge distillation framework. CLWKD leverages domain-specific pretrained models from high-resource languages as teachers and integrates multi-granularity alignment matrices with limited parallel data to guide cross-lingual knowledge transfer. CLWKD jointly learns multi-granularity semantic alignment mapping matrices at the token, word, and sentence levels from general-domain data, which eases domain data scarcity and helps bridge structural gaps caused by morphological and syntactic differences. To alleviate data sparsity and out-of-vocabulary issues in agglutinative languages, multilingual embedding sharing and morphological segmentation strategies are introduced. To improve the stability of unsupervised mapping training, generator pretraining is introduced and further combined with high-confidence word and sentence pairs to optimize the mapping matrix. To preserve alignment with fewer parameters, a parameter recycling and embedding bottleneck design is adopted. Experiments across the medical, legal, and educational domains on Mongolian-Chinese and Korean-Chinese language pairs demonstrate the effectiveness of CLWKD in three cross-lingual tasks.
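The abstract describes learning semantic alignment mapping matrices between the embedding spaces of two languages from a limited set of aligned pairs. The paper's exact training procedure is not given here; as a hedged illustration, the sketch below uses the standard closed-form orthogonal Procrustes solution, a common baseline for fitting such a word-level mapping matrix from a small seed dictionary. All names and the toy data are assumptions for illustration only.

```python
import numpy as np

# Toy setup: n aligned word pairs, each with a d-dimensional embedding
# in the source and target language. In the ideal case the target
# embeddings are an orthogonal rotation of the source embeddings.
rng = np.random.default_rng(0)
d, n = 4, 10
X = rng.normal(size=(n, d))                    # source-language embeddings
Q_true = np.linalg.qr(rng.normal(size=(d, d)))[0]  # hidden "true" rotation
Y = X @ Q_true                                  # target-language embeddings

# Orthogonal Procrustes: the W minimizing ||XW - Y||_F subject to
# W being orthogonal is W = U V^T, where U S V^T = svd(X^T Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(X @ W, Y))   # → True: the mapping recovers the alignment
print(np.allclose(W @ W.T, np.eye(d)))  # → True: W stays orthogonal
```

An orthogonality constraint like this keeps the mapping an isometry, so distances in the source space are preserved after projection; a full framework such as the one described would additionally refine the matrix with high-confidence word and sentence pairs rather than rely on a single closed-form fit.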
Journal description:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.