Improving linear orthogonal mapping based cross-lingual representation using ridge regression and graph centrality

IF 3.1 3区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Deepen Naorem, Sanasam Ranbir Singh, Priyankoo Sarmah
{"title":"Improving linear orthogonal mapping based cross-lingual representation using ridge regression and graph centrality","authors":"Deepen Naorem,&nbsp;Sanasam Ranbir Singh,&nbsp;Priyankoo Sarmah","doi":"10.1016/j.csl.2024.101640","DOIUrl":null,"url":null,"abstract":"<div><p>Orthogonal linear mapping is a commonly used approach for generating cross-lingual embedding between two monolingual corpora that uses a word frequency-based seed dictionary alignment approach. While this approach is found to be effective for isomorphic language pairs, they do not perform well for distant language pairs with different sentence structures and morphological properties. For a distance language pair, the existing frequency-aligned orthogonal mapping methods suffer from two problems - (i)the frequency of source and target word are not comparable, and (ii)different word pairs in the seed dictionary may have different contribution. Motivated by the above two concerns, this paper proposes a novel centrality-aligned ridge regression-based orthogonal mapping. The proposed method uses centrality-based alignment for seed dictionary selection and ridge regression framework for incorporating influential weights of different word pairs in the seed dictionary. From various experimental observations over five language pairs (both isomorphic and distant languages), it is evident that the proposed method outperforms baseline methods in the Bilingual Dictionary Induction(BDI) task, Sentence Retrieval Task(SRT), and Machine Translation. Further, several analyses are also included to support the proposed method.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"87 ","pages":"Article 101640"},"PeriodicalIF":3.1000,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000238","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Orthogonal linear mapping is a commonly used approach for generating cross-lingual embedding between two monolingual corpora that uses a word frequency-based seed dictionary alignment approach. While this approach is found to be effective for isomorphic language pairs, they do not perform well for distant language pairs with different sentence structures and morphological properties. For a distance language pair, the existing frequency-aligned orthogonal mapping methods suffer from two problems - (i)the frequency of source and target word are not comparable, and (ii)different word pairs in the seed dictionary may have different contribution. Motivated by the above two concerns, this paper proposes a novel centrality-aligned ridge regression-based orthogonal mapping. The proposed method uses centrality-based alignment for seed dictionary selection and ridge regression framework for incorporating influential weights of different word pairs in the seed dictionary. From various experimental observations over five language pairs (both isomorphic and distant languages), it is evident that the proposed method outperforms baseline methods in the Bilingual Dictionary Induction(BDI) task, Sentence Retrieval Task(SRT), and Machine Translation. Further, several analyses are also included to support the proposed method.

利用脊回归和图中心性改进基于线性正交映射的跨语言表示法
正交线性映射是在两个单语语料库之间生成跨语言嵌入的常用方法,它使用基于词频的种子词典对齐方法。虽然这种方法对同构语言对很有效,但对于句子结构和形态属性不同的远距离语言对,效果并不理想。对于远距离语言对,现有的频率对齐正交映射方法存在两个问题--(i) 源词和目标词的频率不具有可比性;(ii) 种子词典中的不同词对可能具有不同的贡献率。基于上述两个问题,本文提出了一种基于中心性对齐脊回归的新型正交映射方法。该方法使用基于中心性的对齐来选择种子词典,并使用脊回归框架来纳入种子词典中不同词对的影响权重。通过对五种语言对(同构语言和远源语言)的各种实验观察,可以明显看出所提出的方法在双语词典归纳(BDI)任务、句子检索任务(SRT)和机器翻译方面优于基线方法。此外,还包括几项分析,以支持所提出的方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Computer Speech and Language
Computer Speech and Language 工程技术-计算机:人工智能
CiteScore
11.30
自引率
4.70%
发文量
80
审稿时长
22.9 weeks
期刊介绍: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信