Investigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology

IF 3.7 | CAS Zone 2 (Computer Science) | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Rochelle Choenni, Ekaterina Shutova
{"title":"Investigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology","authors":"Rochelle Choenni, Ekaterina Shutova","doi":"10.1162/coli_a_00444","DOIUrl":null,"url":null,"abstract":"Abstract Multilingual sentence encoders have seen much success in cross-lingual model transfer for downstream NLP tasks. The success of this transfer is, however, dependent on the model’s ability to encode the patterns of cross-lingual similarity and variation. Yet, we know relatively little about the properties of individual languages or the general patterns of linguistic variation that the models encode. In this article, we investigate these questions by leveraging knowledge from the field of linguistic typology, which studies and documents structural and semantic variation across languages. We propose methods for separating language-specific subspaces within state-of-the-art multilingual sentence encoders (LASER, M-BERT, XLM, and XLM-R) with respect to a range of typological properties pertaining to lexical, morphological, and syntactic structure. Moreover, we investigate how typological information about languages is distributed across all layers of the models. Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies. In addition, we propose a simple method to study how shared typological properties of languages are encoded in two state-of-the-art multilingual models—M-BERT and XLM-R. The results provide insight into their information-sharing mechanisms and suggest that these linguistic properties are encoded jointly across typologically similar languages in these models.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"48 1","pages":"635-672"},"PeriodicalIF":3.7000,"publicationDate":"2022-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00444","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 12

Abstract

Multilingual sentence encoders have seen much success in cross-lingual model transfer for downstream NLP tasks. The success of this transfer is, however, dependent on the model’s ability to encode the patterns of cross-lingual similarity and variation. Yet, we know relatively little about the properties of individual languages or the general patterns of linguistic variation that the models encode. In this article, we investigate these questions by leveraging knowledge from the field of linguistic typology, which studies and documents structural and semantic variation across languages. We propose methods for separating language-specific subspaces within state-of-the-art multilingual sentence encoders (LASER, M-BERT, XLM, and XLM-R) with respect to a range of typological properties pertaining to lexical, morphological, and syntactic structure. Moreover, we investigate how typological information about languages is distributed across all layers of the models. Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies. In addition, we propose a simple method to study how shared typological properties of languages are encoded in two state-of-the-art multilingual models—M-BERT and XLM-R. The results provide insight into their information-sharing mechanisms and suggest that these linguistic properties are encoded jointly across typologically similar languages in these models.
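To make the kind of layer-wise typological analysis the abstract describes concrete, below is a minimal, hypothetical probing sketch: a frozen multilingual encoder (M-BERT, loaded via the Hugging Face transformers API) produces sentence representations at each layer, and a simple linear classifier is trained to predict a typological property of the sentence's language. The sentences, labels, and the SVO/SOV feature are illustrative placeholders, not the paper's data or exact method; the paper draws its properties from typological resources such as databases of structural features.

```python
# A minimal probing sketch (not the authors' exact method): train a linear
# classifier to predict a typological property of a sentence's language from
# frozen M-BERT representations, layer by layer.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentences, layer):
    """Mean-pooled hidden states from one layer of the frozen encoder."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden = out.hidden_states[layer]           # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)  # exclude padding from the mean
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy data: sentences labelled with a binary typological feature of their
# language (dominant SVO vs. SOV word order) -- hypothetical labels only.
train_sents = ["The cat chased the mouse.", "Die Katze hat die Maus gejagt."]
train_labels = [0, 1]   # 0 = SVO, 1 = (partly) SOV
test_sents, test_labels = ["A dog bit the man."], [0]

for layer in range(13):  # embedding layer + 12 transformer layers
    probe = LogisticRegression(max_iter=1000).fit(embed(train_sents, layer), train_labels)
    acc = accuracy_score(test_labels, probe.predict(embed(test_sents, layer)))
    print(f"layer {layer:2d}: probe accuracy {acc:.2f}")
```

Comparing probe accuracy across layers indicates where in the network a typological property is most linearly recoverable, which is the style of layer-wise analysis the abstract refers to.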
Source Journal

Computational Linguistics (Engineering & Technology - Computer Science: Interdisciplinary Applications)
CiteScore: 15.80
Self-citation rate: 0.00%
Articles per year: 45
Review time: >12 weeks

Journal description: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.