Investigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology

IF 5.3 | Region 2, Computer Science | JCR Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics Pub Date : 2022-04-13 DOI:10.1162/coli_a_00444

Rochelle Choenni, Ekaterina Shutova

Computational Linguistics, volume 48, pages 635-672 (journal article)

Citations: 12

Abstract

Multilingual sentence encoders have seen much success in cross-lingual model transfer for downstream NLP tasks. The success of this transfer, however, depends on the model’s ability to encode the patterns of cross-lingual similarity and variation. Yet we know relatively little about the properties of individual languages or the general patterns of linguistic variation that the models encode. In this article, we investigate these questions by leveraging knowledge from the field of linguistic typology, which studies and documents structural and semantic variation across languages. We propose methods for separating language-specific subspaces within state-of-the-art multilingual sentence encoders (LASER, M-BERT, XLM, and XLM-R) with respect to a range of typological properties pertaining to lexical, morphological, and syntactic structure. Moreover, we investigate how typological information about languages is distributed across all layers of the models. Our results show differences in how linguistic variation is encoded that are associated with different pretraining strategies. In addition, we propose a simple method to study how shared typological properties of languages are encoded in two state-of-the-art multilingual models, M-BERT and XLM-R. The results provide insight into their information-sharing mechanisms and suggest that these linguistic properties are encoded jointly across typologically similar languages in these models.
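The abstract does not spell out the probing procedure, so the following is only a generic illustration of the layer-wise question it raises: can a typological property of a language be predicted from representations taken at a given layer? It is a minimal sketch, not the authors' method. The nearest-centroid probe, the leave-one-language-out evaluation, the names `probe_accuracy` and `TOY_LAYERS`, and all vectors are assumptions made for illustration; in practice the vectors would be pooled sentence representations extracted from an encoder. Only the dominant word-order labels reflect standard typological classification.

```python
from statistics import mean

# Dominant word order per language (standard typological classification).
WORD_ORDER = {"en": "SVO", "fr": "SVO", "es": "SVO",
              "ja": "SOV", "tr": "SOV", "hi": "SOV"}

# Hypothetical 2-d "language vectors" for two layers. These numbers are made
# up: layer_a separates the two word orders, layer_b does not.
TOY_LAYERS = {
    "layer_a": {"en": [0.9, 0.1], "fr": [0.8, 0.2], "es": [1.0, 0.0],
                "ja": [0.1, 0.9], "tr": [0.2, 0.8], "hi": [0.0, 1.0]},
    "layer_b": {"en": [0.0, 0.0], "fr": [1.0, 0.0], "es": [0.5, -0.1],
                "ja": [0.1, 0.0], "tr": [0.9, 0.0], "hi": [0.5, 0.1]},
}


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(component) for component in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(embeddings, labels):
    """Leave-one-language-out accuracy of a nearest-centroid probe.

    embeddings: dict mapping language -> vector from one layer
    labels:     dict mapping language -> typological feature value
    """
    correct = 0
    for held_out, target in embeddings.items():
        # One centroid per feature value, built without the held-out language.
        by_value = {}
        for lang, vec in embeddings.items():
            if lang != held_out:
                by_value.setdefault(labels[lang], []).append(vec)
        centroids = {value: centroid(vecs) for value, vecs in by_value.items()}
        predicted = min(
            centroids, key=lambda v: squared_distance(centroids[v], target)
        )
        correct += predicted == labels[held_out]
    return correct / len(embeddings)


for layer_name, layer_embeddings in TOY_LAYERS.items():
    print(layer_name, probe_accuracy(layer_embeddings, WORD_ORDER))
```

Holding out the test language means the probe has to generalize from other languages that share the property, which is one simple way to ask whether a property is encoded jointly across typologically similar languages. Comparing the score across layers then gives a rough picture of where that information is most accessible.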




Source journal

Computational Linguistics (Engineering and Technology - Computer Science: Interdisciplinary Applications)

CiteScore : 15.80

Self-citation rate : 0.00%

Review time : >12 weeks

About the journal: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.