Language Embeddings Sometimes Contain Typological Generalizations

IF 9.3, Zone 2, Computer Science
Robert Östling, Murathan Kurfalı
Journal: Computational Linguistics. Publication type: Journal Article. Publication date: 2023-09-29. DOI: 10.1162/coli_a_00491 (https://doi.org/10.1162/coli_a_00491).
Citation count: 0

Abstract

To what extent can neural network models learn generalizations about language structure, and how do we find out what they have learned? We explore these questions by training neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1,295 languages. The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features obtained through annotation projection. We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most of our models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations. Careful attention to details in the evaluation turns out to be essential to avoid false positives. Furthermore, to encourage continued work in this field, we release several resources covering most or all of the languages in our data: (1) multiple sets of language representations, (2) multilingual word embeddings, (3) projected and predicted syntactic and morphological features, and (4) software to provide linguistically sound evaluations of language representations.
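The abstract stresses that evaluation details matter for avoiding false positives. One common precaution when probing language representations for a typological feature is to hold out entire language families, so that a probe cannot succeed merely by recognizing related languages. The sketch below illustrates that idea only; it is not taken from the paper. The language IDs, the family names, the "OV"/"VO" labels, the two-dimensional vectors, and the nearest-centroid probe are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def family_held_out_accuracy(embeddings, families, labels):
    """Leave-one-family-out accuracy of a nearest-centroid probe.

    embeddings: language ID -> vector (list of floats)
    families:   language ID -> family name
    labels:     language ID -> feature value to predict
    """
    correct = 0
    total = 0
    for held_out in sorted(set(families.values())):
        train = [lang for lang in embeddings if families[lang] != held_out]
        test = [lang for lang in embeddings if families[lang] == held_out]

        # Group training vectors by feature value.
        by_label = defaultdict(list)
        for lang in train:
            by_label[labels[lang]].append(embeddings[lang])
        if len(by_label) < 2:
            continue  # a probe needs at least two classes to choose between

        centroids = {lab: centroid(vecs) for lab, vecs in by_label.items()}
        for lang in test:
            predicted = min(
                centroids,
                key=lambda lab: sq_dist(embeddings[lang], centroids[lab]),
            )
            correct += predicted == labels[lang]
            total += 1
    return correct / total if total else float("nan")


# Hypothetical toy data: three made-up families with two languages each.
# The first dimension is constructed to encode the feature value.
EMBEDDINGS = {
    "a1": [1.0, 0.2], "a2": [-1.0, 0.1],
    "b1": [0.9, -0.3], "b2": [-1.1, 0.0],
    "c1": [1.2, 0.4], "c2": [-0.8, -0.2],
}
FAMILIES = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
LABELS = {"a1": "OV", "a2": "VO", "b1": "OV", "b2": "VO", "c1": "OV", "c2": "VO"}

print(family_held_out_accuracy(EMBEDDINGS, FAMILIES, LABELS))
```

In this toy setup the feature is deliberately encoded in the first dimension, so the probe generalizes across families. If the vectors encoded only family membership, the same held-out evaluation would be expected to score poorly, which is the kind of false positive a random train/test split can hide.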
Source journal
Computational Linguistics
Category: Computer Science - Artificial Intelligence
Self-citation rate: 0.00%
Articles published: 45
About the journal: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.