Lexical diversity as a lens into the classification of Slavic languages: A quantitative typology perspective


Digital Scholarship in the Humanities. Pub Date: 2023-06-09. DOI: 10.1093/llc/fqad042

Chenliang Zhou, Haitao Liu


Abstract

This study proposes a language classification method based on quantitative typology. It uses a large-scale multilingual parallel corpus to obtain valid classification results by excluding the influence of covariates such as text genre and semantic content in cross-language comparison. To achieve this, we model the type–token relationship of each Slavic parallel text and calculate lexical diversity as an approximation of the morphological complexity of the language. We then cluster the languages automatically on the basis of these lexical diversity metrics. Our findings show that (1) the lexical diversity metrics reflect well where a language is located on the ‘analytism-synthetism’ continuum; (2) the automatic clustering based on these metrics effectively reflects the genealogical classification of Slavic languages; and (3) the geographical distribution of lexical diversity in the region where Slavic languages are spoken shows a monotonically increasing trend from southwest to northeast, which is consistent with the pattern found by previous authors on a global scale. The approach taken in this study is data-driven, with the benefit of being independent of theoretical assumptions and easy to process by computer. It can offer better insight into corpus-based typology and may shed light on the understanding of language as a human-driven complex adaptive system.
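The pipeline the abstract outlines (type–token modelling, a lexical diversity score per language, then automatic clustering) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the abstract does not say which type–token model, which diversity metrics, or which clustering algorithm the authors use, so the sketch substitutes a plain type–token ratio and a naive centroid-distance agglomerative clustering. The function names and the placeholder "languages" with tokens `w1`, `w2`, ... are hypothetical and are not data from the study.

```python
from itertools import combinations


def type_token_curve(tokens):
    """Return the number of distinct types seen after each token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def type_token_ratio(tokens):
    """Plain TTR: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def cluster(scores, k):
    """Merge the two groups with the closest mean score until k remain.

    This is a stand-in for whatever clustering the study uses.
    """
    clusters = [[name] for name in sorted(scores)]

    def mean(group):
        return sum(scores[name] for name in group) / len(group)

    while len(clusters) > k:
        # combinations yields index pairs (a, b) with a < b
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: abs(mean(clusters[p[0]]) - mean(clusters[p[1]])),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Invented placeholder texts of equal length, standing in for parallel
# translations of the same content (NOT real corpus data).
toy_corpus = {
    "lang_A": "w1 w2 w1 w3 w1 w2 w1 w3 w1 w2".split(),
    "lang_B": "w1 w2 w3 w1 w4 w2 w1 w3 w1 w2".split(),
    "lang_C": "w1 w2 w3 w4 w5 w6 w7 w1 w8 w2".split(),
}

scores = {lang: type_token_ratio(toks) for lang, toks in toy_corpus.items()}
groups = cluster(scores, k=2)
print(scores)
print(groups)
```

Because the toy texts have the same length, their raw ratios are directly comparable. Plain TTR is sensitive to text length, so a real comparison would need length-controlled texts or a length-robust measure; using a parallel corpus, as the abstract describes, helps keep content and genre constant across languages.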


Source journal

Digital Scholarship in the Humanities

CiteScore: 1.80

Self-citation rate: 25.00%

Journal description: DSH, or Digital Scholarship in the Humanities, is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the humanities including, but not limited to, the field currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research and include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.