Crouching TIGER, hidden structure: Exploring the nature of linguistic data using TIGER values

IF 2.1 0 LANGUAGE & LINGUISTICS

Journal of Language Evolution Pub Date : 2021-11-15 DOI:10.1093/jole/lzab004

K. Syrjänen, L. Maurits, Unni Leino, T. Honkola, J. Rota, O. Vesakoski


Citations: 6

Abstract

In recent years, techniques such as Bayesian inference of phylogeny have become a standard part of the quantitative linguistic toolkit. While these tools successfully model the tree-like component of a linguistic dataset, real-world datasets generally include a combination of tree-like and nontree-like signals. Alongside developing techniques for modeling nontree-like data, an important requirement for future quantitative work is to build a principled understanding of the structural complexity of linguistic datasets. Some techniques exist for exploring the general structure of a linguistic dataset, such as NeighborNets, δ scores, and Q-residuals; however, these methods have limitations and drawbacks. In general, the question of what kinds of historical structure a linguistic dataset can contain, and how these might be detected or measured, remains critically underexplored from an objective, quantitative perspective. In this article, we propose TIGER values, a metric that estimates the internal consistency of a genetic dataset, as an additional metric for assessing how tree-like a linguistic dataset is. We use TIGER values to explore simulated language data ranging from very tree-like to completely unstructured, and also use them to analyze a cognate-coded basic vocabulary dataset of Uralic languages. As a point of comparison for the TIGER values, we also explore the same data using δ scores, Q-residuals, and NeighborNets. Our results suggest that TIGER values are capable both of ranking tree-like datasets according to their degree of treelikeness and of distinguishing datasets with tree-like structure from datasets with nontree-like structure. Consequently, we argue that TIGER values serve as a useful metric for measuring the historical heterogeneity of datasets. Our results also highlight the complexities of measuring treelikeness from linguistic data, and how the metrics approach this question from different perspectives.
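The abstract does not spell out how a TIGER value is computed. As a rough illustration of the "internal consistency" idea, the sketch below follows one common partition-agreement formulation of TIGER rates: each character splits the languages into groups by state, and a character scores highly when the groups of the other characters fit inside its own groups. The function names, the input layout (one string of states per language), the absence of missing data, and the use of the mean rate as the dataset-level value are all assumptions made for this example, not details taken from the article.

```python
def partition(column):
    """Group taxon indices by character state (assumes no missing data)."""
    groups = {}
    for taxon, state in enumerate(column):
        groups.setdefault(state, set()).add(taxon)
    return list(groups.values())


def partition_agreement(p_i, p_j):
    """Fraction of the sets in p_j that are subsets of some set in p_i."""
    hits = sum(1 for b in p_j if any(b <= a for a in p_i))
    return hits / len(p_j)


def tiger_rates(matrix):
    """Return one rate per character.

    matrix: one sequence of character states per taxon (hypothetical layout),
    all of the same length.
    """
    columns = list(zip(*matrix))  # transpose: one tuple of states per character
    if len(columns) < 2:
        raise ValueError("need at least two characters to compare")
    parts = [partition(col) for col in columns]
    rates = []
    for i, p_i in enumerate(parts):
        # Average agreement of character i with every other character.
        scores = [partition_agreement(p_i, p_j)
                  for j, p_j in enumerate(parts) if j != i]
        rates.append(sum(scores) / len(scores))
    return rates


def tiger_value(matrix):
    """Summarise a dataset by its mean rate (an assumption of this sketch)."""
    rates = tiger_rates(matrix)
    return sum(rates) / len(rates)
```

Under these assumptions, two characters that split four languages identically, as in `tiger_value(["00", "00", "11", "11"])`, give 1.0, whereas two characters whose splits cut across each other, as in `tiger_value(["00", "01", "10", "11"])`, give 0.0.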




Source journal

Journal of Language Evolution Social Sciences-Linguistics and Language

CiteScore

4.50

Self-citation rate

7.70%

Articles published