Correcting a bias in TIGER rates resulting from high amounts of invariant and singleton cognate sets


Journal of Language Evolution. Publication date: 2022-01-19. DOI: 10.1093/jole/lzab007

Johann-Mattis List



Abstract

In a recent issue of the Journal of Language Evolution, Syrjänen et al. (2021) investigate the suitability of computing Cummins and McInerney’s (2011) TIGER rates for estimating the tree-likeness of linguistic datasets compiled for phylogenetic reconstruction. The authors test the TIGER rates on a diverse sample of simulated data, which by and large confirms the usefulness of TIGER rates as an analytic tool for investigating linguistic data, but they test them on only one real-world dataset of Uralic languages, which turns out to behave quite differently from the simulated data. When testing the TIGER rates on additional datasets, I detected a bias in the computation which leads to an unnatural increase in the rates in those cases where a dataset contains many characters with invariant or singleton states. To overcome this problem, I suggest a modified variant of TIGER rates, which is provided in the form of a freely available Python package. Testing the modified TIGER scores on the simulated data of Syrjänen et al. shows that the corrected TIGER rates still readily distinguish between different degrees of tree-likeness. Testing them on a dataset in which the number of singletons and invariants was artificially increased further shows that the corrected TIGER rates are not influenced by the bias. A final test on seven linguistic datasets shows the usefulness of the corrected TIGER rates on a larger variety of linguistic datasets and illustrates the importance of taking specific aspects of linguistic data into account when using biological methods in the domain of language evolution.
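The bias described above can be illustrated with a minimal sketch. It assumes the partition-agreement formulation of the uncorrected TIGER rate as it is usually attributed to Cummins and McInerney (2011): each character splits the taxa into sets by state, and the rate of a character is the average share of sets in every other character that fit inside one of its own sets. The function names and the toy data are hypothetical, this is not the code of the Python package mentioned in the abstract, and the corrected variant is not reproduced here.

```python
from collections import defaultdict


def partition(character):
    """Group taxa by state: {taxon: state} -> list of frozensets of taxa."""
    groups = defaultdict(set)
    for taxon, state in character.items():
        groups[state].add(taxon)
    return [frozenset(group) for group in groups.values()]


def partition_agreement(p_i, p_j):
    """Share of the sets in p_j that are contained in some set of p_i."""
    hits = sum(1 for b in p_j if any(b <= a for a in p_i))
    return hits / len(p_j)


def tiger_rates(characters):
    """Uncorrected rate per character (assumed definition, see lead-in)."""
    parts = [partition(c) for c in characters]
    rates = []
    for i, p_i in enumerate(parts):
        scores = [
            partition_agreement(p_i, p_j)
            for j, p_j in enumerate(parts)
            if j != i
        ]
        rates.append(sum(scores) / len(scores))
    return rates


# Hypothetical toy data: four taxa, two characters that contradict each other.
c1 = {"A": 1, "B": 1, "C": 2, "D": 2}
c2 = {"A": 1, "B": 2, "C": 1, "D": 2}
# An invariant character (one state for all taxa) and a singleton
# character (every taxon has its own state).
invariant = {"A": 1, "B": 1, "C": 1, "D": 1}
singleton = {"A": 1, "B": 2, "C": 3, "D": 4}

base = tiger_rates([c1, c2])
padded = tiger_rates([c1, c2, invariant, singleton])

print("mean rate, conflicting characters only:", sum(base) / len(base))
print("mean rate, with invariant and singleton:", sum(padded) / len(padded))
```

Under this assumed definition, the invariant character agrees with every other character, and the singleton character raises the score of every other character, so the mean rate of the padded dataset rises although no tree-like signal was added.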



