Correcting a bias in TIGER rates resulting from high amounts of invariant and singleton cognate sets

Johann-Mattis List

Journal of Language Evolution (Language & Linguistics, impact factor 2.1). Journal article, published 2022-01-19. DOI: 10.1093/jole/lzab007 (https://doi.org/10.1093/jole/lzab007). Citation count: 1.

Abstract

In a recent issue of the Journal of Language Evolution, Syrjänen et al. (2021) investigate how suitable Cummins and McInerney’s (2011) TIGER rates are for estimating the tree-likeness of linguistic datasets compiled for phylogenetic reconstruction. The authors test the TIGER rates on a diverse sample of simulated data, which by and large confirms their usefulness as an analytic tool for investigating linguistic data. However, they test them on only one real-world dataset, of Uralic languages, which turns out to behave quite differently from the simulated data. When testing the TIGER rates on additional datasets, I detected a bias in the computation: the rates increase unnaturally when a dataset contains many characters with invariant or singleton states. To overcome this problem, I suggest a modified variant of TIGER rates, which is provided in the form of a freely available Python package. Testing the modified TIGER rates on the simulated data of Syrjänen et al. shows that the corrected rates still readily distinguish between different degrees of tree-likeness. Testing them on a dataset in which the number of singletons and invariants was artificially increased further shows that the corrected rates are not influenced by the bias. A final test on seven linguistic datasets shows the usefulness of the corrected TIGER rates on a larger variety of linguistic data and illustrates the importance of taking specific aspects of linguistic data into account when using biological methods in the domain of language evolution.
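The bias described in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly cited definition of the uncorrected TIGER rate: each character is turned into a set partition of the taxa, the agreement of character i with character j is the share of sets in j's partition that are contained in some set of i's partition, and the rate of i is its mean agreement with all other characters. The function names and the toy data are hypothetical. This is not the code of the Python package mentioned in the abstract, and the correction proposed in the paper is not reproduced here.

```python
from fractions import Fraction


def partition(character):
    """Group taxon indices by state, turning a character into a set partition."""
    groups = {}
    for taxon, state in enumerate(character):
        groups.setdefault(state, set()).add(taxon)
    return [frozenset(group) for group in groups.values()]


def agreement(part_i, part_j):
    """Share of sets in part_j that are contained in some set of part_i."""
    hits = sum(1 for b in part_j if any(b <= a for a in part_i))
    return Fraction(hits, len(part_j))


def tiger_rates(characters):
    """Uncorrected rate per character: mean agreement with all other characters."""
    parts = [partition(c) for c in characters]
    n = len(parts)
    return [
        sum(agreement(parts[i], parts[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def mean_rate(characters):
    """Average of the per-character rates for a whole dataset."""
    rates = tiger_rates(characters)
    return sum(rates) / len(rates)


# Toy data (assumed for illustration): four taxa, three mutually conflicting
# characters, so no character agrees with any other.
conflicting = ["AABB", "ABAB", "ABBA"]

# The same data padded with one invariant and one singleton-only character.
padded = conflicting + ["AAAA", "ABCD"]

print(mean_rate(conflicting))  # mean rate of the conflicting characters alone
print(mean_rate(padded))       # mean rate after padding
```

In this toy case the mean rate rises from 0 to 7/20, although the added characters carry no information about the tree. Under the assumed definition, the invariant character agrees with every other character, and the singleton-only character raises the rate of every other character, which is the kind of inflation the abstract refers to.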