Making genealogical language classifications available for phylogenetic analysis

IF 1.2 0 LANGUAGE & LINGUISTICS

Language Dynamics and Change, Pub Date: 2018-06-22, DOI: 10.1163/22105832-00801001

D. Dediu


Citations: 7

Abstract



Making genealogical language classifications available for phylogenetic analysis

One of the best-known types of non-independence between languages is caused by genealogical relationships due to descent from a common ancestor. These can be represented by (more or less resolved and controversial) language family trees. In theory, one can argue that language families should be built through the strict application of the comparative method of historical linguistics, but in practice this is not always the case, and there are several proposed classifications of languages into language families, each with its own advantages and disadvantages. A major stumbling block shared by most of them is that they are relatively difficult to use with computational methods, and in particular with phylogenetics. This is due to their lack of standardization, coupled with the general non-availability of branch length information, which encapsulates the amount of evolution taking place on the family tree. In this paper I introduce a method (and its implementation in R) that converts the language classifications provided by four widely-used databases (Ethnologue, WALS, AUTOTYP and Glottolog) into the de facto Newick standard generally used in phylogenetics, aligns the four most used conventions for unique identifiers of linguistic entities (ISO 639-3, WALS, AUTOTYP and Glottocode), and adds branch length information from a variety of sources (the tree’s own topology, an externally given numeric constant, or a distance matrix). The R scripts, input data and resulting Newick trees are available under liberal open-source licenses in a GitHub repository (https://github.com/ddediu/lgfam-newick), to encourage and promote the use of phylogenetic methods to investigate linguistic diversity and its temporal dynamics.
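To make the conversion step concrete, the following is a minimal sketch of how a nested language classification can be serialized to Newick with a constant branch length, one of the three branch-length sources the abstract mentions. It is written in Python for illustration and is not the author's R implementation. The function names `to_newick` and `classification_to_newick`, the nested-dictionary input format, and the toy family are all assumptions made here, not taken from the paper or from any of the four databases.

```python
def to_newick(name, children, branch_length=None):
    """Recursively serialize one node of a nested {name: subtree} classification."""
    # Unquoted Newick labels cannot contain spaces, so replace them with underscores.
    label = name.replace(" ", "_")
    # Append ":<length>" only when a constant branch length was requested.
    suffix = "" if branch_length is None else f":{branch_length:g}"
    if not children:
        return f"{label}{suffix}"  # leaf, e.g. a single language
    inner = ",".join(
        to_newick(child, subtree, branch_length)
        for child, subtree in children.items()
    )
    return f"({inner}){label}{suffix}"  # internal node, e.g. a subgroup


def classification_to_newick(classification, branch_length=None):
    """Serialize a single-rooted classification and add the closing semicolon."""
    (root, subtree), = classification.items()  # exactly one root is expected
    return to_newick(root, subtree, branch_length) + ";"


# Toy classification, for illustration only (hypothetical input format).
toy = {
    "Germanic": {
        "West Germanic": {"English": {}, "German": {}},
        "North Germanic": {"Swedish": {}},
    }
}

topology_only = classification_to_newick(toy)
constant_length = classification_to_newick(toy, branch_length=1.0)
print(topology_only)
print(constant_length)
```

In a real pipeline the leaf labels would more likely be standardized identifiers such as ISO 639-3 codes or Glottocodes than language names. Branch lengths derived from the tree's topology or from a distance matrix would need a per-branch value instead of the single constant used in this sketch.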


Source journal

Language Dynamics and Change (LANGUAGE & LINGUISTICS)

CiteScore: 2.30

Self-citation rate: 0.00%


Journal description: Language Dynamics and Change (LDC) is an international peer-reviewed journal that covers both new and traditional aspects of the study of language change. Work on any language or language family is welcomed, as long as it bears on topics that are also of theoretical interest. A particular focus is on new developments in the field arising from the accumulation of extensive databases of dialect variation and typological distributions, spoken corpora, parallel texts, and comparative lexicons, which allow for the application of new types of quantitative approaches to diachronic linguistics. Moreover, the journal will serve as an outlet for increasingly important interdisciplinary work on such topics as the evolution of language, archaeology and linguistics (‘archaeolinguistics’), human genetic and linguistic prehistory, and the computational modeling of language dynamics.