Lexibank 2: pre-computed features for large-scale lexical data.

Open Research Europe. Pub Date: 2025-06-23. eCollection Date: 2025-01-01. DOI: 10.12688/openreseurope.20216.2

Frederic Blum, Carlos Barrientos, Johannes Englisch, Robert Forkel, Simon J Greenhill, Christoph Rzymski, Johann-Mattis List

Open Research Europe, volume 5, page 126. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12134731/pdf/

Citations: 0

Abstract



Large-scale lexical and grammatical datasets now play an important role in comparative linguistics. However, the lack of standardization remains a challenge that hampers the extension and reuse of published data. We present an updated version of Lexibank, a large-scale lexical dataset, expanding on previous efforts to standardize and unify cross-linguistic data. This new version includes over 3,100 languages and more than one-and-a-half million word forms, substantially broadening the scope and utility of the previous resource. Our dataset has been systematically curated using a dedicated computer-assisted workflow designed specifically for lifting published wordlist data to the standards recommended by the Cross-Linguistic Data Formats initiative. The expanded dataset features standardized references to language varieties, standardized semantic glosses that reference the concepts expressed by individual word forms, and standardized phonetic transcriptions for all word forms in the repository. Based on these standardizations, we pre-compute semantic and phonological features that can be used to carry out extensive automated analyses. We illustrate this potential by providing dedicated database queries to (1) infer words that are similar in pronunciation and meaning, (2) identify concepts that are colexified across languages in our sample, and (3) assess the semantic diversity of etymologically related words. These queries are not only fast to execute but also global in scope, owing to the large-scale coverage provided by Lexibank 2. The queries are also easy to extend and thus have the potential to contribute to various studies in historical linguistics, linguistic typology, and related disciplines. The updated dataset is a substantial step forward in the effort to create comprehensive, standardized, and accessible linguistic resources.
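To make query type (2) more concrete, the following is a minimal sketch of how a colexification query might look. It is not the query shipped with Lexibank 2: the table name `forms`, its columns, the language labels, and the word forms are all invented for illustration, and the real database schema may differ. The sketch treats two concepts as colexified in a language when the same form expresses both, and counts the languages in which each concept pair is colexified.

```python
import sqlite3

# Toy data, invented for illustration only: (language, concept, form).
# None of these rows come from Lexibank.
ROWS = [
    ("lang_a", "ARM", "tapa"),
    ("lang_a", "HAND", "tapa"),
    ("lang_a", "TREE", "kulu"),
    ("lang_b", "ARM", "mino"),
    ("lang_b", "HAND", "mino"),
    ("lang_b", "TREE", "sefa"),
    ("lang_b", "WOOD", "sefa"),
    ("lang_c", "ARM", "rado"),
    ("lang_c", "HAND", "piku"),
]

# Self-join: pair up rows that share language and form but differ in concept.
# The condition a.concept < b.concept keeps each unordered pair only once.
QUERY = """
SELECT a.concept, b.concept, COUNT(DISTINCT a.language) AS n_languages
FROM forms AS a
JOIN forms AS b
  ON a.language = b.language
 AND a.form = b.form
 AND a.concept < b.concept
GROUP BY a.concept, b.concept
ORDER BY n_languages DESC, a.concept, b.concept
"""


def colexifications(rows):
    """Return (concept_1, concept_2, n_languages) for each colexified pair."""
    con = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema; the actual Lexibank tables are not shown here.
        con.execute("CREATE TABLE forms (language TEXT, concept TEXT, form TEXT)")
        con.executemany("INSERT INTO forms VALUES (?, ?, ?)", rows)
        return con.execute(QUERY).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for concept_1, concept_2, n in colexifications(ROWS):
        print(concept_1, concept_2, n)
```

In this toy sample, ARM and HAND share a form in two of the three invented languages, and TREE and WOOD share a form in one. Matching on identical form strings is a simplification chosen for the sketch; a query over standardized phonetic transcriptions could compare segment sequences instead.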


Source journal

Open research Europe

CiteScore

1.50

Self-citation rate

0.00%

Articles published