Enhancing grammatical documentation for endangered languages with graph-based meaning representation and Loopy Belief Propagation

Sebastien Christian
Natural Language Processing Journal, volume 12, Article 100164. Published 2025-06-18.
DOI: 10.1016/j.nlp.2025.100164
https://www.sciencedirect.com/science/article/pii/S2949719125000408

Abstract

DIG4EL (Digital Inferential Grammars for Endangered Languages) is a method, implemented in software, that helps linguists and teachers produce grammatical descriptions of endangered languages. DIG4EL combines linguistic knowledge from extensive databases such as WALS and Grambank with automated observations of controlled data collected using Conversational Questionnaires.
Linguistic knowledge and automated observations provide priors to a Bayesian network of grammatical parameters, in which parameters are connected by directional conditional probability matrices derived from statistics on the world's languages. Unknown parameter values are inferred with Loopy Belief Propagation. In an experimental grammatical domain, inferring the values of eight parameters related to canonical word order across 116 languages from diverse language families, the method achieved an average accuracy of 76% and a median accuracy of 85%.
DIG4EL produces output as structured files for computational use, as Microsoft Word files, or as plain-language grammatical descriptions generated by a Large Language Model. These descriptions rely solely on vetted data and observed examples, with prompts written explicitly to keep out external information and hallucinations.
By combining probabilistic modeling with rich yet quickly assembled linguistic data, DIG4EL provides an accessible tool for creating grammatical descriptions and language teaching materials with minimal intervention from linguists. It significantly reduces the time and expertise that traditional documentation workflows require, helping endangered languages to be better documented and taught.
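
The inference step described in the abstract (priors on grammatical parameters, conditional probability matrices between parameters, Loopy Belief Propagation over a network that may contain cycles) can be illustrated with a minimal sketch. This is not DIG4EL's implementation: the function `loopy_bp`, the three parameter names, and every number below are hypothetical and invented for illustration, and the sketch treats each conditional probability matrix as a pairwise potential in a generic sum-product scheme, which is an assumption about how such matrices might be used.

```python
import numpy as np

def loopy_bp(priors, edges, n_iters=50, tol=1e-9):
    """Sum-product loopy belief propagation on a pairwise model (illustrative).

    priors: dict node -> 1-D array of prior probabilities over the node's values.
    edges:  dict (a, b) -> matrix M, M[i, j] read as P(b = j | a = i) and used
            here as a pairwise potential (an assumption of this sketch).
    Returns dict node -> normalized belief vector.
    """
    # Store each potential in both directions so messages can flow both ways.
    potentials = {}
    for (a, b), m in edges.items():
        m = np.asarray(m, dtype=float)
        potentials[(a, b)] = m      # indexed [value of a, value of b]
        potentials[(b, a)] = m.T    # indexed [value of b, value of a]

    # Start with uniform messages on every directed edge.
    messages = {
        (s, t): np.full(len(priors[t]), 1.0 / len(priors[t]))
        for (s, t) in potentials
    }
    neighbors = {n: [s for (s, t) in potentials if t == n] for n in priors}

    for _ in range(n_iters):
        new_messages = {}
        for (s, t), pot in potentials.items():
            # Combine the prior at s with messages into s from all nodes but t.
            incoming = np.asarray(priors[s], dtype=float).copy()
            for u in neighbors[s]:
                if u != t:
                    incoming = incoming * messages[(u, s)]
            msg = incoming @ pot    # marginalize over the values of s
            new_messages[(s, t)] = msg / msg.sum()
        delta = max(np.abs(new_messages[k] - messages[k]).max() for k in messages)
        messages = new_messages
        if delta < tol:             # stop once messages no longer change
            break

    # Belief at each node: prior times all incoming messages, normalized.
    beliefs = {}
    for n, prior in priors.items():
        b = np.asarray(prior, dtype=float).copy()
        for u in neighbors[n]:
            b = b * messages[(u, n)]
        beliefs[n] = b / b.sum()
    return beliefs

# Hypothetical three-parameter cycle; names and numbers are made up,
# not taken from WALS, Grambank, or the paper.
priors = {
    "verb_object": np.array([0.9, 0.1]),      # informed by observations
    "adposition_noun": np.array([0.5, 0.5]),  # unknown
    "genitive_noun": np.array([0.5, 0.5]),    # unknown
}
edges = {
    ("verb_object", "adposition_noun"): [[0.8, 0.2], [0.2, 0.8]],
    ("adposition_noun", "genitive_noun"): [[0.7, 0.3], [0.3, 0.7]],
    ("genitive_noun", "verb_object"): [[0.6, 0.4], [0.4, 0.6]],
}
beliefs = loopy_bp(priors, edges)
for name, belief in beliefs.items():
    print(name, np.round(belief, 3))
```

Because the graph contains a cycle, the resulting beliefs are approximations rather than exact marginals, and convergence is not guaranteed in general; the sketch simply caps the number of iterations.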
