Enhancing grammatical documentation for endangered languages with graph-based meaning representation and Loopy Belief Propagation
Sebastien Christian
Natural Language Processing Journal, volume 12, Article 100164. Published 2025-06-18.
DOI: 10.1016/j.nlp.2025.100164
https://www.sciencedirect.com/science/article/pii/S2949719125000408
Abstract
DIG4EL (Digital Inferential Grammars for Endangered Languages) is a method embodied in software designed to assist linguists and teachers in producing grammatical descriptions for endangered languages. DIG4EL integrates linguistic knowledge from extensive databases such as WALS and Grambank with automated observations of controlled data collected using Conversational Questionnaires.
Linguistic knowledge and automated observations provide priors to a Bayesian network of grammatical parameters, in which parameters are connected by directional conditional probability matrices derived from statistics on world languages. Unknown parameter values are inferred using Loopy Belief Propagation. In an experimental grammatical domain, determining the values of eight parameters related to canonical word order across 116 languages from diverse language families, inference reached an average accuracy of 76% and a median accuracy of 85%.
DIG4EL produces outputs as structured files for computational use, as Microsoft Word files, or as plain-language grammatical descriptions generated by a Large Language Model. These descriptions rely solely on vetted data and observed examples, with prompts crafted explicitly to prevent the model from introducing external information or hallucinations.
By leveraging probabilistic modeling and rich yet quickly assembled linguistic data, DIG4EL provides a powerful, accessible tool for creating grammatical descriptions and language teaching materials with minimal intervention from linguists. It significantly reduces the time and expertise that traditional documentation workflows require, so that endangered languages can be better documented and taught.
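The inference step described in the abstract (priors on grammatical parameters, conditional probability matrices between them, Loopy Belief Propagation over the resulting graph) can be illustrated with a minimal sum-product sketch. This is not DIG4EL's implementation: the function name, the three parameter names, and every number below are hypothetical, and treating each directional matrix P(b | a) as a pairwise potential on an undirected edge is a simplifying assumption made only for this example.

```python
import numpy as np

def loopy_belief_propagation(priors, cpms, n_iters=50, tol=1e-6):
    """Sum-product loopy belief propagation on a pairwise graph.

    priors: dict node -> sequence of prior weights over that node's values.
    cpms:   dict (a, b) -> matrix M with M[i, j] = P(b = j | a = i).
            Each pair of nodes is assumed to appear at most once.
    Returns dict node -> normalized belief vector.
    """
    # One potential per message direction: psi[(s, t)] is indexed [s_value, t_value].
    potentials = {}
    for (a, b), m in cpms.items():
        m = np.asarray(m, dtype=float)
        potentials[(a, b)] = m
        potentials[(b, a)] = m.T
    # Start from uniform messages.
    messages = {
        (s, t): np.ones(len(priors[t])) / len(priors[t]) for (s, t) in potentials
    }
    neighbors = {n: [s for (s, t) in potentials if t == n] for n in priors}

    for _ in range(n_iters):
        new_messages = {}
        for (s, t), psi in potentials.items():
            # Combine s's prior with every incoming message except the one from t.
            incoming = np.asarray(priors[s], dtype=float).copy()
            for u in neighbors[s]:
                if u != t:
                    incoming = incoming * messages[(u, s)]
            msg = incoming @ psi  # marginalize over the values of s
            new_messages[(s, t)] = msg / msg.sum()
        delta = max(np.abs(new_messages[k] - messages[k]).max() for k in messages)
        messages = new_messages
        if delta < tol:  # stop once messages have stabilized
            break

    beliefs = {}
    for n, prior in priors.items():
        b = np.asarray(prior, dtype=float).copy()
        for u in neighbors[n]:
            b = b * messages[(u, n)]
        beliefs[n] = b / b.sum()
    return beliefs

# Hypothetical binary parameters forming a loop; index 0 and 1 are two made-up values.
priors = {
    "verb_object": [0.9, 0.1],      # strongly supported by observations
    "adposition_noun": [0.5, 0.5],  # unknown
    "genitive_noun": [0.5, 0.5],    # unknown
}
# Made-up conditional probability matrices (rows sum to 1).
cpms = {
    ("verb_object", "adposition_noun"): [[0.8, 0.2], [0.3, 0.7]],
    ("adposition_noun", "genitive_noun"): [[0.8, 0.2], [0.2, 0.8]],
    ("verb_object", "genitive_noun"): [[0.7, 0.3], [0.4, 0.6]],
}
beliefs = loopy_belief_propagation(priors, cpms)
for name, belief in beliefs.items():
    print(name, np.round(belief, 3))
```

In this toy graph the well-observed parameter pulls the two unknown parameters toward the value it correlates with. Because the graph contains a cycle, the resulting beliefs are approximations rather than exact posteriors, which is the usual trade-off of Loopy Belief Propagation.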