LakotaBERT: A Transformer-Based Model for the Low-Resource Lakota Language

Kanishka Parankusham, Rodrigue Rizk, K C Santosh
{"title":"LakotaBERT: Transformer based model for Low Resource Lakota Language","authors":"Kanishka Parankusham ,&nbsp;Rodrigue Rizk ,&nbsp;K C Santosh","doi":"10.1016/j.procs.2025.03.226","DOIUrl":null,"url":null,"abstract":"<div><div>Lakota, a critically endangered language of the Sioux people in North America, faces significant challenges due to declining fluency among younger generations. This paper presents the development of LakotaBERT, the first large language model (LLM) tailored for Lakota, aiming to support language revitalization efforts. Our research has two primary objectives: (1) to create a comprehensive Lakota language corpus and (2) to develop a customized LLM for Lakota. We compiled a diverse corpus of 105K sentences in Lakota, English, and parallel texts from various sources, such as books and websites, emphasizing the cultural significance and historical context of the Lakota language. Utilizing the RoBERTa architecture, we pre-trained our model and conducted comparative evaluations against established models such as RoBERTa, BERT, and multilingual BERT. Initial results demonstrate a masked language modeling accuracy of 51% with a single ground truth assumption, showcasing performance comparable to that of English-based models. We also evaluated the model using additional metrics, such as precision and F1 score, to provide a comprehensive assessment of its capabilities. By integrating AI and linguistic methodologies, we aspire to enhance linguistic diversity and cultural resilience, setting a valuable precedent for leveraging technology in the revitalization of other endangered indigenous languages.</div></div>","PeriodicalId":20465,"journal":{"name":"Procedia Computer Science","volume":"260 ","pages":"Pages 486-497"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Procedia Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1877050925009706","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Lakota, a critically endangered language of the Sioux people of North America, faces significant challenges due to declining fluency among younger generations. This paper presents LakotaBERT, the first large language model (LLM) tailored to Lakota, developed to support language revitalization efforts. Our research has two primary objectives: (1) to create a comprehensive Lakota language corpus and (2) to develop a customized LLM for Lakota. We compiled a diverse corpus of 105K sentences comprising Lakota, English, and parallel Lakota-English texts drawn from sources such as books and websites, with attention to the cultural significance and historical context of the Lakota language. Using the RoBERTa architecture, we pre-trained our model and evaluated it against established models, including RoBERTa, BERT, and multilingual BERT. Initial results show a masked language modeling accuracy of 51% under a single-ground-truth assumption, comparable to the performance of English-based models. We also report additional metrics, such as precision and F1 score, to provide a more complete assessment of the model's capabilities. By integrating AI and linguistic methodologies, we aim to strengthen linguistic diversity and cultural resilience, setting a precedent for leveraging technology in the revitalization of other endangered Indigenous languages.
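
The abstract gives no implementation details, but the pipeline it describes (training a tokenizer on a small corpus, then pre-training a RoBERTa-style masked language model from scratch) can be sketched with standard tooling. Below is a minimal, illustrative sketch using Hugging Face Transformers; the file paths, vocabulary size, and model hyperparameters are assumptions for illustration, not the authors' actual settings.

    # A hedged sketch, NOT the authors' released code: pre-train a small
    # RoBERTa-style MLM on a hypothetical sentence-per-line Lakota corpus.
    from datasets import load_dataset
    from tokenizers import ByteLevelBPETokenizer
    from transformers import (DataCollatorForLanguageModeling, RobertaConfig,
                              RobertaForMaskedLM, RobertaTokenizerFast,
                              Trainer, TrainingArguments)

    # 1. Train a byte-level BPE tokenizer on the raw corpus (path assumed).
    bpe = ByteLevelBPETokenizer()
    bpe.train(files=["lakota_corpus.txt"], vocab_size=30_000,
              special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
    bpe.save_model("lakota-tokenizer")
    tokenizer = RobertaTokenizerFast.from_pretrained("lakota-tokenizer")

    # 2. Load and tokenize the sentence-level corpus.
    dataset = load_dataset("text", data_files={"train": "lakota_corpus.txt"})
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    # 3. Pre-train from scratch with the standard MLM objective; the
    #    collator masks 15% of tokens on the fly, as in RoBERTa.
    config = RobertaConfig(vocab_size=tokenizer.vocab_size,
                           num_hidden_layers=6, hidden_size=512,
                           num_attention_heads=8,
                           max_position_embeddings=130)  # 128 + RoBERTa's offset of 2
    model = RobertaForMaskedLM(config)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                               mlm_probability=0.15)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="lakotabert",
                               per_device_train_batch_size=32,
                               num_train_epochs=10),
        train_dataset=tokenized["train"],
        data_collator=collator)
    trainer.train()

Likewise, the reported 51% masked language modeling accuracy "with a single ground truth assumption" is consistent with a top-1 exact-match evaluation: mask a token and count a hit only when the model's highest-probability prediction equals the original token. A sketch of such a check (the choice of which position to mask is an illustrative assumption):

    import torch

    def top1_mlm_accuracy(model, tokenizer, sentences, mask_index=1):
        # Mask the token at `mask_index` (illustrative choice) and score a
        # hit only if the top-1 prediction recovers the original token.
        model.eval()
        hits = 0
        for text in sentences:
            ids = tokenizer(text, return_tensors="pt")["input_ids"]
            gold = ids[0, mask_index].item()
            ids[0, mask_index] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(input_ids=ids).logits
            hits += int(logits[0, mask_index].argmax().item() == gold)
        return hits / len(sentences)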