LakotaBERT: A Transformer-Based Model for the Low-Resource Lakota Language

Kanishka Parankusham, Rodrigue Rizk, K C Santosh
{"title":"LakotaBERT: Transformer based model for Low Resource Lakota Language","authors":"Kanishka Parankusham ,&nbsp;Rodrigue Rizk ,&nbsp;K C Santosh","doi":"10.1016/j.procs.2025.03.226","DOIUrl":null,"url":null,"abstract":"<div><div>Lakota, a critically endangered language of the Sioux people in North America, faces significant challenges due to declining fluency among younger generations. This paper presents the development of LakotaBERT, the first large language model (LLM) tailored for Lakota, aiming to support language revitalization efforts. Our research has two primary objectives: (1) to create a comprehensive Lakota language corpus and (2) to develop a customized LLM for Lakota. We compiled a diverse corpus of 105K sentences in Lakota, English, and parallel texts from various sources, such as books and websites, emphasizing the cultural significance and historical context of the Lakota language. Utilizing the RoBERTa architecture, we pre-trained our model and conducted comparative evaluations against established models such as RoBERTa, BERT, and multilingual BERT. Initial results demonstrate a masked language modeling accuracy of 51% with a single ground truth assumption, showcasing performance comparable to that of English-based models. We also evaluated the model using additional metrics, such as precision and F1 score, to provide a comprehensive assessment of its capabilities. By integrating AI and linguistic methodologies, we aspire to enhance linguistic diversity and cultural resilience, setting a valuable precedent for leveraging technology in the revitalization of other endangered indigenous languages.</div></div>","PeriodicalId":20465,"journal":{"name":"Procedia Computer Science","volume":"260 ","pages":"Pages 486-497"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Procedia Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1877050925009706","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Lakota, a critically endangered language of the Sioux people of North America, faces significant challenges due to declining fluency among younger generations. This paper presents LakotaBERT, the first large language model (LLM) tailored to Lakota, developed to support language revitalization efforts. Our research has two primary objectives: (1) to create a comprehensive Lakota language corpus and (2) to develop a customized LLM for Lakota. We compiled a diverse corpus of 105K sentences comprising Lakota, English, and parallel Lakota-English texts drawn from sources such as books and websites, with attention to the cultural significance and historical context of the Lakota language. Using the RoBERTa architecture, we pre-trained our model and evaluated it against established models, including RoBERTa, BERT, and multilingual BERT. Initial results show a masked language modeling accuracy of 51% under a single-ground-truth assumption, comparable to the performance of English-based models. We also report additional metrics, such as precision and F1 score, to provide a more complete assessment of the model's capabilities. By integrating AI and linguistic methodologies, we aim to strengthen linguistic diversity and cultural resilience, setting a precedent for leveraging technology in the revitalization of other endangered Indigenous languages.
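
The abstract gives no implementation details, but the pipeline it describes (training a tokenizer on a small corpus, then pre-training a RoBERTa-style masked language model from scratch) can be sketched with standard tooling. Below is a minimal, illustrative sketch using Hugging Face Transformers; the file paths, vocabulary size, and model hyperparameters are assumptions for illustration, not the authors' actual settings.

    # A hedged sketch, NOT the authors' released code: pre-train a small
    # RoBERTa-style MLM on a hypothetical sentence-per-line Lakota corpus.
    from datasets import load_dataset
    from tokenizers import ByteLevelBPETokenizer
    from transformers import (DataCollatorForLanguageModeling, RobertaConfig,
                              RobertaForMaskedLM, RobertaTokenizerFast,
                              Trainer, TrainingArguments)

    # 1. Train a byte-level BPE tokenizer on the raw corpus (path assumed).
    bpe = ByteLevelBPETokenizer()
    bpe.train(files=["lakota_corpus.txt"], vocab_size=30_000,
              special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
    bpe.save_model("lakota-tokenizer")
    tokenizer = RobertaTokenizerFast.from_pretrained("lakota-tokenizer")

    # 2. Load and tokenize the sentence-level corpus.
    dataset = load_dataset("text", data_files={"train": "lakota_corpus.txt"})
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    # 3. Pre-train from scratch with the standard MLM objective; the
    #    collator masks 15% of tokens on the fly, as in RoBERTa.
    config = RobertaConfig(vocab_size=tokenizer.vocab_size,
                           num_hidden_layers=6, hidden_size=512,
                           num_attention_heads=8,
                           max_position_embeddings=130)  # 128 + RoBERTa's offset of 2
    model = RobertaForMaskedLM(config)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                               mlm_probability=0.15)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="lakotabert",
                               per_device_train_batch_size=32,
                               num_train_epochs=10),
        train_dataset=tokenized["train"],
        data_collator=collator)
    trainer.train()

Likewise, the reported 51% masked language modeling accuracy "with a single ground truth assumption" is consistent with a top-1 exact-match evaluation: mask a token and count a hit only when the model's highest-probability prediction equals the original token. A sketch of such a check (the choice of which position to mask is an illustrative assumption):

    import torch

    def top1_mlm_accuracy(model, tokenizer, sentences, mask_index=1):
        # Mask the token at `mask_index` (illustrative choice) and score a
        # hit only if the top-1 prediction recovers the original token.
        model.eval()
        hits = 0
        for text in sentences:
            ids = tokenizer(text, return_tensors="pt")["input_ids"]
            gold = ids[0, mask_index].item()
            ids[0, mask_index] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(input_ids=ids).logits
            hits += int(logits[0, mask_index].argmax().item() == gold)
        return hits / len(sentences)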