MizBERT: A Mizo BERT Model

IF 1.8 · CAS Tier 4 (Computer Science) · JCR Q3 (Computer Science, Artificial Intelligence)
Robert Lalramhluna, Sandeep Dash, Dr. Partha Pakray
{"title":"MizBERT: A Mizo BERT Model","authors":"Robert Lalramhluna, Sandeep Dash, Dr.Partha Pakray","doi":"10.1145/3666003","DOIUrl":null,"url":null,"abstract":"<p>This research investigates the utilization of pre-trained BERT transformers within the context of the Mizo language. BERT, an abbreviation for Bidirectional Encoder Representations from Transformers, symbolizes Google’s forefront neural network approach to Natural Language Processing (NLP), renowned for its remarkable performance across various NLP tasks. However, its efficacy in handling low-resource languages such as Mizo remains largely unexplored. In this study, we introduce <i>MizBERT</i>, a specialized Mizo language model. Through extensive pre-training on a corpus collected from diverse online platforms, <i>MizBERT</i> has been tailored to accommodate the nuances of the Mizo language. Evaluation of <i>MizBERT’s</i> capabilities is conducted using two primary metrics: Masked Language Modeling (MLM) and Perplexity, yielding scores of 76.12% and 3.2565, respectively. Additionally, its performance in a text classification task is examined. Results indicate that <i>MizBERT</i> outperforms both the multilingual BERT (mBERT) model and the Support Vector Machine (SVM) algorithm, achieving an accuracy of 98.92%. This underscores <i>MizBERT’s</i> proficiency in understanding and processing the intricacies inherent in the Mizo language.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"32 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3666003","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

This research investigates the use of pre-trained BERT transformers for the Mizo language. BERT, short for Bidirectional Encoder Representations from Transformers, is Google's transformer-based approach to Natural Language Processing (NLP), known for strong performance across a wide range of NLP tasks; its effectiveness for low-resource languages such as Mizo, however, remains largely unexplored. In this study, we introduce MizBERT, a dedicated Mizo language model pre-trained on a corpus collected from diverse online platforms and thereby tailored to the nuances of the Mizo language. MizBERT is evaluated on two intrinsic metrics, masked language modeling (MLM) accuracy and perplexity, achieving 76.12% and 3.2565, respectively. Its performance on a downstream text classification task is also examined: MizBERT outperforms both the multilingual BERT (mBERT) model and a Support Vector Machine (SVM) baseline, reaching an accuracy of 98.92%. These results underscore MizBERT's ability to capture and process the intricacies of the Mizo language.
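The two intrinsic metrics quoted above, MLM accuracy and perplexity, can be estimated for any BERT-style checkpoint by masking tokens one at a time and scoring the model's predictions. The sketch below shows one common way to do this with the Hugging Face transformers library; the model identifier and the masking protocol are assumptions for illustration only, since the abstract does not spell out the exact evaluation procedure or release details.

```python
# Minimal sketch: pseudo-perplexity and MLM accuracy for a BERT-style MLM.
# The model ID below is a hypothetical Hugging Face Hub name for MizBERT;
# substitute the actual released checkpoint if it differs.
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "example-org/MizBERT"  # assumption, not the confirmed release name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()


def score_sentence(sentence: str) -> tuple[float, float]:
    """Mask each token in turn; return (pseudo-perplexity, MLM accuracy)."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls, correct = [], 0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())  # negative log-likelihood
        correct += int(log_probs.argmax().item() == input_ids[i].item())
    ppl = math.exp(sum(nlls) / len(nlls))
    acc = correct / len(nlls)
    return ppl, acc


if __name__ == "__main__":
    # Placeholder text only; substitute a real Mizo sentence when evaluating.
    print(score_sentence("This is a placeholder sentence."))
```

For the downstream classification result (98.92% accuracy versus mBERT and an SVM baseline), the same checkpoint would typically be fine-tuned with a sequence-classification head (e.g. AutoModelForSequenceClassification) on labeled Mizo text; the abstract does not describe the dataset or training setup, so any such script would likewise be an assumption.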

Source Journal
CiteScore: 3.60
Self-citation rate: 15.00%
Articles published: 241
Journal description: The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
- Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
- Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
- Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
- Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
- Machine Translation involving Asian or low-resource languages.
- Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
- Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
- Speech Processing: including text-to-speech synthesis and automatic speech recognition.
- Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
- Cross-lingual information processing involving Asian or low-resource languages.
Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.