Pre-Training MLM Using Bert for the Albanian Language

Labehat Kryeziu, Visar Shehu
{"title":"Pre-Training MLM Using Bert for the Albanian Language","authors":"Labehat Kryeziu, Visar Shehu","doi":"10.2478/seeur-2023-0035","DOIUrl":null,"url":null,"abstract":"Abstract Knowing that language is often used as a classifier of human intelligence and the development of systems that understand human language remains a challenge all the time (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study, where transformers have a key role. Transformers function based on neural networks and they are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models like ELMo (Matthew, et al., 2018) and BERT (Delvin, et al., 2019) are trained on large corpora of unlabeled text and as a result learning from text representations has achieved good performance on many of the underlying tasks on datasets from different domains. Pre-training in the language model has proven that there has been an improvement in some aspects of natural language processing, based on the paper (Dai & Le, 2015). In present paper, we will pre-train BERT on the task of Masked Language Modeling (MLM) with the Albanian language dataset (alb_dataset) that we have created for this purpose (Kryeziu et al., 2022). We will compare two approaches: training of BERT using the available OSCAR dataset and using our alb_dataset that we have collected. The paper shows some discrepancies during training, especially while evaluating the performance of the model.","PeriodicalId":332987,"journal":{"name":"SEEU Review","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SEEU Review","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2478/seeur-2023-0035","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Language is often used as a marker of human intelligence, and building systems that understand human language remains a long-standing challenge (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study in which transformers play a key role. Transformers are built on neural networks and are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models such as ELMo (Matthew et al., 2018) and BERT (Devlin et al., 2019) are trained on large corpora of unlabeled text, and the resulting text representations have achieved good performance on many downstream tasks across datasets from different domains. Language-model pre-training has also been shown to improve several aspects of natural language processing (Dai & Le, 2015). In the present paper, we pre-train BERT on the Masked Language Modeling (MLM) task with an Albanian-language dataset (alb_dataset) that we created for this purpose (Kryeziu et al., 2022). We compare two approaches: training BERT on the publicly available OSCAR dataset and training it on our collected alb_dataset. The paper also discusses some discrepancies observed during training, especially when evaluating the performance of the model.
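To make the described setup concrete, below is a minimal sketch of masked-language-model pre-training of BERT on an Albanian corpus using the Hugging Face `transformers` library. The paper does not publish its training script; the tokenizer name, data file, model size, and hyperparameters here are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch only: assumes a WordPiece tokenizer already trained on the Albanian
# corpus and the corpus stored as plain text, one sentence per line.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical paths/names for illustration.
tokenizer = BertTokenizerFast.from_pretrained("albanian-tokenizer")
raw = load_dataset("text", data_files={"train": "alb_dataset.txt"})

def tokenize(batch):
    # Truncate to a fixed length; 128 tokens is an assumed value.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT MLM objective: randomly mask 15% of input tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Train a BERT encoder from scratch (randomly initialized weights).
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

args = TrainingArguments(
    output_dir="bert-albanian-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The same script would apply to the OSCAR comparison run by pointing `data_files` at the Albanian portion of OSCAR instead of alb_dataset, keeping the tokenizer and hyperparameters fixed so that any performance differences can be attributed to the corpus.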