Pre-Training MLM Using Bert for the Albanian Language

Labehat Kryeziu, Visar Shehu
{"title":"Pre-Training MLM Using Bert for the Albanian Language","authors":"Labehat Kryeziu, Visar Shehu","doi":"10.2478/seeur-2023-0035","DOIUrl":null,"url":null,"abstract":"Abstract Knowing that language is often used as a classifier of human intelligence and the development of systems that understand human language remains a challenge all the time (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study, where transformers have a key role. Transformers function based on neural networks and they are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models like ELMo (Matthew, et al., 2018) and BERT (Delvin, et al., 2019) are trained on large corpora of unlabeled text and as a result learning from text representations has achieved good performance on many of the underlying tasks on datasets from different domains. Pre-training in the language model has proven that there has been an improvement in some aspects of natural language processing, based on the paper (Dai & Le, 2015). In present paper, we will pre-train BERT on the task of Masked Language Modeling (MLM) with the Albanian language dataset (alb_dataset) that we have created for this purpose (Kryeziu et al., 2022). We will compare two approaches: training of BERT using the available OSCAR dataset and using our alb_dataset that we have collected. The paper shows some discrepancies during training, especially while evaluating the performance of the model.","PeriodicalId":332987,"journal":{"name":"SEEU Review","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SEEU Review","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2478/seeur-2023-0035","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Language is often used as a marker of human intelligence, and building systems that understand human language remains a long-standing challenge (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study in which transformers play a key role. Transformers are built on neural networks and are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models such as ELMo (Matthew et al., 2018) and BERT (Devlin et al., 2019) are trained on large corpora of unlabeled text, and the resulting text representations have achieved good performance on many downstream tasks across datasets from different domains. Language-model pre-training has also been shown to improve several aspects of natural language processing (Dai & Le, 2015). In the present paper, we pre-train BERT on the Masked Language Modeling (MLM) task with an Albanian-language dataset (alb_dataset) that we created for this purpose (Kryeziu et al., 2022). We compare two approaches: training BERT on the publicly available OSCAR dataset and training it on our collected alb_dataset. The paper also discusses some discrepancies observed during training, especially when evaluating the performance of the model.
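To make the described setup concrete, below is a minimal sketch of masked-language-model pre-training of BERT on an Albanian corpus using the Hugging Face `transformers` library. The paper does not publish its training script; the tokenizer name, data file, model size, and hyperparameters here are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch only: assumes a WordPiece tokenizer already trained on the Albanian
# corpus and the corpus stored as plain text, one sentence per line.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical paths/names for illustration.
tokenizer = BertTokenizerFast.from_pretrained("albanian-tokenizer")
raw = load_dataset("text", data_files={"train": "alb_dataset.txt"})

def tokenize(batch):
    # Truncate to a fixed length; 128 tokens is an assumed value.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT MLM objective: randomly mask 15% of input tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Train a BERT encoder from scratch (randomly initialized weights).
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

args = TrainingArguments(
    output_dir="bert-albanian-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The same script would apply to the OSCAR comparison run by pointing `data_files` at the Albanian portion of OSCAR instead of alb_dataset, keeping the tokenizer and hyperparameters fixed so that any performance differences can be attributed to the corpus.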