Language-based transfer learning approaches for part-of-speech tagging on Saint Petersburg Corpus of Hagiographic texts (SKAT)

Vadim V. Gudkov, Olga V. Mitrenina, Evgenii G. Sokolov, Angelina A. Koval
{"title":"Language-based transfer learning approaches for part-of-speech tagging on Saint Petersburg Corpus of Hagiographic texts (SKAT)","authors":"Vadim V. Gudkov, Olga V. Mitrenina, Evgenii G. Sokolov, Angelina A. Koval","doi":"10.21638/spbu09.2023.205","DOIUrl":null,"url":null,"abstract":"The article describes an experiment about training a part-of-speech tagger using artificial neural networks on the St. Petersburg Corpus of Hagiographic Texts (SKAT), which is being developed at the Department of Mathematical Linguistics of St. Petersburg State University. The corpus includes the texts of 23 manuscripts dating from the 15th–18th centuries with about 190,000 words usages, four of which were labelled manually. The bi-LSTM, distilled RuBERTtiny2 and RuBERT models were used to train a POS tagger. All of them were trained on modern Russian corpora and further fine-tuned to label Old Russian texts using a technique called language transfer. To fine-tune transformer-based language models it was necessary to tokenize the texts using byte pair encoding and map tokens from the original Russian-language tokenizer to the new one based on indices. Then the model was fine-tuned for the token classification task. To fine-tune the model, a tagged subcorpus of three hagiographical texts was used, which included 35,603 tokens and 2,885 sentences. The experiment took into account only the tags of the parts of speech, the classification included seventeen tags, thirteen of which corresponded to parts of speech, and the remaining four marked punctuation marks. To evaluate the quality of the model, the standard metrics F1 and Accuracy were used. According to automatic evaluation metrics, the RuBERT model showed the best result. Most of the errors were related to incorrect generalization of linear position patterns or to the similarity of word forms in both the extreme left and extreme right positions.","PeriodicalId":41205,"journal":{"name":"Vestnik Sankt-Peterburgskogo Universiteta-Yazyk i Literatura","volume":"13 1","pages":"0"},"PeriodicalIF":0.1000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Vestnik Sankt-Peterburgskogo Universiteta-Yazyk i Literatura","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21638/spbu09.2023.205","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"0","JCRName":"HUMANITIES, MULTIDISCIPLINARY","Score":null,"Total":0}

Abstract

The article describes an experiment in training a part-of-speech tagger with artificial neural networks on the St. Petersburg Corpus of Hagiographic Texts (SKAT), which is being developed at the Department of Mathematical Linguistics of St. Petersburg State University. The corpus includes the texts of 23 manuscripts dating from the 15th to the 18th centuries, with about 190,000 word usages; four of the manuscripts were labelled manually. Three models were used to train the POS tagger: a bi-LSTM, the distilled RuBERTtiny2, and RuBERT. All of them were pre-trained on modern Russian corpora and then fine-tuned to label Old Russian texts, a technique known as language transfer. To fine-tune the transformer-based language models, it was necessary to tokenize the texts using byte pair encoding and to map tokens from the original Russian-language tokenizer to the new one by their indices. The models were then fine-tuned for the token classification task on a tagged subcorpus of three hagiographic texts comprising 35,603 tokens in 2,885 sentences. The experiment considered only part-of-speech tags: the classification used seventeen tags, thirteen of which corresponded to parts of speech, while the remaining four marked punctuation. The standard F1 and accuracy metrics were used to evaluate the models. According to the automatic evaluation metrics, the RuBERT model showed the best results. Most errors were related to incorrect generalization of linear position patterns or to the similarity of word forms in the extreme left and extreme right positions.
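To make the tokenizer-mapping step concrete, the sketch below shows one way an index-based vocabulary mapping could be implemented with the Hugging Face tokenizers and transformers libraries. This is not the authors' code: the base checkpoint (cointegrated/rubert-tiny2), the training-file name skat_texts.txt, and the policy of mapping unmatched tokens to [UNK] are all assumptions made for illustration.

```python
# Minimal sketch: train a BPE tokenizer on the corpus texts, then map each
# new token onto a row of the pretrained model's embedding matrix by index.
import torch
from transformers import AutoModel, AutoTokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# 1. Train a byte-pair-encoding tokenizer on the Old Russian texts
#    (file name is hypothetical).
bpe = Tokenizer(BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]"])
bpe.train(files=["skat_texts.txt"], trainer=trainer)

# 2. Build an index map into the original Russian tokenizer's vocabulary:
#    tokens shared with the pretrained vocabulary keep their embedding row,
#    tokens the original tokenizer does not know fall back to [UNK].
old_tok = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
old_emb = model.get_input_embeddings().weight.data

new_vocab = bpe.get_vocab()  # token string -> new index
index_map = torch.full((len(new_vocab),), old_tok.unk_token_id,
                       dtype=torch.long)
for token, new_id in new_vocab.items():
    old_id = old_tok.convert_tokens_to_ids(token)
    if old_id != old_tok.unk_token_id:
        index_map[new_id] = old_id

# 3. Initialise the new embedding table from the mapped pretrained rows.
new_emb = torch.nn.Embedding(len(new_vocab), old_emb.size(1))
new_emb.weight.data.copy_(old_emb[index_map])
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(new_vocab)
```

Mapping by index rather than retraining embeddings from scratch lets the fine-tuned model reuse the pretrained modern-Russian representations for every subword the two vocabularies share.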
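The subsequent token-classification fine-tuning and F1/accuracy evaluation could look roughly as follows, assuming the Hugging Face Trainer API. The checkpoint DeepPavlov/rubert-base-cased, the hyperparameters, the toy one-sentence dataset, and the tag indices are illustrative assumptions rather than the paper's actual setup.

```python
# Minimal sketch: fine-tune a pretrained encoder for 17-tag POS labelling
# and score it with accuracy and F1, as the abstract describes.
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

NUM_TAGS = 17  # 13 part-of-speech tags + 4 punctuation tags
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "DeepPavlov/rubert-base-cased", num_labels=NUM_TAGS)

def encode(batch):
    # Align word-level tags with subword tokens: the first subword of each
    # word keeps the tag, the rest are masked out with -100 (ignored by loss).
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["tags"]):
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            row.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        all_labels.append(row)
    enc["labels"] = all_labels
    return enc

def compute_metrics(p):
    # Accuracy and macro-F1 over real (non-masked) token positions only.
    preds = np.argmax(p.predictions, axis=-1).ravel()
    refs = p.label_ids.ravel()
    mask = refs != -100
    return {"accuracy": accuracy_score(refs[mask], preds[mask]),
            "f1": f1_score(refs[mask], preds[mask], average="macro")}

# Toy one-sentence dataset with made-up tag indices; real training would use
# the 2,885-sentence tagged subcorpus.
data = Dataset.from_dict({"tokens": [["и", "рече", "ему", "."]],
                          "tags": [[12, 4, 7, 13]]})
data = data.map(encode, batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pos-tagger", num_train_epochs=3),
    train_dataset=data,
    eval_dataset=data,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```

Masking continuation subwords with -100 is the usual way to keep one prediction per word form, so the reported F1 and accuracy are computed over word-level tags rather than over BPE fragments.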