语言(历史)变异对东斯拉夫语 Lects Lematisers 表演的影响

Ilia Afanasev, Olga Lyashevskaya, Stefan Rebrikov, Yana Shishkina, Igor Trofimov, Natalia Vlasova
{"title":"语言(历史)变异对东斯拉夫语 Lects Lematisers 表演的影响","authors":"Ilia Afanasev, Olga Lyashevskaya, Stefan Rebrikov, Yana Shishkina, Igor Trofimov, Natalia Vlasova","doi":"10.2478/jazcas-2023-0040","DOIUrl":null,"url":null,"abstract":"Abstract The need to develop tools for historical and regional variations is becoming more urgent in natural language processing. In this paper, we present two candidate systems for lemmatising historical East Slavic lects (Late Old East Slavic and Middle Russian), as well as modern regional East Slavic lects (Belogornoje and Megra): BERT-based end-to-end pipeline with language-specific heuristics and sequence-to-sequence BART-based encoderdecoder. To evaluate their predictions, we use accuracy score and string similarity measures, such as Levenshtein distance. The BERT-based model is more suitable for the regional data, achieving 85% accuracy score, and only 74% on the historical data. BART-based model climbs up to 92.6% accuracy score on the historical data, yet gets only 80% on the regional data. We provide an error analysis and discuss ways to enhance models, such as dictionary lookup and spellchecker.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"20 1","pages":"225 - 233"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The Effect of (Historical) Language Variation on the East Slavic Lects Lematisers Performance\",\"authors\":\"Ilia Afanasev, Olga Lyashevskaya, Stefan Rebrikov, Yana Shishkina, Igor Trofimov, Natalia Vlasova\",\"doi\":\"10.2478/jazcas-2023-0040\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract The need to develop tools for historical and regional variations is becoming more urgent in natural language processing. In this paper, we present two candidate systems for lemmatising historical East Slavic lects (Late Old East Slavic and Middle Russian), as well as modern regional East Slavic lects (Belogornoje and Megra): BERT-based end-to-end pipeline with language-specific heuristics and sequence-to-sequence BART-based encoderdecoder. To evaluate their predictions, we use accuracy score and string similarity measures, such as Levenshtein distance. The BERT-based model is more suitable for the regional data, achieving 85% accuracy score, and only 74% on the historical data. BART-based model climbs up to 92.6% accuracy score on the historical data, yet gets only 80% on the regional data. We provide an error analysis and discuss ways to enhance models, such as dictionary lookup and spellchecker.\",\"PeriodicalId\":262732,\"journal\":{\"name\":\"Journal of Linguistics/Jazykovedný casopis\",\"volume\":\"20 1\",\"pages\":\"225 - 233\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Linguistics/Jazykovedný casopis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2478/jazcas-2023-0040\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Linguistics/Jazykovedný casopis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2478/jazcas-2023-0040","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

摘要 在自然语言处理领域,开发历史和地区变体工具的需求日益迫切。本文提出了两个候选系统,用于对历史上的东斯拉夫语(晚期古东斯拉夫语和中古俄语)以及现代区域性东斯拉夫语(Belogornoje 和 Megra)进行词法化:它们是:基于 BERT 的端到端流水线与特定语言启发式和基于序列到序列 BART 的编码器-解码器。为了评估它们的预测结果,我们使用了准确度得分和字符串相似度量,如列文士坦距离。基于 BERT 的模型更适合区域数据,准确率达到 85%,而历史数据的准确率仅为 74%。基于 BART 的模型在历史数据上的准确率高达 92.6%,但在地区数据上的准确率仅为 80%。我们提供了错误分析,并讨论了增强模型的方法,如词典查询和拼写检查器。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
The Effect of (Historical) Language Variation on the East Slavic Lects Lematisers Performance
Abstract The need to develop tools for historical and regional variations is becoming more urgent in natural language processing. In this paper, we present two candidate systems for lemmatising historical East Slavic lects (Late Old East Slavic and Middle Russian), as well as modern regional East Slavic lects (Belogornoje and Megra): BERT-based end-to-end pipeline with language-specific heuristics and sequence-to-sequence BART-based encoderdecoder. To evaluate their predictions, we use accuracy score and string similarity measures, such as Levenshtein distance. The BERT-based model is more suitable for the regional data, achieving 85% accuracy score, and only 74% on the historical data. BART-based model climbs up to 92.6% accuracy score on the historical data, yet gets only 80% on the regional data. We provide an error analysis and discuss ways to enhance models, such as dictionary lookup and spellchecker.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信