RuGECToR: Rule-Based Neural Network Model for Russian Language Grammatical Error Correction
I. A. Khabutdinov, A. V. Chashchin, A. V. Grabovoy, A. S. Kildyakov, U. V. Chekhovich
Programming and Computer Software, published 2024-07-30
DOI: https://doi.org/10.1134/s0361768824700129
Abstract
Grammatical error correction is one of the core natural language processing tasks. Presently, the open-source state-of-the-art sequence-tagging model for English is GECToR. For Russian, this problem does not have equally effective solutions due to the lack of annotated datasets, which motivated the current research. In this paper, we describe the process of creating a synthetic dataset and training a model on it. We adapt the GECToR architecture to the Russian language and call the result RuGECToR. This architecture is chosen because, unlike the sequence-to-sequence approach, it is easy to interpret and does not require a lot of training data. The aim is to train the model so that it generalizes the morphological properties of the language rather than adapting to a specific training sample. The presented model achieves an \(F_{0.5}\) score of 82.5 on synthetic data and 22.2 on the RULEC dataset, which was not used at the training stage.
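For readers unfamiliar with the metric, the following is a minimal sketch of how an \(F_{0.5}\) score is computed from edit-level counts. It uses the standard definition \(F_\beta = (1+\beta^2)\,P\,R / (\beta^2 P + R)\) with \(\beta = 0.5\), which weights precision more heavily than recall. The function name `f_beta` and the counts in the example are illustrative assumptions, not values or code from the paper, and the paper's own evaluation tooling may differ.

```python
def f_beta(true_positives: int, false_positives: int,
           false_negatives: int, beta: float = 0.5) -> float:
    """Return F-beta computed from counts of proposed vs. reference edits.

    With beta < 1, precision counts for more than recall.
    All names and example numbers here are illustrative assumptions.
    """
    proposed = true_positives + false_positives   # edits the system made
    expected = true_positives + false_negatives   # edits in the reference
    precision = true_positives / proposed if proposed else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: 6 correct edits, 2 spurious edits, 6 missed edits.
# Precision is 0.75 and recall is 0.5.
score_f05 = f_beta(6, 2, 6, beta=0.5)
score_f1 = f_beta(6, 2, 6, beta=1.0)
print(f"F0.5 = {score_f05:.4f}, F1 = {score_f1:.4f}")
```

In this made-up example the system is more precise than it is complete, so its \(F_{0.5}\) is higher than its \(F_1\). This is the usual reason the metric is preferred for error correction: a wrong correction is treated as more costly than a missed one.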
About the journal:
Programming and Computer Software is a peer-reviewed journal devoted to problems in all areas of computer science: operating systems, compiler technology, software engineering, artificial intelligence, etc.