TECHNOLOGY FOR GRAMMATICAL ERRORS CORRECTION IN UKRAINIAN TEXT CONTENT BASED ON MACHINE LEARNING METHODS

IF 0.3 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Radio Electronics Computer Science Control Pub Date : 2023-02-27 DOI:10.15588/1607-3274-2023-1-12

N. Kholodna, V. Vysotska


Citations: 0

Abstract

Context. Most research on grammatical and stylistic error correction focuses on English-language text. Thanks to the availability of large data sets, the accuracy of English grammar correction has increased significantly. Unfortunately, there are few studies on other languages. Systems for English are under constant development and now actively use machine learning methods: classification (sequence tagging) and machine translation. Building a high-quality machine learning model for correcting grammatical and stylistic errors in texts written in morphologically complex languages requires a large amount of parallel or manually labelled data. Manual annotation demands considerable effort from professional linguists, which makes creating text corpora a time- and resource-consuming process, especially for morphologically rich languages such as Ukrainian.

Objective. The objective of the study is to develop a technology for correcting errors in Ukrainian-language texts based on machine learning methods, using a small set of annotated parallel data.

Method. Machine learning algorithms were selected for a system that corrects errors in Ukrainian-language texts using an optimal pipeline, which includes pre-processing and selecting text content and generating features from small annotated corpora. Using a neural network with a new architecture, reviewing state-of-the-art methods, and comparing different pipeline stages make it possible to determine which combination of them yields a high-quality error correction model for Ukrainian-language texts.

Results. A machine learning model for error correction in Ukrainian-language texts has been developed. A universal scheme for creating an error correction system for different languages is proposed. According to the results, the neural network can correct simple sentences written in Ukrainian. However, a full-fledged system will also require dictionary-based spell-checking and rule-based checks, both simple rules and rules based on dependency parsing results or other features. The pre-trained neural translation model mT5 performs best among the three models. To save computing resources, a pre-trained BERT-type neural network can also be used as both the encoder and the decoder. Such a network has half as many parameters as the other pre-trained machine translation models and shows satisfactory results in correcting grammatical and stylistic errors.

Conclusions. The created model shows excellent classification results on test data. The calculated machine translation quality metrics allow only a partial comparison of the models, since most of the words and phrases in the original and corrected sentences are the same. The best values for both BLEU (0.908) and METEOR (0.956) are obtained for mT5, which is consistent with the case study, in which this network produced the most accurate corrections without changing the original meaning of the sentence. M2M100 has a higher BLEU score (0.847) than the “Ukrainian Roberta” Encoder-Decoder (0.697), and also a higher METEOR score (0.925 versus 0.876). However, a subjective evaluation of the corrected examples shows that M2M100 does a much worse job than the other two models.
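The remark that translation metrics allow only a partial comparison can be illustrated with a small sketch. The code below is a simplified, single-reference sentence-level BLEU written for illustration only; it is not the authors' evaluation code, and the function names (`ngrams`, `bleu`) and the example sentences are assumptions rather than data from the study. It shows that an uncorrected sentence with a single agreement error already scores highly against its corrected reference, because almost all n-grams are shared.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter (multiset) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference sentence BLEU without smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precision_sum += math.log(matched / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision_sum / max_n)


# Hypothetical example pair (not taken from the paper's corpus):
# the last word has the wrong case ending in the uncorrected sentence.
reference = "Я вчора ходив до бібліотеки у центрі нашого міста"
uncorrected = "Я вчора ходив до бібліотеки у центрі нашого місто"

score_uncorrected = bleu(uncorrected, reference)
score_perfect = bleu(reference, reference)
print(f"uncorrected input vs. reference: {score_uncorrected:.3f}")
print(f"perfect correction vs. reference: {score_perfect:.3f}")
```

Under these assumptions, a system that copies its input unchanged would still receive a high score, so differences of a few hundredths between models say little on their own about correction quality. This is one reason to pair such metrics with a manual inspection of corrected examples, as the abstract describes.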




Source journal

Radio Electronics Computer Science Control (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

Self-citation rate: 20.00%

Review time: 12 weeks