Language Models for Texts Preprocessing in Machine Translation

IF 0.5 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date : 2025-10-06 DOI:10.3103/S0005105525700645

A. V. Mylnikova, L. A. Mylnikov


Citations: 0

Abstract



Language Models for Texts Preprocessing in Machine Translation

This paper examines a model that uses text skeleton structures derived from syntactic parsing to preprocess text corpora before they are passed to neural machine translation (MT) models, with the aim of improving translation quality. The proposed model is based on part-of-speech (POS) tagging and syntactic parsing and is implemented with a BERT-based language model and a set of rules. Using a limited POS tagging dataset, the paper describes how data are prepared for training the model and how its performance can be improved. POS tags are used to derive the syntactic parse, determine the sentence type, and change the word order according to predefined rules. Applying the proposed model together with the Google and Yandex MT systems improved MT quality metrics by 0.1–0.23, as measured by BLEU and TER, for the Russian–English and German–English language pairs.
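The abstract does not list the actual rules or tag set, so the following is only a minimal sketch of the general idea: reduce a POS-tagged sentence to its tag skeleton, classify the sentence type, and apply a reordering rule before the text is sent to an MT system. The tag names (Universal POS style), the `sentence_type` heuristics, and the single `verb_final` rule are all hypothetical illustrations, and the tags are supplied by hand rather than predicted by a BERT-based tagger.

```python
from typing import Callable, Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag); tag names are an assumption

SUBJECT_TAGS = {"PRON", "PROPN", "NOUN"}  # crude proxy for a subject
VERB_TAGS = {"VERB", "AUX"}


def skeleton(tokens: List[Token]) -> List[str]:
    """Reduce a tagged sentence to its sequence of POS tags."""
    return [tag for _, tag in tokens]


def sentence_type(tokens: List[Token]) -> str:
    """Toy classifier (hypothetical): uses final punctuation and the last content tag."""
    content = [t for t in tokens if t[1] != "PUNCT"]
    if tokens and tokens[-1][0] == "?":
        return "interrogative"
    if content and content[-1][1] in VERB_TAGS:
        return "verb_final"
    return "declarative"


def move_final_verb_after_subject(tokens: List[Token]) -> List[Token]:
    """Hypothetical rule: move a clause-final verb to directly after the first subject-like token."""
    verb_idx = max(i for i, (_, tag) in enumerate(tokens) if tag != "PUNCT")
    subj_idx: Optional[int] = next(
        (i for i, (_, tag) in enumerate(tokens) if tag in SUBJECT_TAGS), None
    )
    if subj_idx is None or subj_idx >= verb_idx:
        return list(tokens)  # nothing sensible to reorder
    reordered = list(tokens)
    verb = reordered.pop(verb_idx)
    reordered.insert(subj_idx + 1, verb)
    return reordered


# Sentence type -> reordering rule; types without a rule pass through unchanged.
RULES: Dict[str, Callable[[List[Token]], List[Token]]] = {
    "verb_final": move_final_verb_after_subject,
}


def preprocess(tokens: List[Token]) -> List[Token]:
    """Apply the rule registered for the sentence type, if any."""
    rule = RULES.get(sentence_type(tokens))
    return rule(tokens) if rule else list(tokens)


if __name__ == "__main__":
    german = [("er", "PRON"), ("das", "DET"), ("Buch", "NOUN"),
              ("liest", "VERB"), (".", "PUNCT")]
    print(skeleton(german))
    print(" ".join(word for word, _ in preprocess(german)))
```

Keeping the rules in a dictionary keyed by sentence type means new rules can be added without touching `preprocess`; in a real pipeline the hand-written tags would be replaced by the output of a trained tagger.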


Source journal

AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Self-citation rate

40.00%

Articles published

About the journal: Automatic Documentation and Mathematical Linguistics is an international peer-reviewed journal that covers all aspects of the automation of information processes and systems, as well as algorithms and methods for automatic language analysis. Emphasis is on the practical applications of new technologies and techniques for information analysis and processing.