Language Models for Texts Preprocessing in Machine Translation

Impact factor: 0.5; JCR: Q4 (Computer Science, Information Systems)
A. V. Mylnikova, L. A. Mylnikov
DOI: 10.3103/S0005105525700645
Journal: AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS, 59(4), pages 256–268
Publication date: 2025-10-06
Publication type: Journal Article
Open access: no
URL: https://link.springer.com/article/10.3103/S0005105525700645
Citations: 0

Abstract


This paper examines a model that uses text skeleton structures derived from syntactic parsing to preprocess text corpora before they are passed to neural machine translation (MT) models, with the aim of improving translation quality. The proposed model is based on part-of-speech (POS) tagging and syntactic parsing and is implemented with a BERT-based language model and a set of rules. Using a limited POS tagging dataset, the paper describes how data are prepared for training the model and how its performance can be improved. POS tags are used to obtain a syntactic parse, determine the sentence type, and change the word order according to predefined rules. Applying the proposed model together with the Google and Yandex MT models increased MT quality metrics by 0.1–0.23 according to BLEU and TER for the Russian–English and German–English language pairs.
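
The abstract does not give the actual rules or tag set, so the following is only a minimal sketch of the general idea: classify a POS-tagged sentence and apply one predefined word-order rule before handing the text to an MT system. Everything in it is an assumption made for illustration: the Universal POS tag names, the function names (`sentence_type`, `move_final_verb`, `preprocess`), and the single rule that moves a clause-final verb next to the first subject-like token. The tags are supplied by hand here; in the paper they would come from the BERT-based tagger.

```python
from typing import List, Tuple

# A tagged sentence: (token, POS tag) pairs. Universal POS tag names are
# an assumption; the abstract does not say which tag set the paper uses.
Tagged = List[Tuple[str, str]]

SUBJECT_TAGS = {"PRON", "NOUN", "PROPN"}  # hypothetical "subject-like" tags


def sentence_type(tagged: Tagged) -> str:
    """Classify a sentence by its final token (a deliberately crude rule)."""
    last = tagged[-1][0] if tagged else ""
    if last == "?":
        return "interrogative"
    if last == "!":
        return "exclamatory"
    return "declarative"


def move_final_verb(tagged: Tagged) -> Tagged:
    """Hypothetical rule: move a clause-final verb to just after the
    first subject-like token, leaving trailing punctuation in place."""
    end = len(tagged)
    while end > 0 and tagged[end - 1][1] == "PUNCT":
        end -= 1  # skip trailing punctuation
    if end == 0 or tagged[end - 1][1] != "VERB":
        return list(tagged)  # rule does not apply
    subj = next(
        (i for i in range(end - 1) if tagged[i][1] in SUBJECT_TAGS), None
    )
    if subj is None:
        return list(tagged)  # no subject-like token found
    body, verb = tagged[: end - 1], tagged[end - 1]
    return body[: subj + 1] + [verb] + body[subj + 1 :] + tagged[end:]


def preprocess(tagged: Tagged) -> str:
    """Apply the reordering rule to declarative sentences only."""
    if sentence_type(tagged) == "declarative":
        tagged = move_final_verb(tagged)
    return " ".join(token for token, _ in tagged)


example = [
    ("dass", "SCONJ"), ("er", "PRON"), ("das", "DET"),
    ("Buch", "NOUN"), ("liest", "VERB"), (".", "PUNCT"),
]
print(preprocess(example))  # dass er liest das Buch .
```

The reordered string would then be sent to the MT system in place of the original sentence. A real rule set would work on a full syntactic parse rather than on flat tags, which is why this sketch should be read as an illustration of the pipeline shape only.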

Source journal
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS (Computer Science, Information Systems)
Self-citation rate: 40.00%
Articles published: 18
About the journal: Automatic Documentation and Mathematical Linguistics is an international peer-reviewed journal that covers all aspects of the automation of information processes and systems, as well as algorithms and methods for automatic language analysis. Emphasis is on the practical applications of new technologies and techniques for information analysis and processing.