Reordering of Source Side for a Factored English to Manipuri SMT System

Impact factor: 0.8 | JCR quartile: Q4 (Engineering, Electrical & Electronic)
Indika Maibam, Bipul Syam Purkayastha
Journal: International Journal of Electrical and Computer Engineering Systems
Published: 2023-03-28
DOI: 10.32985/ijeces.14.3.6 (https://doi.org/10.32985/ijeces.14.3.6)
Citations: 0

Abstract
Reordering of Source Side for a Factored English to Manipuri SMT System
Translation between similar languages with massive parallel corpora is handled readily by large-scale systems using either Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Translation between low-resource language pairs with linguistic divergence, however, has always been a challenge. We consider one such pair, English-Manipuri, which shows linguistic divergence and belongs to the low-resource category. For such language pairs, SMT is better regarded than NMT. However, SMT’s more prominent phrase-based model translates groupings of surface word forms treated as phrases. Without any linguistic knowledge, it therefore fails to learn a proper mapping between source and target language symbols. Our model adopts a factored SMT model (FSMT3*) that uses a part-of-speech (POS) tag as a factor to incorporate linguistic information about the languages, followed by hand-coded reordering. Reordering the source sentences makes them more similar to the target language, allowing a better mapping between source and target symbols. The reordering also converts long-distance reordering problems into monotone reordering, which SMT models handle better, thereby reducing the load at decoding time. Additionally, we find that adding POS feature data enhances the system’s precision. Experimental results using automatic evaluation metrics show that our model improves over the phrase-based model and over other factored models that use the lexicalised Moses reordering options. Compared with the factored model with lexicalised phrase reordering (FSMT2), our FSMT3* model increases the automatic translation scores by 11.05% (Bilingual Evaluation Understudy, BLEU), 5.46% (F1), 9.35% (Precision), and 2.56% (Recall).
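To make the idea of POS-driven source-side reordering concrete, here is a minimal sketch. It is not the authors' rule set, which the abstract does not spell out. It assumes a single toy rule (move every verb to the end of its clause, so English subject-verb-object order approaches a verb-final order), Penn Treebank style tags, and a hypothetical set of clause-boundary tags. The function names `reorder_verbs_to_clause_end` and `to_factored` are illustrative only. The output uses the `word|POS` notation in which factored Moses input is usually written.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface word, POS tag)

# Assumption: Penn Treebank style verb tags (plus modals).
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}
# Assumption: tags treated as clause boundaries in this toy rule.
CLAUSE_END_TAGS = {".", ",", "CC"}


def reorder_verbs_to_clause_end(tokens: List[Token]) -> List[Token]:
    """Toy hand-coded rule: hold back verbs and emit them at the clause end."""
    out: List[Token] = []
    held: List[Token] = []
    for word, tag in tokens:
        if tag in VERB_TAGS:
            held.append((word, tag))      # postpone the verb
        elif tag in CLAUSE_END_TAGS:
            out.extend(held)              # flush verbs before the boundary
            held = []
            out.append((word, tag))
        else:
            out.append((word, tag))
    out.extend(held)                      # sentence ended without a boundary tag
    return out


def to_factored(tokens: List[Token]) -> str:
    """Render tokens as 'word|POS' factors separated by spaces."""
    return " ".join(f"{word}|{tag}" for word, tag in tokens)


if __name__ == "__main__":
    # Pre-tagged input, hard-coded so no external tagger is needed.
    sentence = [("the", "DT"), ("boy", "NN"), ("eats", "VBZ"),
                ("an", "DT"), ("apple", "NN"), (".", ".")]
    print(to_factored(reorder_verbs_to_clause_end(sentence)))
    # prints: the|DT boy|NN an|DT apple|NN eats|VBZ .|.
```

After such preprocessing the reordered English side can be aligned with the target side largely monotonically, which is the effect the abstract attributes to source-side reordering. A real system would need a much richer rule set than this single rule.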
Source journal
CiteScore: 1.20
Self-citation rate: 11.80%
Articles published: 69
Journal description: The International Journal of Electrical and Computer Engineering Systems publishes original research in the form of full papers, case studies, reviews and surveys. It covers theory and application of electrical and computer engineering, synergy of computer systems and computational methods with electrical and electronic systems, as well as interdisciplinary research. Topics: power systems; renewable electricity production; power electronics; electrical drives; industrial electronics; communication systems; advanced modulation techniques; RFID devices and systems; signal and data processing; image processing; multimedia systems; microelectronics; instrumentation and measurement; control systems; robotics; modeling and simulation; modern computer architectures; computer networks; embedded systems; high-performance computing; engineering education; parallel and distributed computer systems; human-computer systems; intelligent systems; multi-agent and holonic systems; real-time systems; software engineering; Internet and web applications and systems; applications of computer systems in engineering and related disciplines; mathematical models of engineering systems; engineering management.