Integration of Bilingual Lists for Domain-Specific Statistical Machine Translation for Sinhala-Tamil

Fathima Farhath, Surangika Ranathunga, Sanath Jayasena, G. Dias
2018 Moratuwa Engineering Research Conference (MERCon), pp. 538-543. Published 2018-05-01. DOI: 10.1109/MERCON.2018.8421901 (https://doi.org/10.1109/MERCON.2018.8421901)
Citations: 12

Abstract

The availability of quality parallel data is a major requirement for building a reasonably well-performing statistical machine translation (SMT) system. Developing a decent SMT system for a low-resourced language pair such as Sinhala and Tamil, which lacks a large parallel corpus, is therefore rather challenging. Past research on other language pairs has shown that different terminology / bilingual list integration methodologies can improve the quality of SMT systems, for domain-specific SMT in particular. In this paper, we explore whether this can be effective for Sinhala-Tamil machine translation in the domain of official government documents. We evaluate the impact of three types of bilingual lists, namely a list of government organizations and official designations, a glossary related to government administration and operations, and a general bilingual dictionary, using four different methodologies (three static and one dynamic). Of the four, one methodology gave notable improvements over the baseline for all three types of lists.