Language localisation of Tamil using Statistical Machine Translation

2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer). Pub Date: 2015-08-01. DOI: 10.1109/ICTER.2015.7377677

Y. Achchuthan, Kengatharaiyer Sarveswaran

Citations: 2

Abstract

Language localisation, in which the strings in an interface and its documentation are translated into a new language, is a rigorous and time-consuming task. Machine translation systems, specifically Statistical Machine Translation (SMT) systems, have been used successfully for many language pairs. A few SMT systems have been developed for the generic domain; however, no systems are yet available to aid localisation. This research proposes a new methodology in which language localisation can be done using SMT, and identifies suitable parameters on which an SMT-aided localisation system could be built. A pilot system was developed and is outlined in this paper. A RESTful API has also been developed to facilitate localisation in remote tools. Several open source software packages have already been translated into Tamil. The translated English-Tamil pairs were collected from various language resource files, then cleaned, tokenised and used to train the system. Another similar system was prepared with data from the generic domain apart from the collected technical data. The systems were trained with 2-gram, 3-gram and 4-gram language models created using two different language modelling tools, namely KenLM and IRSTLM. The results were then evaluated using the BLEU algorithm, and appropriate parameters for setting up an SMT system for localisation were identified from the evaluation. The results show that training a system with a 3-gram language model is sufficient, and that the modified BLEU algorithm gives a better understanding of the results compared to its original implementation. Further, KenLM was found to perform better than IRSTLM in terms of both accuracy of results and speed of execution.
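
The BLEU evaluation mentioned in the abstract can be illustrated with a minimal sketch of the standard sentence-level score: clipped n-gram precisions combined by a geometric mean and scaled by a brevity penalty. This is the textbook formulation only; the abstract does not describe the authors' modified BLEU, so nothing here should be read as their variant. The function names `ngrams` and `bleu`, the single-reference setup, the absence of smoothing, and the English example tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference (illustrative only)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            # Without smoothing, one empty precision makes the score zero.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    hypothesis = "save the file now please".split()
    reference = "save the file as copy".split()
    print(bleu(hypothesis, reference, max_n=2))
```

Note that `max_n` here is the highest n-gram order used by the metric; it is separate from the 2-gram, 3-gram and 4-gram language model orders compared in the paper. Short interface strings often contain fewer than four tokens, in which case the unsmoothed 4-gram score is zero, so a practical evaluation would likely score at corpus level or apply smoothing.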
