Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages

International Conference on Language, Data, and Knowledge. Pub Date: 2019-05-20. DOI: 10.4230/OASIcs.LDK.2019.6

Bharathi Raja Chakravarthi, Mihael Arcan, John P. McCrae


Citations: 51

Abstract



Under-resourced languages are a significant challenge for statistical approaches to machine translation, and it has recently been shown that using training data from closely related languages can improve machine translation quality for these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes it difficult to take advantage of these similarities. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e., the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration into the Latin script with fine-grained IPA transcription. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show that transliteration improves BLEU, METEOR, and chrF scores, and that transliteration into the Latin script outperforms the fine-grained IPA transcription.
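The idea of mapping several native scripts onto one shared Latin representation can be illustrated with a small Python sketch. This is not the authors' tool or their transliteration scheme: the function name `to_latin`, the Latin spellings, and the very small letter tables are all hypothetical and cover only a handful of characters. The sketch relies on the fact that the Tamil, Telugu, and Kannada Unicode blocks place corresponding letters at the same offset from the start of each block.

```python
# Toy coarse-grained transliteration into the Latin script.
# All tables below are illustrative assumptions, not the paper's mapping.

# Start of each script's 128-code-point Unicode block.
BLOCK_BASES = {"tamil": 0x0B80, "telugu": 0x0C00, "kannada": 0x0C80}

# Offsets within a block, shared by the three scripts.
CONSONANTS = {0x15: "k", 0x24: "t", 0x28: "n", 0x2A: "p",
              0x2E: "m", 0x30: "r", 0x32: "l"}
VOWELS = {0x05: "a", 0x06: "aa", 0x07: "i", 0x09: "u"}   # independent vowels
VOWEL_SIGNS = {0x3E: "aa", 0x3F: "i", 0x41: "u"}          # dependent signs
VIRAMA = 0x4D                                             # suppresses inherent 'a'


def block_offset(ch):
    """Return the offset of ch inside a known script block, or None."""
    cp = ord(ch)
    for base in BLOCK_BASES.values():
        if base <= cp < base + 0x80:
            return cp - base
    return None


def to_latin(text):
    """Map the covered letters to Latin; pass everything else through."""
    out = []
    pending_a = False  # True while the last consonant still carries inherent 'a'
    for ch in text:
        offset = block_offset(ch)
        if offset in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[offset])
            pending_a = True
        elif offset in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[offset])  # replaces the inherent vowel
            pending_a = False
        elif offset == VIRAMA:
            pending_a = False                # bare consonant, no vowel
        else:
            if pending_a:
                out.append("a")
            pending_a = False
            out.append(VOWELS.get(offset, ch))  # unmapped characters unchanged
    if pending_a:
        out.append("a")
    return "".join(out)


if __name__ == "__main__":
    print(to_latin("\u0bae\u0bb0\u0bae\u0bcd"))      # Tamil input
    print(to_latin("\u0cae\u0cb0"))                  # Kannada input
    print(to_latin("\u0c2e\u0c28\u0c2e\u0c41"))      # Telugu input
```

After such a mapping, strings from the three scripts share one character inventory, so a translation system can see overlapping subword units across languages. A finer-grained IPA mapping would replace the Latin strings in the tables with phonetic symbols and would typically distinguish more sounds.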
