LSTM-Based Machine Translation for Madurese-Indonesian

Danang Arbian Sulistyo
Journal of Applied Data Sciences, published 2023-09-15
DOI: https://doi.org/10.47738/jads.v4i3.113

Abstract

Madurese is one of Indonesia's regional languages, spoken predominantly in East Java and on Madura Island in particular. Its use as a daily language has declined significantly because of a language shift among children and adolescents, driven in part by concerns about prestige and by the difficulty of learning Madurese. The scarcity of research and scholarly publications on Madurese further reduces literacy in the language. This research focuses on building a Madurese-to-Indonesian machine translation system to help maintain and preserve the Madurese language so that it can be learned through digital media. The study uses the latest Madurese-Indonesian dataset: a corpus of 30,000 Madurese-Indonesian sentence pairs taken from the online Bible. The corpus was built by scraping online Bible pages from the bilingual Indonesian and Madurese Bible. The text was then processed manually to align the scraped results for the two languages, followed by normalization and tokenization to remove non-printable characters and punctuation from the corpus. For neural machine translation (NMT), the study connected an RNN encoder to the RNN decoder of the language model; training and testing used a sequential model with LSTM, and BLEU was used to assess the accuracy of the translations. The model used a softmax function with the Adam optimizer, along with additional settings that included 128 layers in the training process and an added Dropout layer. Averaged over five tests, the scores were 0.798068 for BLEU-1, 0.680932 for BLEU-2, 0.623489 for BLEU-3, and 0.523546 for BLEU-4. Given the linguistic differences between Madurese and Indonesian, the study suggests this may be the best approach for machine translation of Indonesian to Madurese.
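The cleaning step and the BLEU-1 to BLEU-4 scoring mentioned in the abstract can be illustrated with a minimal sketch. It is not the study's code: the function names `clean_sentence` and `cumulative_bleu` are hypothetical, NFKC normalization and lowercasing are assumptions, and the score is a plain single-reference, sentence-level cumulative BLEU without smoothing, whereas the paper may have used a corpus-level library implementation. The example sentences are illustrative English, not taken from the corpus.

```python
import math
import string
import unicodedata
from collections import Counter


def clean_sentence(text):
    """Normalize a sentence, drop non-printable characters and punctuation, and tokenize."""
    # Assumption: NFKC normalization and lowercasing; the abstract does not specify either.
    text = unicodedata.normalize("NFKC", text).lower()
    # Collapse tabs and newlines into single spaces so words are not glued together.
    text = " ".join(text.split())
    kept = [ch for ch in text if ch.isprintable() and ch not in string.punctuation]
    return "".join(kept).split()


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cumulative_bleu(reference, candidate, max_n):
    """Cumulative BLEU up to max_n with uniform weights and a brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    ref_len, cand_len = len(reference), len(candidate)
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = clean_sentence("The quick brown fox jumps over the lazy dog.")
candidate = clean_sentence("The quick brown fox jumped over the lazy dog!")
for n in range(1, 5):
    print(f"BLEU-{n}: {cumulative_bleu(reference, candidate, n):.4f}")
```

Because each higher-order score is a geometric mean that adds a stricter n-gram precision, the values typically fall from BLEU-1 to BLEU-4, which matches the pattern of the averages reported above.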