Malay Manuscripts Transliteration Using Statistical Machine Translation (SMT)

Sitti Munirah Abdul Razak, Muhamad Sadry Abu Seman, Wan Ali, Wan Yusoff Wan, Noor Hasrul Nizan, Mohammad Noor
Published in: 2019 1st International Conference on Artificial Intelligence and Data Sciences (AiDAS), 2019-09-01
DOI: 10.1109/AiDAS47888.2019.8970867 (https://doi.org/10.1109/AiDAS47888.2019.8970867)
Citations: 1

Abstract

Natural Language Processing (NLP) is a vital field of artificial intelligence that automates the study of human language. However, Malay manuscripts (MM) written in old Jawi have received limited attention in this field. Moreover, most studies that connect MM with NLP have focused on rule-based machine transliteration (RBMT). The objective of this study is therefore to propose a statistical approach to transliterating Malay manuscript content from old Jawi to modern Jawi, using Phrase-Based Statistical Machine Translation (PBSMT) as the model. The transliteration output was evaluated with a quality score based on Word Error Rate (WER). In addition, the issues previously encountered by the rule-based approach, namely vowel limitation and homographs, reduplication, letter errors, and the combination of multiple words, were observed in the implementation. The study used an exploratory research strategy and a mixed-method design. The data were extracted from a manuscript titled Bidāyat al-Mubtadī bi-Fālillah al-Muhdī. After the WER score was computed, the related issues were identified and assessed. The research found that the quality score of PBSMT for old Jawi to modern Jawi transliteration was high in terms of WER, and that PBSMT generally addressed the issues of the rule-based approach, with the exception of homographs. The research is limited in that it considers only PBSMT among SMT approaches. Moreover, the corpus was limited to a single manuscript, whereas SMT performance depends on corpus size. Nevertheless, the research widens NLP coverage of Malay, an under-resourced language, in both its old and modern Jawi forms. To the best of the researchers' knowledge, it is also the first to apply an SMT (PBSMT) approach to old Jawi transliteration. Most importantly, the study contributes to the study of Malay manuscripts.
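
The abstract does not say how WER was computed. As an illustration only, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the system output and a reference, divided by the number of reference words. The sketch below assumes whitespace tokenization; the function name `word_error_rate` and the placeholder tokens are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace. Real Jawi text may
    need script-specific normalization that this sketch does not attempt.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for modern Jawi words (hypothetical data):
# one substitution (w2 -> wX) and one deletion (w4) over 4 reference words.
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3"))  # 0.5
```

Under this definition a lower WER means a closer match to the reference, so a score reported as "high quality" corresponds to a low error rate.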