Improve the automatic transliteration from Nôm scripts into Vietnamese National scripts by integrating Sino – Vietnamese knowledge into Statistical Machine Translation
Authors: Lam H. Thai, Long H. B. Nguyen, Dinh Dien
Published in: Proceedings of the 4th International Conference on Information Technology and Computer Communications, 2022-06-23
DOI: https://doi.org/10.1145/3548636.3548647
Citations: 1
Abstract
Nôm scripts (chữ Nôm) are ancient Vietnamese scripts that were widely used in Vietnam from the 10th century to the early 20th century. Several systems for automatic transliteration from Nôm scripts (NS) into the Vietnamese National script (chữ Quốc ngữ, QN) have been developed to help modern Vietnamese readers draw on the lessons and knowledge of previous generations and to preserve the Sino-Nom heritage. However, these systems still do not perform well in many domains outside literature. Our research continues to employ Statistical Machine Translation (SMT) but expands the dataset to 10 domains. We also analyze how Chinese scripts with Sino-Vietnamese readings affect NS-to-QN transliteration and integrate this knowledge into our transliteration model. Experimental results show that our approach helps the model reach a BLEU score of 94.04, an increase of 8.63 BLEU in the genealogical domain and 0.31 BLEU for the general model.
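The BLEU figures above can be read more concretely with a small worked example. The following is a minimal sketch of corpus-level BLEU with a single reference per sentence, not the authors' evaluation pipeline. It assumes syllable-level (whitespace) tokenization, uniform weights over 1- to 4-grams, and no smoothing. The function names `ngrams` and `corpus_bleu` and the sample line are illustrative and are not taken from the paper or its dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions (not from the paper): uniform n-gram weights, no smoothing,
    and inputs that are already tokenized into syllables.
    """
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Illustrative QN line, tokenized by syllable; the hypothesis has one wrong tone mark.
reference = "trăm năm trong cõi người ta".split()
hypothesis = "trăm năm trong cõi người tà".split()

perfect = corpus_bleu([reference], [reference])
one_error = corpus_bleu([hypothesis], [reference])
print(f"perfect: {perfect:.2f}, one wrong syllable: {one_error:.2f}")
```

In this toy case a single wrong final syllable in a six-syllable line lowers the n-gram precisions to 5/6, 4/5, 3/4 and 2/3, which shows why BLEU on short lines is sensitive to individual errors. Scores reported over a large test set, such as those in the abstract, are aggregated over many lines and are not directly comparable to this one-sentence illustration.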