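Improve the automatic transliteration from Nôm scripts into Vietnamese National scripts by integrating Sino-Vietnamese knowledge into Statistical Machine Translation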

Proceedings of the 4th International Conference on Information Technology and Computer Communications. Pub Date: 2022-06-23. DOI: 10.1145/3548636.3548647

Lam H. Thai, Long H. B. Nguyen, Dinh Dien


Citations: 1

Abstract


Improve the automatic transliteration from Nôm scripts into Vietnamese National scripts by integrating Sino – Vietnamese knowledge into Statistical Machine Translation

Nôm script (chữ Nôm) is an ancient Vietnamese script that was widely used in Vietnam from the 10th century to the early 20th century. Several systems for automatic transliteration from Nôm script (NS) into the Vietnamese National script (chữ Quốc ngữ, QN) have been developed to help modern Vietnamese readers draw on the lessons and knowledge of previous generations by preserving the Sino-Nom heritage. However, these systems still perform poorly in most domains other than Literature. Our research continues to employ Statistical Machine Translation (SMT) but expands the dataset to 10 domains. We also analyze the impact of Chinese characters with Sino-Vietnamese readings on Nôm-to-National-script transliteration and integrate this knowledge into our transliteration model. Experimental results show that our approach helps the model reach a BLEU score of 94.04, an increase of 8.63 BLEU in the genealogical domain and 0.31 BLEU in the general model.
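The results above are reported in BLEU. As a rough illustration of what that metric measures, the following is a minimal sketch of corpus-level BLEU with one reference per output. It assumes syllable-level tokenization (one token per Quốc ngữ syllable) and no smoothing; the abstract does not state which BLEU tool or tokenization the authors used, so the function names and the sample line are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption: inputs are already tokenized (here, one token per syllable).
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            total[n - 1] += sum(hyp_counts.values())
            # Counter intersection clips each count by the reference count.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Illustrative data only: a reference line and an output with one wrong syllable.
reference = "trăm năm trong cõi người ta".split()
hypothesis = "trăm năm trong cõi người tá".split()
print(round(corpus_bleu([hypothesis], [reference]), 2))
```

Because BLEU multiplies n-gram precisions up to length 4, a single wrong syllable in a short line lowers several n-gram counts at once, which is one reason small per-syllable error rates can still move the score noticeably.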
