{"title":"越南文文本规范化的混合方法","authors":"Nguyen Thi Thu Trang, Dang Xuan Bach, N. X. Tung","doi":"10.1145/3342827.3342851","DOIUrl":null,"url":null,"abstract":"This paper presents a hybrid method for normalizing written text often found on newspapers to its spoken form. To normalize raw text with a number of non-standard words (NSWs), a two-step model is proposed. The first step involves classifying NSWs into different categories using Random Forest. The latter one is to expand them, depending on their NSW types, into pronounceable syllables using a hybrid method. Most of numeric types can be expanded by well-defined rules while most of alphabetic ones must be expanded by a deep learning (i.e. sequence-to-sequence) model and a post adjustment. The experiment on a Vietnamese corpus with proposed NSW categories shows that the most ambiguous cases of the classification model are for abbreviation and read-as-sequence types, hence combined into one category for the latter expansion with more complex model and better context. The classification model gives an enhanced result of 99.20% with the category combination and the feature optimization. In the expansion, the sequence-to-sequence model shows a good result of 96.53% for abbreviations and 96.25% for loanwords with a post-adjustment for some completely wrong cases. This model can predict effectively the expansions of abbreviations in context.","PeriodicalId":254461,"journal":{"name":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"A Hybrid Method for Vietnamese Text Normalization\",\"authors\":\"Nguyen Thi Thu Trang, Dang Xuan Bach, N. X. Tung\",\"doi\":\"10.1145/3342827.3342851\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a hybrid method for normalizing written text often found on newspapers to its spoken form. To normalize raw text with a number of non-standard words (NSWs), a two-step model is proposed. The first step involves classifying NSWs into different categories using Random Forest. The latter one is to expand them, depending on their NSW types, into pronounceable syllables using a hybrid method. Most of numeric types can be expanded by well-defined rules while most of alphabetic ones must be expanded by a deep learning (i.e. sequence-to-sequence) model and a post adjustment. The experiment on a Vietnamese corpus with proposed NSW categories shows that the most ambiguous cases of the classification model are for abbreviation and read-as-sequence types, hence combined into one category for the latter expansion with more complex model and better context. The classification model gives an enhanced result of 99.20% with the category combination and the feature optimization. In the expansion, the sequence-to-sequence model shows a good result of 96.53% for abbreviations and 96.25% for loanwords with a post-adjustment for some completely wrong cases. This model can predict effectively the expansions of abbreviations in context.\",\"PeriodicalId\":254461,\"journal\":{\"name\":\"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-06-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3342827.3342851\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3342827.3342851","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
This paper presents a hybrid method for normalizing written text often found on newspapers to its spoken form. To normalize raw text with a number of non-standard words (NSWs), a two-step model is proposed. The first step involves classifying NSWs into different categories using Random Forest. The latter one is to expand them, depending on their NSW types, into pronounceable syllables using a hybrid method. Most of numeric types can be expanded by well-defined rules while most of alphabetic ones must be expanded by a deep learning (i.e. sequence-to-sequence) model and a post adjustment. The experiment on a Vietnamese corpus with proposed NSW categories shows that the most ambiguous cases of the classification model are for abbreviation and read-as-sequence types, hence combined into one category for the latter expansion with more complex model and better context. The classification model gives an enhanced result of 99.20% with the category combination and the feature optimization. In the expansion, the sequence-to-sequence model shows a good result of 96.53% for abbreviations and 96.25% for loanwords with a post-adjustment for some completely wrong cases. This model can predict effectively the expansions of abbreviations in context.