A Hybrid Method for Vietnamese Text Normalization

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval. Pub Date: 2019-06-28. DOI: 10.1145/3342827.3342851

Nguyen Thi Thu Trang, Dang Xuan Bach, N. X. Tung


Citations: 7

Abstract



This paper presents a hybrid method for normalizing written text, of the kind commonly found in newspapers, into its spoken form. To normalize raw text containing non-standard words (NSWs), a two-step model is proposed. The first step classifies NSWs into categories using a Random Forest. The second step expands them, depending on their NSW type, into pronounceable syllables using a hybrid method: most numeric types can be expanded by well-defined rules, while most alphabetic types must be expanded by a deep learning (i.e., sequence-to-sequence) model followed by a post-adjustment. Experiments on a Vietnamese corpus with the proposed NSW categories show that the most ambiguous cases for the classification model are the abbreviation and read-as-sequence types, which are therefore combined into one category and left to the expansion step, where a more complex model and richer context are available. With the category combination and feature optimization, the classification model achieves an improved result of 99.20%. In the expansion step, the sequence-to-sequence model achieves 96.53% for abbreviations and 96.25% for loanwords, with a post-adjustment for some completely wrong cases. The model can effectively predict the expansions of abbreviations in context.
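
The two-step structure described in the abstract (classify each token, then expand it according to its type) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the category names, the regular expressions standing in for the Random Forest classifier, the pass-through placeholder standing in for the sequence-to-sequence model, and the restriction of the number rules to the range 0 to 99.

```python
import re

# Standard Vietnamese readings of the digits 0-9.
UNITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def read_number(n: int) -> str:
    """Rule-based reading of an integer in 0..99 (illustrative range only)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    tens, unit = divmod(n, 10)
    if tens == 0:
        return UNITS[unit]
    head = "mười" if tens == 1 else f"{UNITS[tens]} mươi"
    if unit == 0:
        return head
    if unit == 5:
        tail = "lăm"  # 5 is read "lăm" after a tens word
    elif unit == 1 and tens > 1:
        tail = "mốt"  # 1 is read "mốt" after "mươi"
    else:
        tail = UNITS[unit]
    return f"{head} {tail}"


def classify(token: str) -> str:
    """Stand-in for step 1. The paper uses a Random Forest, not these regexes."""
    if re.fullmatch(r"\d{1,2}", token):
        return "NUMBER"
    if re.fullmatch(r"[A-ZĐ]{2,}", token):
        return "ABBREVIATION_OR_SEQUENCE"  # merged category, as in the abstract
    return "STANDARD"


def expand_alphabetic(token: str) -> str:
    """Placeholder for the sequence-to-sequence model: returns the token as-is."""
    return token


# Step 2: route each token to an expander chosen by its predicted category.
EXPANDERS = {
    "NUMBER": lambda tok: read_number(int(tok)),
    "ABBREVIATION_OR_SEQUENCE": expand_alphabetic,
    "STANDARD": lambda tok: tok,
}


def normalize(text: str) -> str:
    """Classify every whitespace-separated token, then expand it."""
    return " ".join(EXPANDERS[classify(tok)](tok) for tok in text.split())


print(normalize("ngày 25 tháng 12"))  # ngày hai mươi lăm tháng mười hai
```

The dispatch table keeps the rule-based and model-based expanders interchangeable, so a trained model could replace `expand_alphabetic` without touching the numeric rules.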
