Text Compression for Myanmar Information Retrieval
N. Lin, A. KudinovVitaly, Y. Soe
Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval, 2019-06-28
https://doi.org/10.1145/3342827.3342830
Myanmar word segmentation is an important step in constructing the dictionary file used for Myanmar information retrieval and Myanmar text compression. Although dictionary-based and orthography-based word segmentation methods already exist for the Myanmar language, their performance depends on the coverage of the dictionary and the training dataset, and they can suffer from the out-of-vocabulary (OOV) problem, which lowers precision and recall in information retrieval. Moreover, to compress Myanmar text, the words in the text must first be recognized. In this paper, we propose a new method for Myanmar word segmentation based on a local statistical dataset, without the use of any additional data (e.g., a training corpus), and a new compressed Myanmar Information Retrieval (MIR) model that uses the End-Tagged Dense Code (ETDC) text compression method. The experimental results show that the method improves the quality of the vocabulary file, achieving a precision of 75%, a recall of 87%, and an F-measure of 80%, with an average compression ratio of 32% for Myanmar-language texts.
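To make the compression step concrete, the following is a minimal sketch of word-based End-Tagged Dense Code, not the authors' implementation. It assumes the text has already been segmented into words (the segmentation method itself is not reproduced here), and the function names `etdc_codeword`, `build_vocabulary`, `compress`, and `decompress` are hypothetical. In ETDC, words are ranked by descending frequency and each rank is mapped to a byte sequence whose last byte has its high bit set, so codeword boundaries can be found without a separate delimiter.

```python
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """Return the ETDC codeword for a 0-based frequency rank.

    The last byte carries the end tag (value 128..255); every
    preceding byte is a continuation byte (value 0..127).
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + (rank % 128)]      # final byte, high bit set
    rest = rank // 128 - 1
    while rest >= 0:
        out.append(rest % 128)      # continuation byte, high bit clear
        rest = rest // 128 - 1
    return bytes(reversed(out))


def build_vocabulary(tokens):
    """Rank distinct tokens by descending frequency (ties broken by token)."""
    counts = Counter(tokens)
    return sorted(counts, key=lambda t: (-counts[t], t))


def compress(tokens, vocab):
    """Encode a token sequence as concatenated ETDC codewords."""
    rank = {word: i for i, word in enumerate(vocab)}
    return b"".join(etdc_codeword(rank[t]) for t in tokens)


def decompress(data: bytes, vocab):
    """Decode concatenated ETDC codewords back into tokens."""
    tokens = []
    acc = 0
    for b in data:
        if b >= 128:                # end-tagged byte closes the codeword
            tokens.append(vocab[acc * 128 + (b - 128)])
            acc = 0
        else:                       # continuation byte
            acc = acc * 128 + b + 1
    return tokens
```

With a vocabulary built by `build_vocabulary(tokens)`, the 128 most frequent words each take one byte, and `decompress(compress(tokens, vocab), vocab)` returns the original token list. The vocabulary must be stored alongside the compressed bytes, so the ratio achieved in practice depends on both the text and the vocabulary size.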