Chinese coding type identification based on Kolmogorov complexity theory
Gang He, Ning Zhu, Xiaochun Wu, Qiuchen Xu
2010 2nd IEEE International Conference on Network Infrastructure and Digital Content, 2010-12-03
DOI: 10.1109/ICNIDC.2010.5657789 (https://doi.org/10.1109/ICNIDC.2010.5657789)
Identifying the coding type of Chinese text is a major and challenging problem in Chinese web content audit and analysis. In this paper we develop a novel algorithm, based on the theory of Kolmogorov complexity, to identify the coding type of the Chinese characters in a given text segment. An array of text compressors is used as filters to evaluate the information distance between the text under examination and training corpora encoded in different coding types. According to Kolmogorov theory, this information distance can be used to decide the coding type. A particular compression algorithm is used to minimize computational complexity by separating the codebook training stage from the compression stage. Finally, we present experimental results that confirm the accuracy and performance of the algorithm. The results also show that the algorithm is especially efficient, compared with n-gram algorithms, when the text segment under examination is short.
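
The idea of one compressor per coding type can be illustrated with a minimal sketch. It is not the compressor described in the paper: it assumes Python's standard `zlib` module, with a preset dictionary (`zdict`) standing in for the separately trained codebook, and it uses the compressed size of the segment as a rough proxy for the information distance. The names `TRAINING_TEXT`, `CORPORA`, `compressed_size`, and `identify_encoding`, and the tiny training sample, are hypothetical and chosen only for illustration.

```python
import zlib
from typing import Mapping

# Hypothetical toy training text; a real system would use a large corpus
# per coding type. The characters were chosen to be encodable in all three.
TRAINING_TEXT = (
    "我們在大學上中文課，老師和同學都很好。"
    "今天的天氣很好，我們一起去圖書館看書。"
)

# "Training" stage: one byte-level codebook per candidate coding type.
CORPORA = {enc: TRAINING_TEXT.encode(enc) for enc in ("gbk", "big5", "utf-8")}


def compressed_size(segment: bytes, codebook: bytes) -> int:
    """Size of `segment` after DEFLATE compression primed with `codebook`."""
    comp = zlib.compressobj(level=9, zdict=codebook)
    return len(comp.compress(segment) + comp.flush())


def identify_encoding(segment: bytes, corpora: Mapping[str, bytes]) -> str:
    """Return the coding type whose codebook compresses `segment` best.

    A smaller compressed size is taken to mean a smaller distance between
    the segment and the training corpus (an assumption of this sketch).
    """
    return min(corpora, key=lambda enc: compressed_size(segment, corpora[enc]))


# Usage: a short segment of unknown coding type.
unknown = "老師和同學都很好".encode("big5")
print(identify_encoding(unknown, CORPORA))
```

Because each codebook is built once and only reused at compression time, classifying a new segment costs one short compression per candidate coding type, which is the kind of separation between training and compression that the abstract describes.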