An online clustering algorithm for Chinese web snippets based on Generalized Suffix Array
Zhang Hui, Wang Han, Yang Gao, Zhou Jingmin
2009 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2009-12-01
DOI: 10.1109/CYBERC.2009.5342183 (https://doi.org/10.1109/CYBERC.2009.5342183)
As the volume of information on the Internet grows dramatically, web search engines have become indispensable tools for locating required information. Web snippet clustering organizes search results into groups and helps users narrow the search scope. This paper presents an online clustering algorithm for Chinese web snippets based on common substrings. The algorithm first preprocesses the results returned by a search engine and extracts common substrings using a Generalized Suffix Array. It then builds a snippet-snippet similarity matrix by computing the similarity between every pair of snippets under a common-substring-based dimensional model. Finally, it groups the web snippets using an improved hierarchical clustering algorithm. Theoretical analysis and experiments show that, compared with traditional Chinese web snippet clustering algorithms based on Chinese word segmentation, our algorithm performs better in both clustering efficiency and the readability of the generated cluster labels.
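The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration. The generalized suffix array is built naively by sorting all suffixes of all snippets. Each common substring is weighted by its length and snippets are compared by cosine similarity. Plain average-linkage agglomerative clustering with a fixed threshold stands in for the paper's "improved" hierarchical clustering, whose details the abstract does not give. The function names, `min_len`, and `threshold` are hypothetical.

```python
from math import sqrt


def common_substrings(snippets, min_len=2, min_docs=2):
    """Map each shared substring to the set of snippet indices containing it."""
    # Naive generalized suffix array: every suffix of every snippet, sorted.
    suffixes = sorted(s[i:] for s in snippets for i in range(len(s)))
    candidates = set()
    for a, b in zip(suffixes, suffixes[1:]):
        # The longest common prefix of adjacent suffixes is a repeated substring.
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k >= min_len:
            candidates.add(a[:k])
    result = {}
    for sub in candidates:
        docs = {d for d, s in enumerate(snippets) if sub in s}
        if len(docs) >= min_docs:  # keep substrings shared across snippets
            result[sub] = docs
    return result


def similarity_matrix(n, subs):
    """Cosine similarity between snippets in the common-substring space."""
    vecs = [{} for _ in range(n)]
    for sub, docs in subs.items():
        for d in docs:
            vecs[d][sub] = float(len(sub))  # assumed weight: substring length

    def cos(u, v):
        dot = sum(w * v[k] for k, w in u.items() if k in v)
        nu = sqrt(sum(w * w for w in u.values()))
        nv = sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    return [[cos(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]


def cluster(sim, threshold=0.3):
    """Average-linkage agglomerative clustering, stopping below the threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[x] for j in clusters[y])
                avg = total / (len(clusters[x]) * len(clusters[y]))
                if avg > best:
                    best, pair = avg, (x, y)
        if pair is None:  # no pair is similar enough to merge
            break
        x, y = pair
        clusters[x] += clusters.pop(y)
    return clusters


def label(members, subs):
    """Use the longest substring shared by all members as the cluster label."""
    shared = (s for s, docs in subs.items() if set(members) <= docs)
    return max(shared, key=len, default="")


snippets = ["苹果手机发布会", "苹果手机价格", "北京天气预报", "北京天气晴朗"]
subs = common_substrings(snippets)
groups = cluster(similarity_matrix(len(snippets), subs))
labels = [label(g, subs) for g in groups]
```

Because the labels are substrings taken directly from the snippets, no word segmentation step is involved, which is the property the abstract credits for more readable cluster labels. The naive suffix sort here is quadratic or worse in the total text length; a production system would use a proper suffix array construction algorithm.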