An online clustering algorithm for Chinese web snippets based on Generalized Suffix Array

Zhang Hui, Wang Han, Yang Gao, Zhou Jingmin
2009 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, published 2009-12-01
DOI: 10.1109/CYBERC.2009.5342183 (https://doi.org/10.1109/CYBERC.2009.5342183)
Citations: 1

Abstract

As the volume of information on the Internet grows rapidly, web search engines have become indispensable tools for locating information. Web snippet clustering can classify the search results and help users narrow the scope of their search. This paper presents an online clustering algorithm for Chinese web snippets based on common substrings. The algorithm first preprocesses the results returned by a search engine and extracts common substrings using a Generalized Suffix Array. It then builds a snippet-snippet similarity matrix by computing the similarity between every pair of snippets under a common-substring-based dimensional model. Finally, it groups the web snippets using an improved hierarchical clustering algorithm. Theoretical analysis and experiments show that, compared with traditional Chinese web snippet clustering algorithms based on Chinese word segmentation, the proposed algorithm performs better in both clustering efficiency and the readability of the generated cluster labels.
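The three stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the suffix array construction, the weighting in the dimensional model, or the improvements to hierarchical clustering. The sketch assumes a naive sorted-suffix construction, binary substring features with cosine similarity, and plain average-link agglomerative merging with a cutoff. All function names and the `min_len` and `threshold` parameters are hypothetical.

```python
from itertools import combinations
from math import sqrt


def common_substrings(snippets, min_len=2):
    """Map each shared substring to the set of snippet indices containing it.

    Assumption: a naive generalized suffix array (sort every suffix of every
    snippet), not the efficient construction a real system would use.
    """
    # Each entry is (suffix text, index of the snippet it came from).
    suffixes = sorted(
        (s[i:], doc) for doc, s in enumerate(snippets) for i in range(len(s))
    )
    candidates = set()
    for (a, doc_a), (b, doc_b) in zip(suffixes, suffixes[1:]):
        if doc_a == doc_b:
            continue
        # Longest common prefix of two adjacent suffixes from different snippets.
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
            lcp += 1
        if lcp >= min_len:
            candidates.add(a[:lcp])
    # Simplification: resolve membership by direct substring search.
    return {
        sub: {i for i, s in enumerate(snippets) if sub in s} for sub in candidates
    }


def similarity_matrix(n, shared):
    """Cosine similarity between binary common-substring feature sets (assumed weighting)."""
    feats = [set() for _ in range(n)]
    for sub, docs in shared.items():
        for d in docs:
            feats[d].add(sub)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if feats[i] and feats[j]:
                overlap = len(feats[i] & feats[j])
                sim[i][j] = overlap / sqrt(len(feats[i]) * len(feats[j]))
    return sim


def cluster(sim, threshold=0.5):
    """Average-link agglomerative clustering that stops below `threshold`."""
    clusters = [[i] for i in range(len(sim))]

    def avg(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg(clusters[p[0]], clusters[p[1]]),
        )
        if avg(clusters[x], clusters[y]) < threshold:
            break
        merged = clusters.pop(y)  # y > x, so index x is still valid
        clusters[x].extend(merged)
    return clusters


def label(members, shared):
    """Use the longest substring shared by every member as the cluster label."""
    options = [sub for sub, docs in shared.items() if docs.issuperset(members)]
    return max(options, key=len, default="")


snippets = ["苹果手机发布会", "苹果手机价格", "后缀数组算法", "后缀数组构造"]
shared = common_substrings(snippets)
groups = cluster(similarity_matrix(len(snippets), shared))
labels = [label(g, shared) for g in groups]
```

Because the features are raw substrings rather than segmented words, no Chinese word segmenter is needed, and each label is a contiguous phrase taken directly from the snippets. This is consistent with the efficiency and readability advantages the abstract claims, although the sketch itself makes no attempt to reproduce the reported results.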