Clustering of Web Search Results Based on Combination of Links and In-Snippets

2011 Eighth Web Information Systems and Applications Conference. Pub Date: 2011-10-21. DOI: 10.1109/WISA.2011.28

Nan Yang, Yue Liu, Gang Yang


Citations: 8

Abstract

Search engines are a common tool for retrieving information on the Web, but the quality of returned results is still far from satisfactory. Users must scan a long result list to find the information they actually want. Much work has focused on post-processing search results to make them easier to examine, and clustering is one of the most common approaches. Term-based clustering was the first approach to be applied, but its quality suffers when the processed pages contain little text. Link-based clustering overcomes this problem, but the quality of its clusters depends heavily on the number of in-links and out-links that pages have in common. In this paper, we propose that the short text attached to an in-link is valuable information that helps achieve high clustering quality. To distinguish it from a general snippet, we call it an in-snippet. Based on the in-snippet, we propose a new clustering method that combines links and in-snippets. In our method, the similarity between pages consists of two parts: link similarity and term similarity. We design an algorithm that implements the clustering. To avoid bias from human judgments, the experimental datasets are collected from the Open Directory Project (DMOZ). Because DMOZ is a human-edited directory, datasets drawn from it are of higher quality and larger scale. We use entropy and F-measure to evaluate the quality of the final clusters. Compared with the link-based and pure term-based algorithms, our method achieves better clustering quality.
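The two-part page similarity described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: cosine similarity over shared in-links and out-links, cosine similarity over in-snippet term counts, a hypothetical weight `alpha` that mixes the two, and the dictionary keys `in_links`, `out_links`, and `in_snippet_terms`. It is not the authors' algorithm.

```python
import math
from collections import Counter


def cosine_sets(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def cosine_counts(a, b):
    """Cosine similarity between two term-frequency Counters."""
    if not a or not b:
        return 0.0
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def link_similarity(p, q):
    """Assumed link similarity: mean of in-link and out-link overlap."""
    in_sim = cosine_sets(p["in_links"], q["in_links"])
    out_sim = cosine_sets(p["out_links"], q["out_links"])
    return 0.5 * (in_sim + out_sim)


def term_similarity(p, q):
    """Assumed term similarity: cosine over in-snippet term counts."""
    return cosine_counts(Counter(p["in_snippet_terms"]),
                         Counter(q["in_snippet_terms"]))


def combined_similarity(p, q, alpha=0.5):
    """Weighted mix of link and term similarity (alpha is hypothetical)."""
    return alpha * link_similarity(p, q) + (1 - alpha) * term_similarity(p, q)


# Two toy pages: they share one of two in-links, their only out-link,
# and all of their in-snippet terms.
page_a = {"in_links": {"x", "y"}, "out_links": {"u"},
          "in_snippet_terms": ["jaguar", "car"]}
page_b = {"in_links": {"y", "z"}, "out_links": {"u"},
          "in_snippet_terms": ["car", "jaguar"]}

score = combined_similarity(page_a, page_b)
```

A pairwise score like this could feed any standard clustering procedure; which procedure the paper uses is not stated in the abstract.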


