Full discrimination of subtopics in search results with keyphrase-based clustering

Web Intell. Agent Syst. | Pub Date: 2011-12-01 | DOI: 10.3233/WIA-2011-0222

Claudio Carpineto, M. D'Amico, Andrea Bernardini

Citations: 4

Abstract

We consider the problem of retrieving multiple documents relevant to the single subtopics of a given web query, termed “full-subtopic retrieval”. To solve this problem, we present a novel search results clustering algorithm that generates clusters labeled by keyphrases. The keyphrases are extracted from the generalized suffix tree built from the search results and merged through an improved hierarchical agglomerative clustering procedure. Our approach has been implemented in KeySRC (Keyphrase-based Search Results Clustering), a full web clustering engine available online at http://keysrc.fub.it. We discuss how the keyphrase-based clustering algorithm can be used not only for browsing through the clustered search results but also for producing a re-ranked list of results emphasizing the diversity of top hits. Using a novel measure for evaluating full-subtopic retrieval performance, called “Subtopic Search Length under k document sufficiency”, and a test collection specifically designed for evaluating subtopic retrieval, we found that our approach was able to discriminate between the different subtopics present in search results in a very effective manner, with a clear improvement over other subtopic retrieval systems. In particular, browsing through KeySRC clusters was the best method to retrieve more documents per subtopic (i.e., k>1), whereas using the re-ranked list formed from KeySRC clusters was more effective for retrieving just one document per subtopic (i.e., k=1).
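
The abstract does not spell out how the re-ranked list is formed or how the evaluation measure is computed, so the following Python sketch is only an illustration under stated assumptions, not the KeySRC implementation. It assumes that a diversified list can be obtained by interleaving clusters round-robin, and that for a flat ranked list the search length for a subtopic is the rank at which the k-th document relevant to that subtopic appears, averaged over the subtopics that can be satisfied. The function names, the toy document IDs, and the choice to skip unsatisfiable subtopics are all hypothetical.

```python
from itertools import zip_longest


def rerank_round_robin(clusters):
    """Interleave clusters: take the next unseen result from each cluster in turn.

    Assumption: each cluster is a list of document IDs, best first.
    """
    seen, ranked = set(), []
    for tier in zip_longest(*clusters):  # pads shorter clusters with None
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                ranked.append(doc)
    return ranked


def search_length_k(ranked, relevant, k):
    """Rank (1-based) at which the k-th relevant document is seen, else None."""
    found = 0
    for pos, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
            if found == k:
                return pos
    return None


def mean_search_length_k(ranked, subtopics, k):
    """Average search length over subtopics that have at least k relevant hits.

    Assumption: subtopics that cannot be satisfied are skipped.
    """
    lengths = [search_length_k(ranked, rel, k) for rel in subtopics.values()]
    reached = [x for x in lengths if x is not None]
    return sum(reached) / len(reached) if reached else None


# Toy data: three clusters, each covering one hypothetical subtopic.
clusters = [["d1", "d2", "d3"], ["d4", "d5"], ["d6"]]
subtopics = {"a": {"d1", "d2", "d3"}, "b": {"d4", "d5"}, "c": {"d6"}}

flat = [doc for cluster in clusters for doc in cluster]
diverse = rerank_round_robin(clusters)

print(diverse)
print(mean_search_length_k(flat, subtopics, 1))
print(mean_search_length_k(diverse, subtopics, 1))
```

On this toy input, interleaving brings one document from every cluster to the top of the list, which is the kind of diversity in the top hits that the abstract refers to for k=1. The numbers it produces say nothing about the results reported in the paper.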

Source journal

Web Intell. Agent Syst.

Self-citation rate

0.00%

Articles published