Exploiting tag and word correlations for improved webpage clustering

SMUC '10 | Pub Date: 2010-10-30 | DOI: 10.1145/1871985.1871989

Anusua Trivedi, Piyush Rai, S. Duvall, Hal Daumé


Citations: 16

Abstract

Automatic clustering of webpages helps a number of information retrieval tasks, such as improving user interfaces, collection clustering, and introducing diversity in search results. Typically, webpage clustering algorithms use only features extracted from the page text. However, the advent of social-bookmarking websites, such as StumbleUpon and Delicious, has led to a huge amount of user-generated content, such as the tag information associated with webpages. In this paper, we present a subspace-based feature extraction approach that leverages tag information to complement the page contents of a webpage and extract highly discriminative features, with the goal of improved clustering performance. In our approach, we consider page text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We compare our subspace-based approach with a number of baselines that use tag information in various other ways, and show that the subspace-based approach leads to improved performance on the webpage clustering task. Although our results here are on the webpage clustering task, the same approach can be used for webpage classification as well. Finally, we suggest possible future work on leveraging tag information in webpage clustering, especially when tag information is available for only a small number of webpages rather than all of them.
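The two-view idea in the abstract can be illustrated with a minimal sketch. It assumes canonical correlation analysis (CCA) as the way to learn the correlation-maximizing subspace; CCA is one standard technique fitting that description, but the abstract does not name the algorithm, so this should not be read as the authors' implementation. The synthetic data, the function names `cca_projection` and `kmeans`, and the regularization value `reg` are all hypothetical choices made for illustration.

```python
import numpy as np

def cca_projection(X, Y, k, reg=1e-3):
    """Project view X onto the top-k directions most correlated with view Y.

    Regularized linear CCA via Cholesky whitening and an SVD (an assumed
    stand-in for the subspace learner described in the abstract).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Lx = np.linalg.cholesky(Cxx)          # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)          # Cyy = Ly Ly^T
    # Whitened cross-covariance M = Lx^{-1} Cxy Ly^{-T}
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :k])  # map back to the original X space
    return Xc @ Wx, s[:k]                 # embedding, canonical correlations

def kmeans(Z, k, iters=20, seed=0):
    """Plain Lloyd's k-means; any clustering algorithm could be used here."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = Z[assign == j].mean(axis=0)
    return assign

# Synthetic stand-in for real pages: two topics, each page seen through a
# 20-dimensional "page text" view and an 8-dimensional "tag" view.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
latent = np.where(labels == 0, -1.0, 1.0)[:, None]
X_text = latent @ rng.normal(size=(1, 20)) + rng.normal(size=(100, 20))
Y_tags = latent @ rng.normal(size=(1, 8)) + rng.normal(size=(100, 8))

Z, corrs = cca_projection(X_text, Y_tags, k=1)
assign = kmeans(Z, k=2)
# Cluster ids are arbitrary, so score against both labelings.
accuracy = max(np.mean(assign == labels), np.mean(assign != labels))
print("top canonical correlation:", corrs[0])
print("clustering accuracy:", accuracy)
```

In this sketch only the text view is projected at clustering time; the tag view is used solely to choose the projection directions. Real page-text and tag features would typically be sparse bag-of-words vectors with far more dimensions than samples, where stronger regularization or a kernelized variant would likely be needed.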




Source journal

SMUC '10

Self-citation rate

0.00%

Publication count