Clustering and Labeling a Web Scale Document Collection using Wikipedia clusters

Web-KR '14. Pub Date: 2014-11-03. DOI: 10.1145/2663792.2663803

R. Nayak, Rachel Mills, R. D. Vries, S. Geva


Citations: 17

Abstract

Clustering is an important technique for organising and categorising web-scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground-truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution rests on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small-scale web. We found that the web-scale document clustering and labeling process could be performed on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. Solutions with finer-grained clustering, such as 10,000 or 50,000 clusters, take longer to execute. These results were evaluated using a set of external data.
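The mapping step described in the abstract (web documents assigned to clusters built from Wikipedia) can be illustrated with a small sketch. The abstract does not state the document representation, the similarity measure, or how cluster labels are derived, so everything here is an assumption made for illustration: whitespace tokenization, term-frequency vectors, cosine similarity, and the toy cluster names and texts. It is not the authors' implementation.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse term-frequency vector (assumed tokenizer: lowercase + whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Sum of member vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def map_to_clusters(docs, centroids):
    """Assign each document to the label of its most similar centroid."""
    assignments = {}
    for doc_id, text in docs.items():
        vec = vectorize(text)
        assignments[doc_id] = max(
            centroids, key=lambda label: cosine(vec, centroids[label])
        )
    return assignments


# Hypothetical stand-ins for pre-clustered Wikipedia articles.
wiki_clusters = {
    "astronomy": ["planet star galaxy orbit", "telescope star planet"],
    "cooking": ["recipe oven flour sugar", "bake bread flour oven"],
}
centroids = {
    label: centroid(vectorize(t) for t in texts)
    for label, texts in wiki_clusters.items()
}

# Hypothetical stand-ins for web documents to be mapped.
web_docs = {
    "page1": "how to bake bread in a home oven",
    "page2": "a new planet found near a distant star",
}
assignments = map_to_clusters(web_docs, centroids)
print(assignments)
```

In this sketch the expensive clustering happens once on the smaller reference collection, and each web document then costs one similarity comparison per centroid. That is consistent with the abstract's remark that finer-grained solutions (10,000 or 50,000 clusters) take longer to execute, although the paper's actual cost model is not given here.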

