Clustering and Labeling a Web Scale Document Collection using Wikipedia clusters

Web-KR '14 | Pub Date: 2014-11-03 | DOI: 10.1145/2663792.2663803
R. Nayak, Rachel Mills, R. D. Vries, S. Geva
Citations: 17

Abstract

Clustering is an important technique for organising and categorising web-scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets. More importantly, it is nigh impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics, yet document clusters are most commonly evaluated by comparison with a ground-truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. The solution rests on the assumption that Wikipedia covers such a wide range of topics that it represents a small-scale web. We found that the web-scale clustering and labeling process could be performed on one desktop computer in under a couple of days for a Wikipedia clustering solution containing about 1,000 clusters. Solutions with finer-grained clustering, such as 10,000 or 50,000 clusters, take longer to execute. These results were evaluated using a set of external data.
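
The idea of mapping web documents onto clusters built from a reference collection can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the document representation, similarity measure, or clustering algorithm, so the term-frequency vectors, cosine similarity, summed centroids, and every name and data value below (`vectorize`, `map_to_clusters`, the toy cluster labels and documents) are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse term-frequency vector from lowercased whitespace tokens (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Sum of member vectors; cosine similarity is unaffected by the missing division."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def map_to_clusters(reference_clusters, web_docs):
    """Assign each web document the label of its most similar reference cluster."""
    centroids = {
        label: centroid([vectorize(text) for text in texts])
        for label, texts in reference_clusters.items()
    }
    assignments = {}
    for doc_id, text in web_docs.items():
        vec = vectorize(text)
        assignments[doc_id] = max(
            centroids, key=lambda label: cosine(vec, centroids[label])
        )
    return assignments


# Hypothetical toy data standing in for already-clustered reference articles
# and for the much larger web collection to be labeled.
reference_clusters = {
    "astronomy": ["planet orbit star telescope", "galaxy star nebula"],
    "cooking": ["recipe oven bake flour", "sauce simmer recipe garlic"],
}
web_docs = {
    "doc-1": "a new telescope images a distant galaxy",
    "doc-2": "bake the bread in a hot oven",
}

assignments = map_to_clusters(reference_clusters, web_docs)
print(assignments)
```

In this sketch the expensive step (clustering) touches only the small reference collection, while each web document needs just one comparison per cluster. That is one plausible reading of why a coarser solution of about 1,000 clusters would run faster than one with 10,000 or 50,000.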