PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations

2009 Ninth IEEE International Conference on Data Mining. Pub Date: 2009-12-06. DOI: 10.1109/ICDM.2009.14

U. Kang, Charalampos E. Tsourakakis, C. Faloutsos


Citations: 735

Abstract

In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the Hadoop platform, the open source version of MapReduce. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web Graphs, thanks to Yahoo!, with 6.7 billion edges.
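The abstract states that many graph mining operations reduce to a repeated matrix-vector multiplication but does not spell out the interface of GIM-V. As a rough illustration only, the sketch below assumes the multiplication is generalized by three user-supplied functions, here named `combine2`, `combine_all`, and `assign` (names chosen for this example). It is a single-machine Python sketch over a sparse edge list, not the Hadoop implementation and not PEGASUS's actual API.

```python
from collections import defaultdict
import math


def gim_v_step(edges, v, combine2, combine_all, assign):
    """One generalized matrix-vector product over a sparse edge list.

    edges: iterable of (i, j, m_ij) triples, the non-zero matrix entries.
    v: dict mapping node id -> current vector value.
    combine2, combine_all, assign: hypothetical hooks (see lead-in).
    """
    partial = defaultdict(list)
    for i, j, m_ij in edges:
        # Combine one matrix entry with the matching vector entry.
        partial[i].append(combine2(m_ij, v[j]))
    # Reduce the partial results per row, then decide the new value.
    return {i: assign(v_i, combine_all(partial.get(i, []))) for i, v_i in v.items()}


def gim_v(edges, v, combine2, combine_all, assign, max_iters=100):
    """Repeat the step until the vector stops changing or max_iters is hit."""
    for _ in range(max_iters):
        nxt = gim_v_step(edges, v, combine2, combine_all, assign)
        if nxt == v:
            break
        v = nxt
    return v


# Instance 1: ordinary matrix-vector product (multiply, sum, overwrite).
matrix_edges = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]  # [[1, 2], [0, 3]]
product = gim_v_step(
    matrix_edges,
    {0: 1.0, 1: 1.0},
    combine2=lambda m, x: m * x,
    combine_all=sum,
    assign=lambda old, new: new,
)

# Instance 2: connected components by propagating the minimum node id.
undirected = [(0, 1), (1, 2), (3, 4)]
cc_edges = [(a, b, 1.0) for a, b in undirected] + [(b, a, 1.0) for a, b in undirected]
components = gim_v(
    cc_edges,
    {n: n for n in range(5)},
    combine2=lambda m, x: x,  # pass the neighbour's label through
    combine_all=lambda xs: min(xs, default=math.inf),
    assign=min,  # keep the smaller of old and incoming label
)

print(product)
print(components)
```

Swapping the three functions changes the task while the iteration loop stays the same, which is the sense in which one primitive can serve several of the operations listed in the abstract.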
