David Ediger, Karl Jiang, E. J. Riedy, David A. Bader
{"title":"大规模流数据分析:聚类系数的案例研究","authors":"David Ediger, Karl Jiang, E. J. Riedy, David A. Bader","doi":"10.1109/IPDPSW.2010.5470687","DOIUrl":null,"url":null,"abstract":"We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 224 ≈ 16 million vertices and 229 ≈ 537 million edges, the brute-force method processes a mean of over 50 000 updates per second and our Bloom filter approaches 200 000 updates per second.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"197 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"71","resultStr":"{\"title\":\"Massive streaming data analytics: A case study with clustering coefficients\",\"authors\":\"David Ediger, Karl Jiang, E. J. Riedy, David A. Bader\",\"doi\":\"10.1109/IPDPSW.2010.5470687\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. 
On 32 processors of a Cray XMT with a synthetic scale-free graph of 224 ≈ 16 million vertices and 229 ≈ 537 million edges, the brute-force method processes a mean of over 50 000 updates per second and our Bloom filter approaches 200 000 updates per second.\",\"PeriodicalId\":329280,\"journal\":{\"name\":\"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)\",\"volume\":\"197 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-04-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"71\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2010.5470687\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2010.5470687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Massive streaming data analytics: A case study with clustering coefficients
We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates that incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^24 ≈ 16 million vertices and 2^29 ≈ 537 million edges, the brute-force method processes a mean of over 50,000 updates per second and our Bloom filter approaches 200,000 updates per second.
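As a rough illustration of the incremental-update idea the abstract describes (and not the authors' Cray XMT implementation), the sketch below maintains per-vertex triangle counts so the local clustering coefficient c_v = 2 T_v / (d_v (d_v - 1)) can be updated when a single edge arrives, rather than recomputed globally. The update shown corresponds to the exact "brute-force" neighborhood intersection; the paper's Bloom filter variant would replace the exact set intersection with an approximate membership test. All names (Graph, insert_edge, local_cc) are hypothetical and chosen for this example.

```python
# Minimal sketch of incremental local clustering coefficients on edge insertion.
from collections import defaultdict

class Graph:
    def __init__(self):
        self.adj = defaultdict(set)        # adjacency sets
        self.triangles = defaultdict(int)  # per-vertex triangle counts

    def local_cc(self, v):
        """Local clustering coefficient: closed wedges / possible wedges at v."""
        d = len(self.adj[v])
        return 0.0 if d < 2 else 2.0 * self.triangles[v] / (d * (d - 1))

    def insert_edge(self, u, v):
        """Insert (u, v); only u, v, and their common neighbors gain triangles."""
        if v in self.adj[u] or u == v:
            return
        common = self.adj[u] & self.adj[v]  # exact neighborhood intersection
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.triangles[u] += len(common)
        self.triangles[v] += len(common)
        for w in common:
            self.triangles[w] += 1

# Example: a triangle (0,1,2) plus a pendant edge (2,3).
g = Graph()
for e in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    g.insert_edge(*e)
print(g.local_cc(0), g.local_cc(2))  # 1.0 and ~0.333
```

Each insertion touches only the endpoints' neighborhoods, which is what makes streaming updates cheaper than recomputing every coefficient from scratch; edge deletions would follow the same pattern with decrements.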