大规模数据并行哈希的设计与评估

Long Cheng, S. Kotoulas, Tomas E. Ward, G. Theodoropoulos
{"title":"大规模数据并行哈希的设计与评估","authors":"Long Cheng, S. Kotoulas, Tomas E. Ward, G. Theodoropoulos","doi":"10.1109/HiPC.2014.7116909","DOIUrl":null,"url":null,"abstract":"High-performance analytical data processing systems often run on servers with large amounts of memory. A common data structure used in such environment is the hash tables. This paper focuses on investigating efficient parallel hash algorithms for processing large-scale data. Currently, hash tables on distributed architectures are accessed one key at a time by local or remote threads while shared-memory approaches focus on accessing a single table with multiple threads. A relatively straightforward “bulk-operation” approach seems to have been neglected by researchers. In this work, using such a method, we propose a high-level parallel hashing framework, Structured Parallel Hashing, targeting efficiently processing massive data on distributed memory. We present a theoretical analysis of the proposed method and describe the design of our hashing implementations. The evaluation reveals a very interesting result - the proposed straightforward method can vastly outperform distributed hashing methods and can even offer performance comparable with approaches based on shared memory supercomputers which use specialized hardware predicates. Moreover, we characterize the performance of our hash implementations through extensive experiments, thereby allowing system developers to make a more informed choice for their high-performance applications.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Design and evaluation of parallel hashing over large-scale data\",\"authors\":\"Long Cheng, S. Kotoulas, Tomas E. Ward, G. Theodoropoulos\",\"doi\":\"10.1109/HiPC.2014.7116909\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High-performance analytical data processing systems often run on servers with large amounts of memory. A common data structure used in such environment is the hash tables. This paper focuses on investigating efficient parallel hash algorithms for processing large-scale data. Currently, hash tables on distributed architectures are accessed one key at a time by local or remote threads while shared-memory approaches focus on accessing a single table with multiple threads. A relatively straightforward “bulk-operation” approach seems to have been neglected by researchers. In this work, using such a method, we propose a high-level parallel hashing framework, Structured Parallel Hashing, targeting efficiently processing massive data on distributed memory. We present a theoretical analysis of the proposed method and describe the design of our hashing implementations. The evaluation reveals a very interesting result - the proposed straightforward method can vastly outperform distributed hashing methods and can even offer performance comparable with approaches based on shared memory supercomputers which use specialized hardware predicates. Moreover, we characterize the performance of our hash implementations through extensive experiments, thereby allowing system developers to make a more informed choice for their high-performance applications.\",\"PeriodicalId\":337777,\"journal\":{\"name\":\"2014 21st International Conference on High Performance Computing (HiPC)\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-12-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 21st International Conference on High Performance Computing (HiPC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HiPC.2014.7116909\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 21st International Conference on High Performance Computing (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2014.7116909","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

摘要

高性能分析数据处理系统通常运行在具有大量内存的服务器上。在这种环境中使用的常见数据结构是哈希表。本文主要研究用于处理大规模数据的高效并行哈希算法。目前,分布式架构上的哈希表是由本地或远程线程一次访问一个键,而共享内存方法侧重于使用多个线程访问单个表。一种相对简单的“批量手术”方法似乎被研究人员忽视了。在这项工作中,利用这种方法,我们提出了一个高级并行哈希框架,结构化并行哈希,旨在高效地处理分布式内存上的海量数据。我们对所提出的方法进行了理论分析,并描述了我们的哈希实现的设计。评估揭示了一个非常有趣的结果——所建议的直接方法可以大大优于分布式哈希方法,甚至可以提供与基于使用专用硬件谓词的共享内存超级计算机的方法相当的性能。此外,我们通过广泛的实验来描述我们的哈希实现的性能,从而允许系统开发人员为他们的高性能应用程序做出更明智的选择。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Design and evaluation of parallel hashing over large-scale data
High-performance analytical data processing systems often run on servers with large amounts of memory. A common data structure used in such environment is the hash tables. This paper focuses on investigating efficient parallel hash algorithms for processing large-scale data. Currently, hash tables on distributed architectures are accessed one key at a time by local or remote threads while shared-memory approaches focus on accessing a single table with multiple threads. A relatively straightforward “bulk-operation” approach seems to have been neglected by researchers. In this work, using such a method, we propose a high-level parallel hashing framework, Structured Parallel Hashing, targeting efficiently processing massive data on distributed memory. We present a theoretical analysis of the proposed method and describe the design of our hashing implementations. The evaluation reveals a very interesting result - the proposed straightforward method can vastly outperform distributed hashing methods and can even offer performance comparable with approaches based on shared memory supercomputers which use specialized hardware predicates. Moreover, we characterize the performance of our hash implementations through extensive experiments, thereby allowing system developers to make a more informed choice for their high-performance applications.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信