2020 IEEE High Performance Extreme Computing Conference (HPEC): Latest Publications

On the Feasibility of Using Reduced-Precision Tensor Core Operations for Graph Analytics
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286152
J. Firoz, Ang Li, Jiajia Li, K. Barker
Abstract: Today's data-driven analytics and machine learning workloads have largely been driven by General-Purpose Graphics Processing Units (GPGPUs). To accelerate dense matrix multiplications on GPUs, Tensor Core Units (TCUs) have been introduced in recent years. In this paper, we study linear-algebra-based and vertex-centric algorithms for various graph kernels on GPUs, with the objective of applying this new hardware feature to graph applications. We identify the potential stages in these graph kernels that can be executed on the Tensor Core Units. In particular, we leverage the reformulation of the reduction and scan operations in terms of matrix multiplication [1] on the TCUs. We demonstrate that executing these operations on the TCUs, wherever they are available inside different graph kernels, can help establish an end-to-end pipeline on the GPGPUs without depending on hand-tuned external libraries, while still delivering comparable performance for various graph analytics.
Citations: 3
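The reformulation of reductions and scans as matrix multiplications, cited as [1], is the piece most amenable to a compact illustration. A minimal NumPy sketch of that idea follows; the tensor cores themselves, the half-precision fragments they operate on, and the surrounding graph kernels are not modeled, and the helper names are illustrative.

```python
import numpy as np

def reduce_via_matmul(x, tile=16):
    """Sum-reduce a vector by turning the per-tile sums into one matrix product:
    (num_tiles x tile) @ (tile x 1), then combining the per-tile partials."""
    pad = (-len(x)) % tile                       # pad to a multiple of the tile size
    xt = np.pad(x, (0, pad)).reshape(-1, tile)
    ones = np.ones((tile, 1), dtype=x.dtype)
    return (xt @ ones).sum()                     # the matmul does the per-tile sums

def scan_via_matmul(x):
    """Inclusive prefix sum expressed as L @ x with L a lower-triangular ones matrix."""
    L = np.tril(np.ones((len(x), len(x)), dtype=x.dtype))
    return L @ x

x = np.arange(1, 9, dtype=np.float32)
print(reduce_via_matmul(x))   # 36.0
print(scan_via_matmul(x))     # [ 1.  3.  6. 10. 15. 21. 28. 36.]
```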
Combinatorial Tiling for Sparse Neural Networks
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286154
Filip Pawlowski, R. Bisseling, B. Uçar, Albert-Jan N. Yzelman
Abstract: Sparse deep neural networks (DNNs) emerged as the result of the search for networks with less storage and lower computational complexity. Sparse DNN inference is the task of using such trained DNN networks to classify a batch of input data. We propose an efficient, hybrid model- and data-parallel DNN inference using hypergraph models and partitioners. We exploit tiling and weak synchronization to increase cache reuse, hide load imbalance, and hide synchronization costs. Finally, a blocking approach allows application of this new hybrid inference procedure to deep neural networks. We initially experiment using the hybrid tiled inference approach only, using the first five layers of networks from the IEEE HPEC 2019 Graph Challenge, and attain up to 2× speedup versus a data-parallel baseline.
Citations: 5
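For orientation, the kernel being tiled is sparse feed-forward inference: a ReLU of a sparse weight matrix applied to an activation tile. The SciPy sketch below shows only that baseline kernel with the batch processed in column tiles; the hypergraph partitioning, weak synchronization, and blocking contributed by the paper are not reproduced, and all names and sizes are illustrative.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def sparse_inference(batch_T, layers, bias, tile_cols=64):
    """batch_T holds one sample per column (features x samples).
    The batch is processed in column tiles; each layer W is a sparse CSR matrix
    applied as ReLU(W @ tile + bias)."""
    outputs = []
    for start in range(0, batch_T.shape[1], tile_cols):
        tile = batch_T[:, start:start + tile_cols]
        for W in layers:
            tile = np.maximum(W @ tile + bias, 0.0)   # sparse-times-dense, then ReLU
        outputs.append(tile)
    return np.hstack(outputs)

layers = [sparse_random(1024, 1024, density=0.01, format="csr", random_state=s)
          for s in range(5)]                          # five random sparse layers
batch_T = np.random.default_rng(0).random((1024, 256))
print(sparse_inference(batch_T, layers, bias=-0.3).shape)   # (1024, 256)
```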
Hash Table Scalability on Intel PIUMA
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286204
B. Seshasayee, J. Fryman, I. Hur
Abstract: The Intel PIUMA (Programmable and Integrated Unified Memory Architecture) is a scalable, massively multithreaded architecture designed to operate on unstructured data, with a global address space, fine-grain memory access, and various novel features for latency hiding during data movement. Hash tables are commonly used data structures for unstructured data, so it is imperative that hash table performance and scaling are optimized for this architecture. We study three different hash table implementations on a PIUMA simulator to show that a dual-atomics-based implementation, a unique feature in PIUMA, performs competitively both at larger scales and under hash collisions. Our implementations achieve strong scaling up to 16,384 hardware threads.
Citations: 1
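The workload under study is the classic insert/lookup loop of an open-addressing hash table. The sequential Python sketch below is only a functional stand-in for that loop; PIUMA's hardware threads, global address space, and dual-atomic operations, which are the subject of the paper, are not modeled here.

```python
def ht_insert(table, key, value):
    """Linear-probing insert into a fixed-size open-addressing table.
    (Sequential stand-in; the paper studies how the equivalent claim-a-slot
    step scales across thousands of hardware threads on PIUMA.)"""
    n = len(table)
    slot = hash(key) % n
    for probe in range(n):
        idx = (slot + probe) % n
        if table[idx] is None or table[idx][0] == key:
            table[idx] = (key, value)        # claim an empty slot or update the key
            return idx
    raise RuntimeError("hash table full")

def ht_lookup(table, key):
    n = len(table)
    slot = hash(key) % n
    for probe in range(n):
        idx = (slot + probe) % n
        if table[idx] is None:
            return None                      # hit an empty slot: key is absent
        if table[idx][0] == key:
            return table[idx][1]
    return None

table = [None] * 8
for k, v in [("alpha", 1), ("beta", 2), ("gamma", 3)]:
    ht_insert(table, k, v)
print(ht_lookup(table, "beta"))   # 2
```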
A High Throughput Parallel Hash Table on FPGA using XOR-based Memory
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286199
Ruizhi Zhang, Sasindu Wijeratne, Yang Yang, S. Kuppannagari, V. Prasanna
Abstract: A hash table is a fundamental data structure for quick search and retrieval of data. It is a key component in complex graph analytics and AI/ML applications. State-of-the-art parallel hash table implementations either make simplifying assumptions, such as supporting only a subset of hash table operations, or employ optimizations whose performance is highly data dependent and in the worst case can degrade to that of a sequential implementation. In contrast, in this work we develop a dynamic hash table that supports all hash table queries (search, insert, delete, and update) while sustaining p parallel queries (p > 1) per clock cycle via p processing engines (PEs) even in the worst case, i.e., performance is data-agnostic. We achieve this by implementing novel XOR-based multi-ported block memories on FPGAs. Additionally, we develop a technique to optimize the memory requirement of the hash table if the ratio of search to insert/update/delete queries is known beforehand. We implement our design on state-of-the-art FPGA devices. Our design is scalable to 16 PEs and supports throughput up to 5926 MOPS. It matches the throughput of the state-of-the-art hash table design FASTHash, which only supports search and insert operations. Compared with the best FPGA design that supports the same set of operations, our hash table achieves up to 12.3× speedup.
Citations: 5
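The enabling trick is the XOR-based multi-ported memory: each write port owns one bank, a write stores the data XORed with the other banks' contents at that address, and a read XORs all banks. A small software model of the two-write-port case is sketched below; the FPGA implementation, read-port replication, and the hash table built on top are not represented, and same-cycle writes to the same address are assumed not to occur.

```python
class XorMultiPortMemory:
    """Software model of an XOR-based memory with two write ports built from
    single-write-port banks."""

    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]   # one bank per write port

    def write(self, port, addr, data):
        other = self.banks[1 - port][addr]        # read the other bank at addr
        self.banks[port][addr] = data ^ other     # store data XOR other bank's contents

    def read(self, addr):
        return self.banks[0][addr] ^ self.banks[1][addr]   # XOR of all banks

mem = XorMultiPortMemory(depth=16)
mem.write(0, addr=3, data=0xABCD)    # write via port 0
mem.write(1, addr=7, data=0x1234)    # independent write via port 1
mem.write(1, addr=3, data=0x5555)    # later overwrite of addr 3 via port 1
print(hex(mem.read(3)), hex(mem.read(7)))   # 0x5555 0x1234
```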
Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286239
J. Thaler, Woong Shin, S. Roberts, James H. Rogers, Todd J. Rosedahl
Abstract: The number of computer processing nodes and processor cores in cluster systems is growing rapidly. Discovering, and reacting to, a hardware or environmental issue in a timely manner enables proper fault isolation, improves quality of service, and improves system up-time. In the case of performance impacts and node outages, RAS policies can direct actions such as job quiescence or migration. Additionally, power consumption, thermal information, and utilization metrics can be used to improve cluster energy and cooling efficiency as well as to optimize job placement. This paper describes a highly scalable telemetry architecture that allows event aggregation, application of RAS policies, and feedback to the cluster control system. The architecture advances existing approaches by including both programmable policies, which are applied as events stream through the hierarchical network to persistent storage, and treatment of sensor telemetry in an extensible framework. This implementation has proven robust and is in use in both cloud and HPC environments, including the 4,608-node Summit system at Oak Ridge National Laboratory [5].
Citations: 1
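The idea of programmable policies applied as events stream toward persistent storage can be pictured as a small generator pipeline. The sketch below is purely conceptual; the event schema, policy, and threshold are invented for illustration and are not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Event:
    node: str
    sensor: str
    value: float

def apply_policies(events, policies):
    """Pass each event through a chain of policy callables on its way to
    persistent storage; a policy may act on, transform, or drop the event."""
    for event in events:
        for policy in policies:
            event = policy(event)
            if event is None:                 # a policy chose to drop this event
                break
        if event is not None:
            yield event

def overtemp_alert(threshold=85.0):
    def policy(event):
        if event.sensor == "gpu_temp_c" and event.value > threshold:
            print(f"ALERT: {event.node} reports {event.value:.1f} C")
        return event
    return policy

stream = [Event("node0412", "gpu_temp_c", 92.3), Event("node0412", "power_w", 310.0)]
stored = list(apply_policies(stream, [overtemp_alert()]))   # events reaching storage
print(len(stored))   # 2
```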
Homomorphic Encryption Based Secure Sensor Data Processing
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286175
V. Gadepally, Mihailo Isakov, R. Agrawal, J. Kepner, K. Gettings, M. Kinsy
Abstract: Novel sensor processing algorithms face many hurdles to their adoption. Sensor processing environments have become increasingly difficult, with an ever-increasing array of threats. These threats have, in turn, raised the bar on deploying new capabilities. Many novel sensor processing algorithms exploit or induce randomness to boost algorithm performance. Co-designing this randomness with cryptographic features could be a powerful combination, providing both improved algorithm performance and increased resiliency. The emerging field of signal processing in the encrypted domain has begun to explore such approaches. The development of this new class of algorithms will require new classes of tools. In particular, the foundational linear algebraic mathematics will need to be enhanced with cryptographic concepts to allow researchers to explore this new domain. This work highlights a relatively low-overhead method that uses homomorphic encryption to enhance the resiliency of part of a larger sensor processing pipeline.
Citations: 0
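As a concrete picture of computing on encrypted sensor data, the toy Paillier-style sketch below sums readings without decrypting the individual samples. The key is far too small to be secure, and this is not claimed to be the scheme or pipeline used in the paper.

```python
import math
import random

# Toy Paillier keypair (tiny, insecure primes; for illustration only).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                 # since g = n + 1, L(g^lam mod n^2) = lam mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:       # r must be invertible modulo n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def add_encrypted(c1, c2):
    return (c1 * c2) % n2            # multiplying ciphertexts adds the plaintexts

readings = [17, 42, 5]                              # raw sensor samples
encrypted_sum = encrypt(0)
for m in readings:
    encrypted_sum = add_encrypted(encrypted_sum, encrypt(m))
print(decrypt(encrypted_sum))                       # 64 == sum(readings)
```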
Computing PageRank Scores of Web Crawl Data Using DGX A100 Clusters
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286216
Seunghwa Kang, Alexandre Fender, Joe Eaton, Brad Rees
Abstract: PageRank is a widely used graph analytics algorithm that ranks vertices using relationship data. Large-scale PageRank is challenging due to its irregular and communication-intensive computational characteristics. We implemented PageRank on NVIDIA's newly released DGX A100 cluster and compared the performance with two recent notable large-scale PageRank computations using the Common Crawl dataset. The ShenTu framework computed PageRank scores using a large number of custom microprocessors connected with an HPC-class interconnect. The Hronos framework reported the state-of-the-art performance using 3000 commodity CPU nodes and 10 Gbps Ethernet. The Common Crawl dataset captures link relationships between web pages in a graph with 3.563 billion vertices and 128.736 billion edges. Our implementation demonstrated 13× faster PageRank iteration time than the Hronos framework using a cluster with 32 NVLink-connected A100 GPUs.
Citations: 4
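For reference, each PageRank iteration is a sparse matrix-vector product plus a damping and dangling-mass correction. The single-machine SciPy sketch below shows that iteration on a toy graph; the multi-GPU partitioning and NVLink communication that the paper measures are, of course, absent, and the helper is not the cuGraph implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(adj, alpha=0.85, tol=1e-6, max_iter=100):
    """adj: square sparse adjacency matrix with adj[i, j] = 1 for an edge i -> j."""
    n = adj.shape[0]
    out_deg = np.asarray(adj.sum(axis=1)).ravel()
    inv_out = np.divide(1.0, out_deg, out=np.zeros(n), where=out_deg > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        dangling = r[out_deg == 0].sum()           # rank mass of nodes with no out-links
        r_new = alpha * (adj.T @ (r * inv_out)) + (alpha * dangling + 1 - alpha) / n
        if np.abs(r_new - r).sum() < tol:          # L1 convergence check
            return r_new
        r = r_new
    return r

rows, cols = [0, 0, 1, 2, 3], [1, 2, 2, 0, 2]      # a tiny 4-node example graph
adj = csr_matrix((np.ones(5), (rows, cols)), shape=(4, 4))
print(pagerank(adj))
```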
Analysis of floating-point round-off error in linear algebra routines for graph clustering
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286190
L. Yang, Alyson Fox
Abstract: We explore the various ways rounding errors can impact the power method for calculating the Fiedler vector for graph clustering. A rounding error analysis reveals that the best eigenpair computable in a given floating-point precision has a worst-case error that scales with its unit round-off. Although rounding errors can accumulate in the power method up to the worst-case bound, this behavior is not reflected in some practical examples. Furthermore, our numerical experiments show that rounding errors from the power method may satisfy the conditions necessary for bounding the mis-clustering rate, and that approximate eigenvectors with errors close to the half-precision unit round-off can yield clustering results sufficient for partitioning stochastic block model graphs.
Citations: 0
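One standard way to reach the Fiedler vector with the power method is to iterate on a shifted Laplacian while deflating the trivial constant eigenvector; the working precision of exactly this loop is what the round-off analysis concerns. The NumPy sketch below follows that standard formulation in float32 and may differ in detail from the paper's setup.

```python
import numpy as np

def fiedler_power_method(L, iters=500, dtype=np.float32):
    """Approximate the Fiedler vector of a graph Laplacian L via power iteration
    on the shifted matrix M = c*I - L, deflating the all-ones eigenvector."""
    L = L.astype(dtype)
    n = L.shape[0]
    c = dtype(2 * L.diagonal().max())        # shift so all eigenvalues of M are >= 0
    ones = np.ones(n, dtype=dtype) / np.sqrt(dtype(n))
    v = np.random.default_rng(0).standard_normal(n).astype(dtype)
    for _ in range(iters):
        v = c * v - L @ v                    # apply M = c*I - L
        v -= (ones @ v) * ones               # deflate the constant eigenvector
        v /= np.linalg.norm(v)               # renormalize in working precision
    return v

# Path graph on 5 vertices: Laplacian = D - A.
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
Lap = np.diag(A.sum(axis=1)) - A
v = fiedler_power_method(Lap)
print(np.round(v, 3))   # endpoints get opposite signs: the two clusters of the path
```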
Inference Benchmarking on HPC Systems
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286138
W. Brewer, G. Behm, A. Scheinine, Ben Parsons, Wesley Emeneker, Robert P. Trevino
Abstract: As deep learning on edge computing systems has become more prevalent, investigation of architectures and configurations for optimal inference performance has become a critical step for proposed artificial intelligence solutions. While there has been considerable work in the development of hardware and software for high-performance inferencing, little is known about the performance of such systems on HPC architectures. In this paper, we address outstanding questions on parallel inference performance on HPC systems. We report results and recommendations derived from evaluating iBench on multiple platforms in a variety of HPC configurations. We systematically benchmark single-GPU, single-node, and multi-node performance for maximum client-side and server-side inference throughput. To achieve linear speedup, we show that concurrent sending clients must be used, as opposed to sending large batch payloads parallelized across multiple GPUs. We show that client/server inferencing architectures add a considerable data-transfer component that must be taken into account when benchmarking HPC systems, a component that benchmarks such as MLPerf do not measure. Finally, we investigate the energy efficiency of GPUs for different levels of concurrency and batch sizes to report optimal configurations that minimize cost per inference.
Citations: 6
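The "concurrent sending clients" recommendation can be prototyped with a simple load generator: many clients each issuing small requests rather than one client shipping a huge batch. The sketch below uses only the Python standard library against a hypothetical HTTP endpoint; INFER_URL, the payload format, and the helpers are placeholders, not iBench or MLPerf code.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

INFER_URL = "http://localhost:8000/v1/infer"   # hypothetical inference server endpoint

def one_request(sample):
    """Send a single small inference request and return its latency in seconds."""
    body = json.dumps({"inputs": sample}).encode()
    req = Request(INFER_URL, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

def measure_throughput(samples, concurrency):
    """Issue requests from `concurrency` concurrent clients; report inferences/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_request, samples))
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed, sum(latencies) / len(latencies)

if __name__ == "__main__":
    samples = [[0.0] * 1024 for _ in range(64)]   # dummy fixed-size inputs
    for c in (1, 4, 16):                          # sweep client-side concurrency
        tput, mean_lat = measure_throughput(samples, c)
        print(f"clients={c:2d}  throughput={tput:7.1f} inf/s  mean latency={mean_lat*1000:.1f} ms")
```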
Fast GPU Graph Contraction by Combining Efficient Shallow Searches and Post-Culling
2020 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2020-09-22 | DOI: 10.1109/HPEC43674.2020.9286141
Roozbeh Karimi, David M. Koppelman, C. J. Michael
Abstract: Efficient GPU single-source shortest-path (SSSP) queries of road network graphs can be realized by a technique called PHAST (Delling et al.), in which the graph is contracted (pre-processed using Geisberger's Contraction Hierarchies) once and the resulting contracted graph is queried as needed. PHAST accommodates GPUs' parallelism requirements well, resulting in efficient queries. For situations in which a graph is not available well in advance or changes frequently, contraction time itself becomes significant. Karimi et al. recently described a GPU contraction technique, CU-CH, which significantly reduces the contraction time of small- to medium-sized graphs, reporting a speedup of over 20× on an NVIDIA P100 GPU. However, CU-CH realizes little speedup on larger graphs, such as the DIMACS USA and W. Europe graphs. The obstacle to faster contraction of larger graphs is the frequently performed witness path search (WPS). A WPS for a node determines which shortcut edges need to be added between the node's neighbors to maintain distances after the removal of the node. GPUs' strict thread convergence requirements and limited scratchpad preclude the bidirectional Dijkstra approach used in CPU implementations. Instead, CU-CH uses a two-hop-limit WPS tightly coded to fit GPU shared storage and to maintain thread convergence. Where two hops are sufficient, speedup is high, but for larger graphs the hop limit exacts a toll due to the accumulation of unneeded shortcuts. The problem is overcome here by retaining the efficient CU-CH WPS but using it both for its original purpose and to identify unnecessary shortcuts added in prior steps. The unnecessary shortcuts are culled (removed). Culling shortcuts not only dramatically reduces the time needed to contract a graph but also improves the quality of the contracted graph. For smaller graphs such as DIMACS Cal (travel time), contraction time is 61% faster compared to CU-CH. For the DIMACS Europe and USA graphs, contraction times are 40× and 12× faster, respectively. SSSP query times also improve dramatically, approaching those obtained on aggressively contracted graphs. The speedup over Geisberger's CPU code is over 100× for NVIDIA V100 GPUs on most graphs tried.
Citations: 0
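The witness path search (WPS) at the heart of the bottleneck can be stated compactly: when contracting node v, a shortcut between neighbors u and w is needed only if no sufficiently short path avoiding v "witnesses" the distance d(u,v) + d(v,w). The plain-Python sketch below implements a two-hop-limited WPS over adjacency dictionaries; the GPU shared-memory layout, thread-convergence constraints, and the post-culling pass introduced by the paper are not represented.

```python
def contract_node(graph, v):
    """graph: dict node -> dict neighbor -> edge weight (undirected).
    Returns the shortcuts (u, w, weight) required when removing v, keeping only
    pairs for which no witness path of at most two hops (avoiding v) is as short."""
    shortcuts = []
    neighbors = graph[v]
    for u in neighbors:
        for w in neighbors:
            if u >= w:                         # consider each unordered pair once
                continue
            via_v = neighbors[u] + neighbors[w]
            # Two-hop-limited witness search: direct edge u-w, or u-x-w with x != v.
            witness = graph[u].get(w, float("inf"))
            for x, d_ux in graph[u].items():
                if x != v and w in graph[x]:
                    witness = min(witness, d_ux + graph[x][w])
            if witness > via_v:                # no witness found: shortcut is needed
                shortcuts.append((u, w, via_v))
    return shortcuts

g = {
    "a": {"v": 1, "b": 5},
    "b": {"v": 1, "a": 5},
    "v": {"a": 1, "b": 1, "c": 4},
    "c": {"v": 4},
}
print(contract_node(g, "v"))   # [('a', 'b', 2), ('a', 'c', 5), ('b', 'c', 5)]
```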