Yu Huang, Long Zheng, Pengcheng Yao, Jieshan Zhao, Xiaofei Liao, Hai Jin, Jingling Xue
{"title":"A Heterogeneous PIM Hardware-Software Co-Design for Energy-Efficient Graph Processing","authors":"Yu Huang, Long Zheng, Pengcheng Yao, Jieshan Zhao, Xiaofei Liao, Hai Jin, Jingling Xue","doi":"10.1109/IPDPS47924.2020.00076","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00076","url":null,"abstract":"Processing-In-Memory (PIM) is an emerging technology that addresses the memory bottleneck of graph processing. In general, analog memristor-based PIM promises high parallelism provided that the underlying matrix-structured crossbar can be fully utilized while digital CMOS-based PIM has a faster single-edge execution but its parallelism can be low. In this paper, we observe that there is no absolute winner between these two representative PIM technologies for graph applications, which often exhibit irregular workloads. To reap the best of both worlds, we introduce a new heterogeneous PIM hardware, called Hetraph, to facilitate energy-efficient graph processing. Hetraph incorporates memristor-based analog computation units (for high-parallelism computing) and CMOS-based digital computation cores (for efficient computing) on the same logic layer of a 3D die-stacked memory device. To maximize the hardware utilization, our software design offers a hardware heterogeneity-aware execution model and a workload offloading mechanism. For performance speedups, such a hardware-software co-design outperforms the state-of-the-art by 7.54 ×(CPU), 1.56 ×(GPU), 4.13× (memristor-based PIM) and 3.05× (CMOS-based PIM), on average. For energy savings, Hetraph reduces the energy consumption by 57.58× (CPU), 19.93× (GPU), 14.02 ×(memristor-based PIM) and 10.48 ×(CMOS-based PIM), on average.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"42 1","pages":"684-695"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73779533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPF-ECC: Accelerating Elliptic Curve Cryptography with Floating-Point Computing Power of GPUs","authors":"Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, C. Weems","doi":"10.1109/IPDPS47924.2020.00058","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00058","url":null,"abstract":"Driven by artificial intelligence (AI) and computer vision industries, Graphics Processing Units (GPUs) are now rapidly achieving extraordinary computing power. In particular, the floating-point computing power, which is heavily relied on by graphics rendering and AI computation workload, is developing much faster in GPUs. Meanwhile, in many fields such as ecommerce and online finance, the demand for cryptographic operations for secure communications and authentication is also expanding.In this contribution, targeting the important cryptographic primitives widely used in TLS 1.3, etc., we implement Curve25519 and Edwards25519 with GPUs’ floating-point computing power, where various performance optimization methods are customized for the target platform, including novel big-number representations combined with a new floating-point-based computing algorithm, efficient merged reduction strategies, and curve-level acceleration. This paper reports record-setting performance for the elliptic-curve method: on TITAN V, we respectively achieve 7.21 and 77.30 million operations per second of unknown and known point multiplication of Edwards25519, and 13.55 million operations per second of point multiplication of Curve25519. To the best of our knowledge, this contribution is the first to show that floating-point-based ECC implementations can outperform the integer-based ones by a huge margin. The experimental result in Tesla P100 achieves over double performance of the existing fastest integer work on the same platform, and the result in TITAN V sets a record for the throughput which is 4.43 times better than the second.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"494-504"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81262260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tirthak Patel, Adam Wagenhäuser, C. Eibel, Timo Hönig, T. Zeiser, Devesh Tiwari
{"title":"What does Power Consumption Behavior of HPC Jobs Reveal? : Demystifying, Quantifying, and Predicting Power Consumption Characteristics","authors":"Tirthak Patel, Adam Wagenhäuser, C. Eibel, Timo Hönig, T. Zeiser, Devesh Tiwari","doi":"10.1109/IPDPS47924.2020.00087","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00087","url":null,"abstract":"As we approach exascale computing, large-scale HPC systems are becoming increasingly power-constrained, requiring them to run HPC workloads in an energy-efficient manner. The first step toward achieving this goal is to better understand, analyze, and quantify the power consumption characteristics of HPC jobs. However, there is a lack of understanding of the power consumption characteristics of HPC jobs which run on production HPC systems. Such characterization is required to guide the design of the next generation of power-aware resource management. To the best of our knowledge, we are the first study to open-source the data and analysis of power-consumption characteristics of HPC jobs and users from two medium-scale production HPC clusters.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 1","pages":"799-809"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85980620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shikha Singh, S. Madaminov, M. A. Bender, M. Ferdman, Ryan Johnson, Benjamin Moseley, H. Ngo, D. Nguyen, Soeren Olesen, R. Stirewalt, Geoffrey Washburn
{"title":"A Scheduling Approach to Incremental Maintenance of Datalog Programs","authors":"Shikha Singh, S. Madaminov, M. A. Bender, M. Ferdman, Ryan Johnson, Benjamin Moseley, H. Ngo, D. Nguyen, Soeren Olesen, R. Stirewalt, Geoffrey Washburn","doi":"10.1109/IPDPS47924.2020.00093","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00093","url":null,"abstract":"In this paper, we study the problem of incremental maintenance of Datalog programs and model it as a scheduling problem on DAGs. We design provably good time- and memory-efficient scheduling algorithms for (re)executing a Datalog program where some (but not necessarily all) of the inputs have changed. We prove that our schedulers, called LevelBased and LevelBased with lookahead, have asymptotically improved running time and space efficiency when compared with benchmark algorithms used in production at LogicBlox.The main result of the paper is a hybrid scheduler, which combines LevelBased with the production LogicBlox scheduler (or any other heuristic scheduler). The hybrid scheduler achieves strong worst-case guarantees and robustness without losing out on the best-case behavior of the production LogicBlox scheduler. Our experiments show that the hybrid scheduler results in similar or improved total execution times compared to LogicBlox scheduler, while consistently reducing the scheduling overhead—by as much as 50% on some datasets. This hybrid scheme requires little to no overhead but provides predictability and reliability, which are crucial in a commercial application such as LogicBlox.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"23 1","pages":"864-873"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89439085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mark Clark, Yingping Chen, Avinash Karanth, D. Ma, A. Louri
{"title":"DozzNoC: Reducing Static and Dynamic Energy in NoCs with Low-latency Voltage Regulators using Machine Learning","authors":"Mark Clark, Yingping Chen, Avinash Karanth, D. Ma, A. Louri","doi":"10.1109/IPDPS47924.2020.00011","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00011","url":null,"abstract":"Network-on-chips (NoCs) continues to be the choice of communication fabric in multicore architectures because the NoC effectively combines the resource efficiency of the bus with the parallelizability of the crossbar. As NoC suffers from both high static and dynamic energy consumption, power-gating and dynamic voltage and frequency scaling (DVFS) have been proposed in the literature to improve energy-efficiency. In this work, we propose DozzNoC, an adaptable power management technique that effectively combines power-gating and DVFS techniques to target both static power and dynamic energy reduction with a single inductor multiple output (SIMO) voltage regulator. The proposed power management design is further enhanced by machine learning techniques that predict future traffic load for proactive DVFS mode selection. DozzNoC utilizes a SIMO voltage regulator scheme that allows for fast, low-powered, and independently power-gated or voltage scaled routers such that each router and its outgoing links share the same voltage/frequency domain. Our simulation results using PARSEC and Splash-2 benchmarks on an 8 × 8 mesh network show that for a decrease of 7% in throughput, we can achieve an average dynamic energy savings of 25% and an average static power reduction of 53%.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"6 1","pages":"1-11"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72745490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Union: An Automatic Workload Manager for Accelerating Network Simulation","authors":"Xin Wang, M. Mubarak, Yao Kang, R. Ross, Z. Lan","doi":"10.1109/IPDPS47924.2020.00089","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00089","url":null,"abstract":"With the rapid growth of the machine learning applications, the workloads of future HPC systems are anticipated to be a mix of scientific simulation, big data analytics, and machine learning applications. Simulation is a great research vehicle to understand the performance implications of co-running scientific applications with big data and machine learning workloads on large-scale systems. In this paper, we present Union, a workload manager that provides an automatic framework to facilitate hybrid workload simulation in CODES. Furthermore, we use Union, along with CODES, to investigate various hybrid workloads composed of traditional simulation applications and emerging learning applications on two dragonfly systems. The experiment results show that both message latency and communication time are important performance metrics to evaluate network interference. Network interference on HPC applications is more reflected by the message latency variation, whereas ML application performance depends more on the communication time.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"22 1","pages":"821-830"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81868661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weaver: Efficient Coflow Scheduling in Heterogeneous Parallel Networks","authors":"X. Huang, Yiting Xia, T. Ng","doi":"10.1109/IPDPS47924.2020.00113","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00113","url":null,"abstract":"Leveraging application-level requirements expressed in Coflows has been shown to improve application-level communication efficiency. However, most existing works assume all application traffic is serviced by one monolithic network. This over-simplified assumption is no longer sufficient in a modern, evolving data center which operates on multiple generations of network fabrics, an architecture that we define as Heterogeneous Parallel Networks (HPNs). In this paper, we present the first scheduler, called Weaver, that addresses the Coflow management problem in HPNs. To exploit HPNs fully, achieving high communication efficiency for applications is crucial, yet it is also challenging because of the complex traffic patterns in applications and the heterogeneous bandwidth distribution in HPNs. Weaver addresses these challenges at two levels. At the microscopic level, for each application, Weaver leverages an efficient algorithm to exploit the distributed bandwidth in HPNs, which we proved to be within a constant factor of the optimal. At the macroscopic level involving multiple applications, Weaver can adopt a range of application traffic scheduling policies as desired by the system operator. Under realistic traffic, Weaver enables HPNs to service Coflows as efficiently as a monolithic network with equivalent aggregated capacity.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"136 1","pages":"1071-1081"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78189776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs","authors":"Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, Wen-Mei Hwu","doi":"10.1109/IPDPS47924.2020.00042","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00042","url":null,"abstract":"There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which limits the thoroughness and usefulness of the profiling results.This paper proposes XSP — an across-stack profiling design that gives a holistic and hierarchical view of ML model execution. XSP leverages distributed tracing to aggregate and correlate profile data from different sources. XSP introduces a leveled and iterative measurement approach that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead. We couple the profiling design with an automated analysis pipeline to systematically analyze 65 state-of-the-art ML models. We demonstrate that XSP provides insights which would be difficult to discern otherwise.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"37 12","pages":"326-327"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141208698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Collection of IoT Devices Using an Energy-Constrained UAV","authors":"Yuchen Li, W. Liang, Wenzheng Xu, X. Jia","doi":"10.1109/IPDPS47924.2020.00072","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00072","url":null,"abstract":"In this paper, we study sensing data collection from IoT devices in a wireless sensor network, using an energy-constrained Unmanned Aerial Vehicle (UAV), where the sensory data is stored in IoT devices while the IoT devices may or may not be within the transmission range of each other. We formulate two novel data collection problems to fully or partially collect data from IoT devices using the UAV, by finding a closed tour for the UAV that includes hovering locations and the sojourn duration at each of the hovering locations such that the accumulative volume of data collected is maximized, subject to the energy capacity on the UAV, where the UAV consumes its energy on both hovering and flying from one hovering location to another hovering location. To this end, we first propose a novel data collection framework that enables the UAV to collect the sensory data from multiple IoT devices simultaneously if the IoT devices are within the hovering coverage range of the UAV. We then formulate two data collection maximization problems, and show that both of the problems are NP-hard. We instead devise efficient approximation and heuristic algorithms for the problems. We finally evaluate the performance of the proposed algorithms through experimental simulations. Experimental results demonstrated that the proposed algorithms are promising.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"103 5","pages":"644-653"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72855773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhijie Huang, Hong Jiang, Zhirong Shen, Hao Che, Nong Xiao, Ning Li
{"title":"Optimal Encoding and Decoding Algorithms for the RAID-6 Liberation Codes","authors":"Zhijie Huang, Hong Jiang, Zhirong Shen, Hao Che, Nong Xiao, Ning Li","doi":"10.1109/IPDPS47924.2020.00078","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00078","url":null,"abstract":"RAID-6 is gradually replacing RAID-5 as the dominant form of disk arrays due to its capability of tolerating concurrent failures of any two disks, as well as the case of encountering an uncorrectable read error during recovery. Implementing a RAID-6 system relies on some erasure coding schemes, and so far the most representative solutions are EVENODD codes [1], RDP codes [2] and Liberation codes [3], none of which has emerged as a clear \"all-around\" winner. In this paper, we are interested in revealing the undiscovered potential of the Liberation codes, since these codes have the following attractive features: (a) they have the best update performance, (b) they have better scalability, and (c) they are open-sourced and publicly available, as well as the following drawbacks: fair encoding performance and, more importantly, relatively poor decoding performance. Specificly, we present novel optimal encoding and decoding algorithms for the Liberation codes by introducing an alternative, geometric presentation of these codes. The proposed algorithms completely eliminate redundant computations during the encoding and decoding procedures by extracting and reusing common expressions between the two types of parity constraints, and do not involve any matrix operations on which the original algorithms are based. Our experiment results show that compared with the original solution, the proposed encoding and decoding algorithms reduce the number of XOR’s by up to 16 percent and 15 ~20 percent respectively, and the encoding and decoding throughputs are increased by 22.3 percent and at most 155 percent respectively. Moreover, the encoding complexity reaches the theoretical lower bound, while the decoding complexity is also very close to the theoretical lower bound.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"8 1","pages":"708-717"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75436775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}