2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers, Page 9

A Heterogeneous PIM Hardware-Software Co-Design for Energy-Efficient Graph Processing

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00076

Yu Huang, Long Zheng, Pengcheng Yao, Jieshan Zhao, Xiaofei Liao, Hai Jin, Jingling Xue

Pages: 684-695

Abstract: Processing-In-Memory (PIM) is an emerging technology that addresses the memory bottleneck of graph processing. In general, analog memristor-based PIM promises high parallelism provided that the underlying matrix-structured crossbar can be fully utilized, while digital CMOS-based PIM has faster single-edge execution but its parallelism can be low. In this paper, we observe that there is no absolute winner between these two representative PIM technologies for graph applications, which often exhibit irregular workloads. To reap the best of both worlds, we introduce a new heterogeneous PIM hardware, called Hetraph, to facilitate energy-efficient graph processing. Hetraph incorporates memristor-based analog computation units (for high-parallelism computing) and CMOS-based digital computation cores (for efficient computing) on the same logic layer of a 3D die-stacked memory device. To maximize the hardware utilization, our software design offers a hardware heterogeneity-aware execution model and a workload offloading mechanism. For performance speedups, such a hardware-software co-design outperforms the state-of-the-art by 7.54× (CPU), 1.56× (GPU), 4.13× (memristor-based PIM) and 3.05× (CMOS-based PIM), on average. For energy savings, Hetraph reduces the energy consumption by 57.58× (CPU), 19.93× (GPU), 14.02× (memristor-based PIM) and 10.48× (CMOS-based PIM), on average.
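
The trade-off the abstract describes (a crossbar pays off only when it is well utilized, while a digital core is faster per edge) can be illustrated with a toy dispatcher. This is a minimal sketch under stated assumptions, not Hetraph's actual offloading mechanism: the function names, the 4×4 block size, and the 0.5 utilization threshold are all hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (row, col) of an edge inside one crossbar-sized block


def block_utilization(edges: List[Edge], block: int) -> float:
    """Fraction of the block x block crossbar cells occupied by edges."""
    cells = set(edges)  # duplicate edges occupy the same cell
    return len(cells) / (block * block)


def choose_unit(edges: List[Edge], block: int = 4, threshold: float = 0.5) -> str:
    """Hypothetical rule: dense blocks go to the analog crossbar,
    sparse blocks go to a digital core that handles edges one by one."""
    if block_utilization(edges, block) >= threshold:
        return "analog"
    return "digital"


dense_block = [(r, c) for r in range(4) for c in range(4)]
sparse_block = [(0, 1)]
print(choose_unit(dense_block), choose_unit(sparse_block))
```

A real scheduler would presumably weigh energy and latency models for both units; the single threshold here only makes the "no absolute winner" observation concrete.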

Citations: 31

DPF-ECC: Accelerating Elliptic Curve Cryptography with Floating-Point Computing Power of GPUs

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00058

Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, C. Weems

Pages: 494-504

Abstract: Driven by the artificial intelligence (AI) and computer vision industries, Graphics Processing Units (GPUs) are now rapidly achieving extraordinary computing power. In particular, floating-point computing power, which graphics rendering and AI workloads rely on heavily, is developing much faster in GPUs. Meanwhile, in many fields such as e-commerce and online finance, the demand for cryptographic operations for secure communications and authentication is also expanding. In this contribution, targeting the important cryptographic primitives widely used in TLS 1.3, etc., we implement Curve25519 and Edwards25519 with GPUs' floating-point computing power, where various performance optimization methods are customized for the target platform, including novel big-number representations combined with a new floating-point-based computing algorithm, efficient merged reduction strategies, and curve-level acceleration. This paper reports record-setting performance for the elliptic-curve method: on TITAN V, we respectively achieve 7.21 and 77.30 million operations per second of unknown and known point multiplication of Edwards25519, and 13.55 million operations per second of point multiplication of Curve25519. To the best of our knowledge, this contribution is the first to show that floating-point-based ECC implementations can outperform the integer-based ones by a huge margin. On Tesla P100, the implementation more than doubles the performance of the fastest existing integer-based work on the same platform, and the result on TITAN V sets a throughput record that is 4.43 times better than the second best.
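
The idea of carrying big-number arithmetic on floating-point units can be sketched as follows. This is an illustrative assumption, not the paper's representation: the 22-bit limb width, the 12-limb layout, and all function names are hypothetical, and the final modular reduction falls back to Python integers instead of the merged reduction the abstract mentions. The point it shows is that every partial product and column sum stays below 2^53, so IEEE 754 doubles compute them exactly.

```python
from typing import List

LIMB_BITS = 22   # assumed limb width
NUM_LIMBS = 12   # 12 * 22 = 264 bits, enough for a 255-bit field element
P = 2**255 - 19  # field prime of Curve25519 / Edwards25519


def to_limbs(x: int) -> List[float]:
    """Split a non-negative integer into 22-bit limbs stored as doubles."""
    mask = (1 << LIMB_BITS) - 1
    return [float((x >> (LIMB_BITS * i)) & mask) for i in range(NUM_LIMBS)]


def mul_limbs(a: List[float], b: List[float]) -> List[float]:
    """Schoolbook product using only floating-point multiply and add.
    Each term is below 2**44 and each column sums at most 12 terms,
    so every intermediate value is below 2**48 and therefore exact."""
    out = [0.0] * (2 * NUM_LIMBS - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def from_limbs(limbs: List[float]) -> int:
    """Recombine (possibly unnormalized) limbs into a Python integer."""
    return sum(int(v) << (LIMB_BITS * k) for k, v in enumerate(limbs))


def field_mul(x: int, y: int) -> int:
    """Multiply modulo P; the reduction uses Python ints for brevity."""
    return from_limbs(mul_limbs(to_limbs(x), to_limbs(y))) % P


x = P - 12345
y = 2**200 + 987654321
print(field_mul(x, y) == (x * y) % P)
```

Narrower limbs leave more headroom for accumulating terms before a carry pass is needed, at the cost of more limbs per operand; 22 bits is just one workable choice for this sketch.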

Citations: 16

What does Power Consumption Behavior of HPC Jobs Reveal? : Demystifying, Quantifying, and Predicting Power Consumption Characteristics

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00087

Tirthak Patel, Adam Wagenhäuser, C. Eibel, Timo Hönig, T. Zeiser, Devesh Tiwari

Citations: 18

A Scheduling Approach to Incremental Maintenance of Datalog Programs

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00093

Shikha Singh, S. Madaminov, M. A. Bender, M. Ferdman, Ryan Johnson, Benjamin Moseley, H. Ngo, D. Nguyen, Soeren Olesen, R. Stirewalt, Geoffrey Washburn

Pages: 864-873

Abstract: In this paper, we study the problem of incremental maintenance of Datalog programs and model it as a scheduling problem on DAGs. We design provably good time- and memory-efficient scheduling algorithms for (re)executing a Datalog program where some (but not necessarily all) of the inputs have changed. We prove that our schedulers, called LevelBased and LevelBased with lookahead, have asymptotically improved running time and space efficiency when compared with benchmark algorithms used in production at LogicBlox. The main result of the paper is a hybrid scheduler, which combines LevelBased with the production LogicBlox scheduler (or any other heuristic scheduler). The hybrid scheduler achieves strong worst-case guarantees and robustness without losing out on the best-case behavior of the production LogicBlox scheduler. Our experiments show that the hybrid scheduler results in similar or improved total execution times compared to the LogicBlox scheduler, while consistently reducing the scheduling overhead, by as much as 50% on some datasets. This hybrid scheme requires little to no overhead but provides predictability and reliability, which are crucial in a commercial application such as LogicBlox.
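
Modeling incremental maintenance as scheduling on a DAG can be sketched as below. The abstract does not spell out how LevelBased works, so this is one plausible reading rather than the paper's algorithm: each node gets a level (longest path from a source), dirty nodes are re-executed in increasing level order, and a node's dependents are scheduled only if its output actually changed. The `output_changed` callback and all names are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Callable, Dict, List


def compute_levels(deps: Dict[str, List[str]]) -> Dict[str, int]:
    """deps maps every node to the nodes it reads from (a DAG).
    A node's level is the length of the longest path from any source."""
    levels: Dict[str, int] = {}

    def level(node: str) -> int:
        if node not in levels:
            parents = deps.get(node, [])
            levels[node] = 1 + max((level(p) for p in parents), default=-1)
        return levels[node]

    for node in deps:
        level(node)
    return levels


def rerun(deps: Dict[str, List[str]], changed: List[str],
          output_changed: Callable[[str], bool]) -> List[str]:
    """Return the order in which nodes are re-executed."""
    levels = compute_levels(deps)
    children = defaultdict(list)
    for node, parents in deps.items():
        for p in parents:
            children[p].append(node)

    heap = [(levels[n], n) for n in changed]
    heapq.heapify(heap)
    queued = set(changed)
    order = []
    while heap:
        _, node = heapq.heappop(heap)  # lowest level first
        order.append(node)             # "re-execute" the node here
        if output_changed(node):       # propagate only real changes
            for child in children[node]:
                if child not in queued:
                    queued.add(child)
                    heapq.heappush(heap, (levels[child], child))
    return order


deps = {"edge": [], "path": ["edge"], "reach": ["path"], "count": ["edge"]}
order = rerun(deps, ["edge"], output_changed=lambda n: n != "path")
print(order)
```

Because every ancestor of a node has a strictly lower level, processing in level order means a node runs only after all of its dirty ancestors have finished, so no node is executed twice. In the example, `reach` is skipped because `path` reports an unchanged output.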

Citations: 1

DozzNoC: Reducing Static and Dynamic Energy in NoCs with Low-latency Voltage Regulators using Machine Learning

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00011

Mark Clark, Yingping Chen, Avinash Karanth, D. Ma, A. Louri

Pages: 1-11

Abstract: Networks-on-chip (NoCs) continue to be the communication fabric of choice in multicore architectures because the NoC effectively combines the resource efficiency of the bus with the parallelizability of the crossbar. As NoCs suffer from both high static and dynamic energy consumption, power-gating and dynamic voltage and frequency scaling (DVFS) have been proposed in the literature to improve energy efficiency. In this work, we propose DozzNoC, an adaptable power management technique that effectively combines power-gating and DVFS techniques to target both static power and dynamic energy reduction with a single-inductor multiple-output (SIMO) voltage regulator. The proposed power management design is further enhanced by machine learning techniques that predict future traffic load for proactive DVFS mode selection. DozzNoC utilizes a SIMO voltage regulator scheme that allows for fast, low-powered, and independently power-gated or voltage-scaled routers such that each router and its outgoing links share the same voltage/frequency domain. Our simulation results using PARSEC and Splash-2 benchmarks on an 8 × 8 mesh network show that for a decrease of 7% in throughput, we can achieve an average dynamic energy savings of 25% and an average static power reduction of 53%.
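
Proactive mode selection from a predicted traffic load can be sketched in a few lines. Everything here is an assumption made for illustration: the exponential moving average stands in for the machine learning predictor, and the mode names and load thresholds are hypothetical, not taken from DozzNoC.

```python
from typing import List, Tuple

# (upper bound on predicted load, mode name); thresholds are hypothetical
MODES: List[Tuple[float, str]] = [
    (0.0, "power-gated"),
    (0.3, "low V/F"),
    (0.7, "mid V/F"),
    (1.0, "high V/F"),
]


def predict_load(history: List[float], alpha: float = 0.5) -> float:
    """Exponential moving average over past per-epoch router utilization."""
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


def select_mode(predicted: float) -> str:
    """Pick the lowest-power mode whose load bound covers the prediction."""
    for upper, name in MODES:
        if predicted <= upper:
            return name
    return MODES[-1][1]


print(select_mode(predict_load([0.8, 0.8])))  # busy router
print(select_mode(predict_load([0.0, 0.0])))  # idle router
```

Choosing the mode before the load arrives is what makes the scheme proactive; a reactive design would instead switch only after congestion or idleness has been observed.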

Citations: 2

Union: An Automatic Workload Manager for Accelerating Network Simulation

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00089

Xin Wang, M. Mubarak, Yao Kang, R. Ross, Z. Lan

Citations: 6

Weaver: Efficient Coflow Scheduling in Heterogeneous Parallel Networks

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00113

X. Huang, Yiting Xia, T. Ng

Pages: 1071-1081

Abstract: Leveraging application-level requirements expressed in Coflows has been shown to improve application-level communication efficiency. However, most existing works assume all application traffic is serviced by one monolithic network. This over-simplified assumption is no longer sufficient in a modern, evolving data center which operates on multiple generations of network fabrics, an architecture that we define as Heterogeneous Parallel Networks (HPNs). In this paper, we present the first scheduler, called Weaver, that addresses the Coflow management problem in HPNs. To exploit HPNs fully, achieving high communication efficiency for applications is crucial, yet it is also challenging because of the complex traffic patterns in applications and the heterogeneous bandwidth distribution in HPNs. Weaver addresses these challenges at two levels. At the microscopic level, for each application, Weaver leverages an efficient algorithm to exploit the distributed bandwidth in HPNs, which we prove to be within a constant factor of the optimal. At the macroscopic level involving multiple applications, Weaver can adopt a range of application traffic scheduling policies as desired by the system operator. Under realistic traffic, Weaver enables HPNs to service Coflows as efficiently as a monolithic network with equivalent aggregated capacity.
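
The "equivalent aggregated capacity" benchmark in the last sentence can be made concrete with a single flow. The sketch below is not Weaver's algorithm; it only shows, under the simplifying assumption of one flow that can be split freely, that dividing the bytes in proportion to each network's bandwidth makes all parts finish together, at the same time a monolithic network with the summed bandwidth would need. The function name and units are hypothetical.

```python
from typing import List, Tuple


def split_flow(size: float, bandwidths: List[float]) -> Tuple[List[float], float]:
    """Split `size` bytes across parallel networks in proportion to bandwidth.
    Returns the per-network shares and the common completion time."""
    total = sum(bandwidths)
    shares = [size * b / total for b in bandwidths]
    return shares, size / total


# One 120-unit flow over three fabrics of different generations
shares, finish = split_flow(120.0, [10.0, 40.0, 10.0])
print(shares, finish)
```

Real Coflows consist of many flows between different endpoint pairs, which is where the scheduling problem becomes hard; the single-flow case only fixes the target that the abstract compares against.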

Citations: 8

XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00042

Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, Wen-Mei Hwu

Pages: 326-327

Abstract: There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which limits the thoroughness and usefulness of the profiling results. This paper proposes XSP, an across-stack profiling design that gives a holistic and hierarchical view of ML model execution. XSP leverages distributed tracing to aggregate and correlate profile data from different sources. XSP introduces a leveled and iterative measurement approach that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead. We couple the profiling design with an automated analysis pipeline to systematically analyze 65 state-of-the-art ML models. We demonstrate that XSP provides insights which would be difficult to discern otherwise.
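
Correlating profile data from different stack levels into one hierarchy can be sketched with time intervals. This is an assumed mechanism for illustration, not XSP's implementation: the three level numbers, the `Span` type, and the containment rule (a span's parent is the tightest span one level up whose interval encloses it) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    level: int    # assumed levels: 0 = model, 1 = layer, 2 = GPU kernel
    start: float  # timestamps in milliseconds
    end: float


def find_parent(child: Span, spans: List[Span]) -> Optional[Span]:
    """Return the tightest span one level up that contains the child."""
    candidates = [
        s for s in spans
        if s.level == child.level - 1
        and s.start <= child.start and child.end <= s.end
    ]
    return min(candidates, key=lambda s: s.end - s.start, default=None)


spans = [
    Span("model", 0, 0.0, 10.0),
    Span("conv1", 1, 1.0, 4.0),
    Span("relu1", 1, 5.0, 6.0),
    Span("conv_kernel", 2, 2.0, 3.0),
]
kernel_parent = find_parent(spans[3], spans)
print(kernel_parent.name)
```

Interval containment only works when the clocks of the different profilers are aligned and the enclosing spans do not overlap, which is one reason measurement overhead at each level matters.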

Citations: 14

Data Collection of IoT Devices Using an Energy-Constrained UAV

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00072

Yuchen Li, W. Liang, Wenzheng Xu, X. Jia

Pages: 644-653

Abstract: In this paper, we study sensing data collection from IoT devices in a wireless sensor network, using an energy-constrained Unmanned Aerial Vehicle (UAV), where the sensory data is stored in IoT devices while the IoT devices may or may not be within the transmission range of each other. We formulate two novel data collection problems to fully or partially collect data from IoT devices using the UAV, by finding a closed tour for the UAV that includes hovering locations and the sojourn duration at each of the hovering locations such that the accumulative volume of data collected is maximized, subject to the energy capacity on the UAV, where the UAV consumes its energy on both hovering and flying from one hovering location to another hovering location. To this end, we first propose a novel data collection framework that enables the UAV to collect the sensory data from multiple IoT devices simultaneously if the IoT devices are within the hovering coverage range of the UAV. We then formulate two data collection maximization problems, and show that both of the problems are NP-hard. We therefore devise efficient approximation and heuristic algorithms for the problems. We finally evaluate the performance of the proposed algorithms through experimental simulations. Experimental results demonstrate that the proposed algorithms are promising.
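
The energy constraint in the problem statement (energy spent on flying between hovering locations plus energy spent hovering at them) can be written down directly. The sketch below only evaluates a given closed tour; it does not implement the paper's approximation or heuristic algorithms. The linear energy model, the parameter names, and the units are assumptions for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def tour_energy(depot: Point, stops: List[Tuple[Point, float]],
                fly_cost_per_m: float, hover_power: float) -> float:
    """Energy of a closed tour depot -> stops -> depot.
    stops: list of (hovering location, sojourn duration in seconds).
    Assumes flying energy is linear in distance and hovering energy
    is linear in time."""
    points = [depot] + [p for p, _ in stops] + [depot]
    distance = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    hover_time = sum(t for _, t in stops)
    return distance * fly_cost_per_m + hover_time * hover_power


def is_feasible(energy: float, capacity: float) -> bool:
    """A tour is feasible if it fits in the UAV's energy capacity."""
    return energy <= capacity


stops = [((3.0, 4.0), 10.0), ((3.0, 0.0), 5.0)]
energy = tour_energy((0.0, 0.0), stops, fly_cost_per_m=2.0, hover_power=2.0)
print(energy, is_feasible(energy, capacity=60.0))
```

The optimization problem is then to choose the hovering locations and sojourn durations that maximize collected data while keeping this quantity within the capacity, which is the part the abstract shows to be NP-hard.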

Citations: 12

Optimal Encoding and Decoding Algorithms for the RAID-6 Liberation Codes

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00078

Zhijie Huang, Hong Jiang, Zhirong Shen, Hao Che, Nong Xiao, Ning Li

Pages: 708-717

Abstract: RAID-6 is gradually replacing RAID-5 as the dominant form of disk arrays due to its capability of tolerating concurrent failures of any two disks, as well as the case of encountering an uncorrectable read error during recovery. Implementing a RAID-6 system relies on some erasure coding schemes, and so far the most representative solutions are EVENODD codes [1], RDP codes [2] and Liberation codes [3], none of which has emerged as a clear "all-around" winner. In this paper, we are interested in revealing the undiscovered potential of the Liberation codes, since these codes have the following attractive features: (a) they have the best update performance, (b) they have better scalability, and (c) they are open-sourced and publicly available, as well as the following drawbacks: fair encoding performance and, more importantly, relatively poor decoding performance. Specifically, we present novel optimal encoding and decoding algorithms for the Liberation codes by introducing an alternative, geometric presentation of these codes. The proposed algorithms completely eliminate redundant computations during the encoding and decoding procedures by extracting and reusing common expressions between the two types of parity constraints, and do not involve any matrix operations on which the original algorithms are based. Our experimental results show that compared with the original solution, the proposed encoding and decoding algorithms reduce the number of XORs by up to 16 percent and 15 to 20 percent respectively, and the encoding and decoding throughputs are increased by 22.3 percent and at most 155 percent respectively. Moreover, the encoding complexity reaches the theoretical lower bound, while the decoding complexity is also very close to the theoretical lower bound.
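
Since the abstract measures cost in XOR operations, a small example of XOR-based parity may help. The sketch below is deliberately not the Liberation code: it shows only the plain row parity that RAID-6 schemes use as their first parity, plus recovery of one lost block, and it omits the second parity needed to survive two failures. Function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def row_parity(blocks: List[bytes]) -> bytes:
    """Row parity: XOR of all data blocks (costs len(blocks) - 1 block XORs)."""
    return reduce(xor_blocks, blocks)


def recover(blocks: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing block (marked None) from the survivors."""
    survivors = [b for b in blocks if b is not None]
    return reduce(xor_blocks, survivors, parity)


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = row_parity(data)
rebuilt = recover([data[0], None, data[2]], parity)
print(rebuilt == data[1])
```

Reusing an intermediate XOR result that appears in more than one parity equation is the kind of saving the abstract refers to as extracting common expressions; with a single parity, as here, there is nothing to share yet.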

Citations: 1