{"title":"Embedded GPU Cluster Computing Framework for Inference of Convolutional Neural Networks","authors":"Evan T. Kain, Diego Wildenstein, A. Pineda","doi":"10.1109/HPEC.2019.8916253","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916253","url":null,"abstract":"The growing need for on-board image processing for space vehicles requires computing solutions that are both low-power and high-performance. Parallel computation using low-power embedded Graphics Processing Units (GPUs) satisfy both requirements. Our experiment involves the use of OpenMPI domain decomposition of an image processing algorithm based upon a pre-trained convolutional neural network (CNN) developed by the U.S. Air Force Research Laboratory (AFRL). Our testbed consists of six NVIDIA Jetson TX2 development boards operating in parallel. This parallel framework results in a speedup of $4.3 times $ on six processing nodes. This approach also leads to a linear decay in parallel efficiency as more processing nodes are added to the network. By replicating the data across processors in addition to distributing, we also characterize the best-case impact of adding triple modular redundancy (TMR) to our application.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123545639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Parallelism of Breadth First Search (BFS) Algorithm for Accelerated Performance on GPUs","authors":"Hao Wen, W. Zhang","doi":"10.1109/HPEC.2019.8916551","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916551","url":null,"abstract":"Breadth-first search (BFS) is a basis for graph search and a core building block for many higher-level graph analysis applications. However, BFS is also a typical example of parallel computation that is inefficient on GPU architectures. In a graph, a small portion of nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. In addition, the number of nodes in each layer of the graph is also irregular. Therefore, the number of active GPU threads is different for each layer of execution. These irregularities limit the parallelism of BFS executing on GPUs.Unlike the previous works focusing on fine-grained task management to address the irregularity, we propose Virtual-BFS (VBFS) to virtually change the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases the parallelism such that more GPU threads can work concurrently, and the data set also becomes more regular.Our experimental results show that the VBFS achieves significant speedup over the current GPU implementation of BFS from the Rodinia benchmark [4], and the energy efficiency is also improved.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131954787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MeXT: A Flow for Multiprocessor Exploration","authors":"C. Bobda, H. Ishebabi, Philipp Mahr, Joel Mandebi Mbongue, S. Saha","doi":"10.1109/HPEC.2019.8916428","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916428","url":null,"abstract":"This paper presents an extended design approach for heterogeneous multiprocessor systems. The goal in this particular design exploration approach is to ease the implementation of an adaptive multiprocessor system by creating components such as processing nodes or memories from an application. A program is profiled and analysed to gather information about task precedence, communication cost or computational patterns for hardware accelerator generation. This information is then used to solve an optimization problem using Integer Linear Programming or Answer Set Programming with the goal of 1) creating suitable multiprocessor hardware architecture and 2) mapping of tasks onto the processors. A lightweight message-passing library for on-chip communication of parallel programs is provided. The resulting abstract architecture is further processed using the vendor tool-chain to generate the target platform’s configuration. Two real-world case studies are used to show the feasibility of our design-space exploration approach.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132474056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast BFS-Based Triangle Counting on GPUs","authors":"Leyuan Wang, John Douglas Owens","doi":"10.1109/HPEC.2019.8916434","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916434","url":null,"abstract":"In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model and we evaluate our implementation in runtime and memory consumption compared with previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge 2019, demonstrating a geometric mean speedup over the 2018 champion of $3.84 times $.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125715982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient implementation of sparse matrix-sparse vector multiplication for large scale graph analytics","authors":"M. Serrano","doi":"10.1109/HPEC.2019.8916413","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916413","url":null,"abstract":"We developed a parallel algorithm to improve the cache behavior and overall performance for multiplication of sparse matrices with sparse vectors (SpMSpV), an operation used increasingly in large graph analytics, particularly dynamic graphs in social networks and homeland security applications. The proposed algorithm builds upon the two-phase approach of partitioning the multiplication into a scaling phase and an aggregation phase, to achieve more cache-friendly access patterns individually in each phase [6], [3]. However, to handle dynamic graphs and achieve better load balancing for parallel implementation, we use a combination of private and shared bins, with synchronized access to shared bins to exchange the product terms between the two phases. The new algorithm accumulates product terms in private bins for each thread. The algorithm then performs a bulk transfer between a private bin and a shared bin, when the private bin becomes full. Then results are aggregated from the shared bins. In addition, we employ heuristics to decide the best algorithm for SpMSpV based on the number of nonzeros involved in the operation. When the number of nonzeros is large, it may be better to perform the operation as SpMV (sparse matrix times dense vector) despite the added conversion cost. Also, if the number of nonzeros is low it is advantageous to use a simplified algorithm. 
We compared our algorithm with existing algorithms for SpMSpV, and our evaluation shows that execution time is reduced by several times when large graphs are considered.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128842052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA Decision Tree Classifier to Supervise a Communication SoC","authors":"Abdelrahman Elkanishy, Derrick T. Rivera, Abdel-Hameed A. Badawy, P. Furth, Z. Saifullah, Christopher P. Michael","doi":"10.1109/HPEC.2019.8916459","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916459","url":null,"abstract":"Wireless communication protocols are used in all smart devices and systems. This work is part of a proposed supervisory circuit that classifies the operation of a communication SoC, in particular, a Bluetooth (BT) SoC, at a low sampling frequency by monitoring the RF output power and input supply current. In essence, the goal is to inexpensively fabricate an RF envelope detector, power supply current monitor, and classifier on a low-cost, low-frequency integrated circuit. When the supervisory circuit detects abnormal behavior, it can shut off power to the BT chip. We extract simple descriptive features from the input and output power signals. Then, we train a machine learning (ML) model to classify the different BT operation modes, such as advertising and transmit/receive modes. In this work, we implemented the ML classifier and feature extraction on an FPGA with 100% matching with the corresponding MATLAB code. In the experimental setup, which included a function generator and an on-board ADC, errors in the FPGA-sampled values degraded the match slightly to 99.26%. 
Finally, a low-power ASIC is synthesized from the Verilog code in $0.18-mu mathrm{m}$ CMOS, with an estimated area of 0.0152 mm2 and power of $9.43 mu mathrm{W}$.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129426763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Resource Allocation for Parallel Reservoir Simulation","authors":"Suha N. Kayum, M. Rogowski","doi":"10.1109/HPEC.2019.8916420","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916420","url":null,"abstract":"Over the past few decades, the oil and gas (O&G) industry has become heavily dependent on parallel scientific computing. The turnaround time of such applications depends heavily on the amount of resources dedicated to the task. Increasing the number of compute processes for the same job tends to produce diminishing returns, and does not always guarantee an increase in performance of a justified impact. This point describes scalability limits, which this work aims to avoid surpassing. An algorithm is presented in which a reservoir simulation run automatically adjusts and finds the optimal resources, which leads to improved performance, and the efficient utilization of compute resources, resulting in significant cost savings.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115468105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Deep Neural Network Graph Challenge","authors":"J. Kepner, Simon Alford, V. Gadepally, Michael Jones, Lauren Milechin, Ryan A. Robinett, S. Samsi","doi":"10.1109/HPEC.2019.8916336","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916336","url":null,"abstract":"The MIT/IEEE/Amazon GraphChallenge.org encourages community approaches to developing new solutions for analyzing graphs and sparse data. Sparse AI analytics present unique scalability difficulties. The proposed Sparse Deep Neural Network (DNN) Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a challenge that is reflective of emerging sparse AI systems. The Sparse DNN Challenge is based on a mathematically well-defined DNN inference computation and can be implemented in any programming environment. Sparse DNN inference is amenable to both vertex-centric implementations and array-based implementations (e.g., using the GraphBLAS.org standard). The computations are simple enough that performance predictions can be made based on simple computing hardware models. The input data sets are derived from the MNIST handwritten letters. The surrounding I/O and verification provide the context for each sparse DNN inference that allows rigorous definition of both the input and the output. Furthermore, since the proposed sparse DNN challenge is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present day and future systems. Reference implementations have been implemented and their serial and parallel performance have been measured. 
Specifications, data, and software are publicly available at GraphChallenge.org.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115176451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Survey and Benchmarking of Machine Learning Accelerators","authors":"A. Reuther, P. Michaleas, Michael Jones, V. Gadepally, S. Samsi, J. Kepner","doi":"10.1109/HPEC.2019.8916327","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916327","url":null,"abstract":"Advances in multicore processors and accelerators have opened the flood gates to greater exploration and application of machine learning techniques to a variety of applications. These advances, along with breakdowns of several trends including Moore’s Law, have prompted an explosion of processors and accelerators that promise even greater computational and machine learning capabilities. These processors and accelerators are coming in many forms, from CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys the current state of these processors and accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. We then select and benchmark two commercially available low size, weight, and power (SWaP) accelerators as these processors are the most interesting for embedded and mobile machine learning inference applications that are most applicable to the DoD and other SWaP constrained users. 
We determine how they actually perform with real-world images and neural network models, compare those results to the reported performance and power consumption values and evaluate them against an Intel CPU that is used in some embedded applications.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114156484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TapirXLA: Embedding Fork-Join Parallelism into the XLA Compiler in TensorFlow Using Tapir","authors":"T. Schardl, S. Samsi","doi":"10.1109/HPEC.2019.8916312","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916312","url":null,"abstract":"This work introduces TapirXLA, a replacement for TensorFlow’s XLA compiler that embeds recursive fork-join parallelism into XLA’s low-level representation of code. Machine-learning applications employ a variety of technologies to improve performance, including compiler technology. But compilers in machine-learning frameworks lack a deep understanding of parallelism, causing them to lose performance by missing optimizations on parallel computation. This work studies how Tapir, a compiler intermediate representation (IR) that embeds parallelism into a mainstream compiler IR, can be incorporated into a compiler for machine learning to remedy this problem. TapirXLA modifies the XLA compiler in TensorFlow to employ the Tapir/LLVM compiler to optimize low-level parallel computation. TapirXLA encodes the parallelism within high-level TensorFlow operations using Tapir’s representation of fork-join parallelism. Furthermore, TapirXLA exposes to the compiler implementations of linear-algebra library routines whose parallel operations are encoded using Tapir’s representation. We compared the performance of TensorFlow using TapirXLA against TensorFlow using an unmodified XLA compiler. 
On four neural-network benchmarks, TapirXLA speeds up the parallel running time of the network by a geometric-mean multiplicative factor of 30% to 100%, across four CPU architectures.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130118481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}