{"title":"A Novel Inference Algorithm for Large Sparse Neural Network using Task Graph Parallelism","authors":"Dian-Lun Lin, Tsung-Wei Huang","doi":"10.1109/HPEC43674.2020.9286218","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286218","url":null,"abstract":"The ever-increasing size of modern deep neural network (DNN) architectures has put increasing strain on the hardware needed to implement them. Sparsified DNNs can greatly reduce memory costs and increase throughput over standard DNNs, if the loss of accuracy can be adequately controlled. However, sparse DNNs present unique computational challenges. Efficient model or data parallelism algorithms are extremely hard to design and implement. The recent effort MIT/IEEE/Amazon HPEC Graph Challenge has drawn attention to high-performance inference methods for large sparse DNNs. In this paper, we introduce SNIG, an efficient inference engine for large sparse DNN s. SNIG develops highly optimized inference kernels and leverages the power of CUDA Graphs to enable efficient decomposition of model and data parallelisms. Our decomposition strategy is flexible and scalable to different partitions of data volumes, model sizes, and GPU numbers. We have evaluated SNIG on the official benchmarks of HPEC Sparse DNN Challenge and demonstrated its promising performance scalable from a single GPU to multiple GPUs. Compared to the champion of the 2019 HPEC Sparse DNN Challenge, SNIG can finish all inference workloads using only a single GPU. At the largest DNN, which has more than 4 billion parameters across 1920 layers each of 65536 neurons, SNIG is up to 2.3x faster than a state-of-the-art baseline under a machine of 4 GPUs.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116624220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variable Precision Multiplication for Software-Based Neural Networks","authors":"Richa Singh, Thomas Conroy, P. Schaumont","doi":"10.1109/HPEC43674.2020.9286170","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286170","url":null,"abstract":"As the number of applications of neural networks continues to grow, so does the need to efficiently perform inference computations on highly constrained devices. In this paper, we propose a methodology to accelerate neural networks in software. We exploit the limited-precision requirements of typical neural networks by formulating recurring operations in a bit-slice computation format. Bit-slice computation ensures that every bit of an $M$-bit processor word contributes useful work even while computing a limited-precision n-bit (with $n$ < M) operation. This paper brings the following contributions. We first present an environment to efficiently create bitslice descriptions in software, by synthesizing them from Verilog. We then develop bitsliced designs of matrix multiplication and evaluate their performance. Our target is a small microcontroller, and we rely solely on software optimization. Our driving application is a neural network classifier for the MNIST database. Range-Based Linear Quantization in symmetric mode quantizes pre-trained 32-bit floating point weights and activation to low-precision data-widths. Experiments on RISC-V with varying levels of hardware-support show that for data-widths common to neural network applications, the bit-sliced code produces a speedup over traditional methods, which leads to faster and efficient inference without incurring significant loss in accuracy. For example, 8-bit matrix multiplications are sped up by a factor of 2.62× when compared with non-bitsliced rv32i ISA implementation with no hardware multiplier.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123280466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fault Tolerant Implementation for a Massively Parallel Seismic Framework","authors":"Suha N. Kayum, H. Alsalim, T. Tonellot, A. Momin","doi":"10.1109/HPEC43674.2020.9286143","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286143","url":null,"abstract":"An increase in the acquisition of seismic data volumes has resulted in applications processing seismic data running for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.asio for network communication is presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism in large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123625938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Work-Efficient Parallel Algorithms for Accurate Floating-Point Prefix Sums","authors":"Sean Fraser, Helen Xu, C. Leiserson","doi":"10.1109/HPEC43674.2020.9286240","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286240","url":null,"abstract":"Existing work-efficient parallel algorithms for floating-point prefix sums exhibit either good performance or good numerical accuracy, but not both. Consequently, prefix-sum algorithms cannot easily be used in scientific-computing applications that require both high performance and accuracy. We have designed and implemented two new algorithms, called CAST _BLK and PAIR_BLK, whose accuracy is significantly higher than that of the high-performing prefix-sum algorithm from the Problem Based Benchmark Suite, while running with comparable performance on modern multicore machines. Specifically, the root mean squared error of the PBBS code on a large array of uniformly distributed 64-bit floating-point numbers is 8 times higher than that of CAST _BLK and 5.8 times higher than that of PAIR_BLK. These two codes employ the PBBS three-stage strategy for performance, but they are designed to achieve high accuracy, both theoretically and in practice. A vectorization enhancement to these two scalar codes trades off a small amount of accuracy to match or outperform the PBBS code while still maintaining lower error.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132232247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GBTLX: A First Look","authors":"Sanil Rao, Anurag Kutuluru, Paul Brouwer, Scott McMillan, F. Franchetti","doi":"10.1109/HPEC43674.2020.9286231","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286231","url":null,"abstract":"We provide a first look at GBTLX, a code generator that translates graph processing programs written using the GraphBLAS Template Library (GBTL) into high-performance C programs that match the performance of hand-tuned implementations. GBTLX refactors code written using GBTL into problems that capture the signature of algorithms and solvers that capture the semantics (input/output behavior of algorithms. Users provide classes that implement these two aspects using standard GBTL functions and encapsulate the targeted algorithm. GBTLX then performs a sequence of inspection, code generation, and high-performance execution. First, the user code is traced while running with the original GBTL. Then, the trace is used to define the semantics and signature of the algorithm to be produced in code generation. The SPIRAL system is used to generate high-performance C code that implements the user-specified algorithm, specializing the code for algorithm and hardware-dependent optimizations. Finally, the user-provided GBTL-based implementation is replaced by the SPIRAL generated C code. For triangle counting and k-truss enumeration, the resulting executables provide performance equivalent to hand-tuned implementations, while the source code is maintainable as it only uses the C++ GBTL library.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130311105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs","authors":"Cade Brown, A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1109/HPEC43674.2020.9286214","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286214","url":null,"abstract":"Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use auto tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127252203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient LP Rounding Scheme for Replica Placement","authors":"Zhihui Du, Sen Zhang, David A. Bader, Jingkun Hu","doi":"10.1109/HPEC43674.2020.9286163","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286163","url":null,"abstract":"Large fault-tolerant network systems with high Quality of Service (QoS) guarantee are critical in many real world applications and entail diverse replica placement problems. In this paper, the replica placement problem in terms of minimizing the replica placement cost subject to both QoS and fault-tolerant constraints is formulated as a binary integer linear programming problem first and then relaxed as a linear programming problem. Given the optimal fractional linear programming solution, we propose a two-step rounding algorithm to obtain its integer solution. In the first step, a half rounding algorithm is used to simplify the problem. In the second step, a cheapest amortized cost rounding algorithm uses a novel metric, named amortized cost, to make locally optimal rounding decision for the remaining vertices independently. Furthermore, a conflict resolution algorithm is presented to tackle the situations when different vertices make conflicting rounding decisions. Finally, we prove that the proposed two-step rounding algorithm has a 2-approximation ratio when the additional conflict cost meets a given constraint.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128880476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Half-Precision Floating-Point Formats for PageRank: Opportunities and Challenges","authors":"A. S. Molahosseini, H. Vandierendonck","doi":"10.1109/HPEC43674.2020.9286179","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286179","url":null,"abstract":"Mixed-precision computation has been proposed as a means to accelerate iterative algorithms as it can reduce the memory bandwidth and cache effectiveness. This paper aims for further memory traffic reduction via introducing new half-precision (16 bit) data formats customized for PageRank. We develop two formats. A first format builds on the observation that the exponents of about 99% of PageRank values are tightly distributed around the exponent of the inverse of the number of vertices. A second format builds on the observation that 6 exponent bits are sufficient to capture the full dynamic range of PageRank values. Our floating-point formats provide less precision compared to standard IEEE 754 formats, but sufficient dynamic range for PageRank. The experimental results on various size graphs show that the proposed formats can achieve an accuracy of le-4., which is an improvement over the state of the art. Due to random memory access patterns in the algorithm, performance improvements over our highly tuned baseline are 1.5% at best.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126922624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid-Pipelined Architecture for FPGA-based Binary Weight DenseNet with High Performance-Efficiency","authors":"Shihao Zeng, Yihua Huang","doi":"10.1109/HPEC43674.2020.9286185","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286185","url":null,"abstract":"The DenseNet achieves remarkable performance in various computer vision tasks with much fewer parameters and operations. However, there are few acceleration designs about the DenseNet, due to its dense-connectivity structure. In this paper, we apply the binary weight method on the DenseNet and then propose a hybrid-pipelined architecture for FPGA-based acceleration of the binary weight DenseNet, which can be stored entirely in a chip. To deal with the dense-connectivity, a reusable convolution unit is developed to support conv1×1 and conv3×3 efficiently. Moreover, a theoretical method of system parallelism is proposed to guide the top-level pipelined design for the maximum efficiency. To evaluate the proposed architecture, the binary weight DenseNet-100 model is trained on CIFAR10 dataset and then implemented on VX690T FPGA, at the cost of 4.18% accuracy loss. The experiment demonstrates that our architecture can achieve the throughput of 514 GOPS and 889 FPS at 200MHz, and the performance-efficiency is up to 62.4%, which outperforms the most related works.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121853513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}