{"title":"A Novel Inference Algorithm for Large Sparse Neural Network using Task Graph Parallelism","authors":"Dian-Lun Lin, Tsung-Wei Huang","doi":"10.1109/HPEC43674.2020.9286218","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286218","url":null,"abstract":"The ever-increasing size of modern deep neural network (DNN) architectures has put increasing strain on the hardware needed to implement them. Sparsified DNNs can greatly reduce memory costs and increase throughput over standard DNNs, if the loss of accuracy can be adequately controlled. However, sparse DNNs present unique computational challenges. Efficient model or data parallelism algorithms are extremely hard to design and implement. The recent effort MIT/IEEE/Amazon HPEC Graph Challenge has drawn attention to high-performance inference methods for large sparse DNNs. In this paper, we introduce SNIG, an efficient inference engine for large sparse DNN s. SNIG develops highly optimized inference kernels and leverages the power of CUDA Graphs to enable efficient decomposition of model and data parallelisms. Our decomposition strategy is flexible and scalable to different partitions of data volumes, model sizes, and GPU numbers. We have evaluated SNIG on the official benchmarks of HPEC Sparse DNN Challenge and demonstrated its promising performance scalable from a single GPU to multiple GPUs. Compared to the champion of the 2019 HPEC Sparse DNN Challenge, SNIG can finish all inference workloads using only a single GPU. At the largest DNN, which has more than 4 billion parameters across 1920 layers each of 65536 neurons, SNIG is up to 2.3x faster than a state-of-the-art baseline under a machine of 4 GPUs.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116624220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variable Precision Multiplication for Software-Based Neural Networks","authors":"Richa Singh, Thomas Conroy, P. Schaumont","doi":"10.1109/HPEC43674.2020.9286170","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286170","url":null,"abstract":"As the number of applications of neural networks continues to grow, so does the need to efficiently perform inference computations on highly constrained devices. In this paper, we propose a methodology to accelerate neural networks in software. We exploit the limited-precision requirements of typical neural networks by formulating recurring operations in a bit-slice computation format. Bit-slice computation ensures that every bit of an $M$-bit processor word contributes useful work even while computing a limited-precision n-bit (with $n$ < M) operation. This paper brings the following contributions. We first present an environment to efficiently create bitslice descriptions in software, by synthesizing them from Verilog. We then develop bitsliced designs of matrix multiplication and evaluate their performance. Our target is a small microcontroller, and we rely solely on software optimization. Our driving application is a neural network classifier for the MNIST database. Range-Based Linear Quantization in symmetric mode quantizes pre-trained 32-bit floating point weights and activation to low-precision data-widths. Experiments on RISC-V with varying levels of hardware-support show that for data-widths common to neural network applications, the bit-sliced code produces a speedup over traditional methods, which leads to faster and efficient inference without incurring significant loss in accuracy. For example, 8-bit matrix multiplications are sped up by a factor of 2.62× when compared with non-bitsliced rv32i ISA implementation with no hardware multiplier.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123280466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fault Tolerant Implementation for a Massively Parallel Seismic Framework","authors":"Suha N. Kayum, H. Alsalim, T. Tonellot, A. Momin","doi":"10.1109/HPEC43674.2020.9286143","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286143","url":null,"abstract":"An increase in the acquisition of seismic data volumes has resulted in applications processing seismic data running for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.asio for network communication is presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism in large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123625938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Work-Efficient Parallel Algorithms for Accurate Floating-Point Prefix Sums","authors":"Sean Fraser, Helen Xu, C. Leiserson","doi":"10.1109/HPEC43674.2020.9286240","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286240","url":null,"abstract":"Existing work-efficient parallel algorithms for floating-point prefix sums exhibit either good performance or good numerical accuracy, but not both. Consequently, prefix-sum algorithms cannot easily be used in scientific-computing applications that require both high performance and accuracy. We have designed and implemented two new algorithms, called CAST _BLK and PAIR_BLK, whose accuracy is significantly higher than that of the high-performing prefix-sum algorithm from the Problem Based Benchmark Suite, while running with comparable performance on modern multicore machines. Specifically, the root mean squared error of the PBBS code on a large array of uniformly distributed 64-bit floating-point numbers is 8 times higher than that of CAST _BLK and 5.8 times higher than that of PAIR_BLK. These two codes employ the PBBS three-stage strategy for performance, but they are designed to achieve high accuracy, both theoretically and in practice. A vectorization enhancement to these two scalar codes trades off a small amount of accuracy to match or outperform the PBBS code while still maintaining lower error.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132232247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GBTLX: A First Look","authors":"Sanil Rao, Anurag Kutuluru, Paul Brouwer, Scott McMillan, F. Franchetti","doi":"10.1109/HPEC43674.2020.9286231","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286231","url":null,"abstract":"We provide a first look at GBTLX, a code generator that translates graph processing programs written using the GraphBLAS Template Library (GBTL) into high-performance C programs that match the performance of hand-tuned implementations. GBTLX refactors code written using GBTL into problems that capture the signature of algorithms and solvers that capture the semantics (input/output behavior of algorithms. Users provide classes that implement these two aspects using standard GBTL functions and encapsulate the targeted algorithm. GBTLX then performs a sequence of inspection, code generation, and high-performance execution. First, the user code is traced while running with the original GBTL. Then, the trace is used to define the semantics and signature of the algorithm to be produced in code generation. The SPIRAL system is used to generate high-performance C code that implements the user-specified algorithm, specializing the code for algorithm and hardware-dependent optimizations. Finally, the user-provided GBTL-based implementation is replaced by the SPIRAL generated C code. For triangle counting and k-truss enumeration, the resulting executables provide performance equivalent to hand-tuned implementations, while the source code is maintainable as it only uses the C++ GBTL library.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130311105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs","authors":"Cade Brown, A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1109/HPEC43674.2020.9286214","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286214","url":null,"abstract":"Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use auto tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems and eigen-problem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparison with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127252203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient LP Rounding Scheme for Replica Placement","authors":"Zhihui Du, Sen Zhang, David A. Bader, Jingkun Hu","doi":"10.1109/HPEC43674.2020.9286163","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286163","url":null,"abstract":"Large fault-tolerant network systems with high Quality of Service (QoS) guarantee are critical in many real world applications and entail diverse replica placement problems. In this paper, the replica placement problem in terms of minimizing the replica placement cost subject to both QoS and fault-tolerant constraints is formulated as a binary integer linear programming problem first and then relaxed as a linear programming problem. Given the optimal fractional linear programming solution, we propose a two-step rounding algorithm to obtain its integer solution. In the first step, a half rounding algorithm is used to simplify the problem. In the second step, a cheapest amortized cost rounding algorithm uses a novel metric, named amortized cost, to make locally optimal rounding decision for the remaining vertices independently. Furthermore, a conflict resolution algorithm is presented to tackle the situations when different vertices make conflicting rounding decisions. Finally, we prove that the proposed two-step rounding algorithm has a 2-approximation ratio when the additional conflict cost meets a given constraint.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128880476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Half-Precision Floating-Point Formats for PageRank: Opportunities and Challenges","authors":"A. S. Molahosseini, H. Vandierendonck","doi":"10.1109/HPEC43674.2020.9286179","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286179","url":null,"abstract":"Mixed-precision computation has been proposed as a means to accelerate iterative algorithms as it can reduce the memory bandwidth and cache effectiveness. This paper aims for further memory traffic reduction via introducing new half-precision (16 bit) data formats customized for PageRank. We develop two formats. A first format builds on the observation that the exponents of about 99% of PageRank values are tightly distributed around the exponent of the inverse of the number of vertices. A second format builds on the observation that 6 exponent bits are sufficient to capture the full dynamic range of PageRank values. Our floating-point formats provide less precision compared to standard IEEE 754 formats, but sufficient dynamic range for PageRank. The experimental results on various size graphs show that the proposed formats can achieve an accuracy of le-4., which is an improvement over the state of the art. Due to random memory access patterns in the algorithm, performance improvements over our highly tuned baseline are 1.5% at best.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126922624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid-Pipelined Architecture for FPGA-based Binary Weight DenseNet with High Performance-Efficiency","authors":"Shihao Zeng, Yihua Huang","doi":"10.1109/HPEC43674.2020.9286185","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286185","url":null,"abstract":"The DenseNet achieves remarkable performance in various computer vision tasks with much fewer parameters and operations. However, there are few acceleration designs about the DenseNet, due to its dense-connectivity structure. In this paper, we apply the binary weight method on the DenseNet and then propose a hybrid-pipelined architecture for FPGA-based acceleration of the binary weight DenseNet, which can be stored entirely in a chip. To deal with the dense-connectivity, a reusable convolution unit is developed to support conv1×1 and conv3×3 efficiently. Moreover, a theoretical method of system parallelism is proposed to guide the top-level pipelined design for the maximum efficiency. To evaluate the proposed architecture, the binary weight DenseNet-100 model is trained on CIFAR10 dataset and then implemented on VX690T FPGA, at the cost of 4.18% accuracy loss. The experiment demonstrates that our architecture can achieve the throughput of 514 GOPS and 889 FPS at 200MHz, and the performance-efficiency is up to 62.4%, which outperforms the most related works.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121853513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}