Machine Learning Algorithm Performance on the Lucata Computer
P. Springer, Thomas Schibler, Géraud Krawezik, J. Lightholder, P. Kogge
2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: 10.1109/HPEC43674.2020.9286158

Abstract: A new parallel computing paradigm has recently become available, one that combines a PIM (processor-in-memory) architecture with many lightweight threads, where each thread migrates automatically to the memory it uses. Our effort focuses on producing performance gains on this architecture for a key machine learning algorithm, Random Forest, that are at least linear in the number of cores. Beyond that, we show that a data distribution that groups test samples and trees by feature improves run times by a factor of more than twice the number of cores in the machine.

Dynamic Computational Diversity with Multi-Radix Logic and Memory
P. Flikkema, James Palmer, Tolga Yalçin, B. Cambou
2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: 10.1109/HPEC43674.2020.9286255

Abstract: Today's computing systems are highly vulnerable to attacks, in large part because nearly all computers belong to a hardware and software monoculture within their market, industry, or sector. This is of special concern in the mission-critical networked systems upon which our civil, industrial, and defense infrastructures increasingly rely. One approach to this challenge is to endow these systems with dynamic computational diversity, wherein each processor assumes a sequence of unique variants and executes only machine code encoded for the variant that is active during a given time interval. The variants are drawn from a very large set, all adhering to a computational diversity architecture that is based on an underlying instruction set architecture (ISA). Any population of machines belonging to a specific diversity architecture therefore consists of a temporally dynamic set of essentially unique variants, while the underlying ISA still permits a common development toolchain for the diversity architecture. Our approach is hardware-centric, relying on the rapidly developing microelectronics technologies of ternary computing, resistive RAM (ReRAM), and physical unclonable functions. This paper describes our ongoing work on dynamic computational diversity, which targets the principled design of a secure processor for embedded applications.

{"title":"Profiling and Optimization of CT Reconstruction on Nvidia Quadro GV100","authors":"S. Dwivedi, Andreas Heumann","doi":"10.1109/HPEC43674.2020.9286223","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286223","url":null,"abstract":"Computed Tomography (CT) Imaging is a widely used technique for medical and industrial applications. Iterative reconstruction algorithms are desired for improved reconstructed image quality and lower dose, but its computational requirements limit its practical usage. Reconstruction toolkit (RTK) is a package of open source GPU accelerated algorithms for CBCT (cone beam computed tomography). GPU based iterative algorithms gives immense acceleration, but it may not be optimized to use the GPU resources efficiently. Nvidia has released several profilers (Nsight-systems, Nsight-compute) to analyze the GPU implementation of an algorithm from compute utilization and memory efficiency perspective. This paper profiles and analyzes the GPU implementation of iterative FDK algorithm in RTK and optimizes it for computation and memory usage on a Quadro GV100 GPU with 32 GB of memory and over 5000 cuda cores. RTK based GPU accelerated iterative FDK when applied on a 4 byte per pixel input projection dataset of size 1.1 GB (512×512×1024) for 20 iterations, to reconstruct a volume of size 440 MB (512×512×441) with 4 byte per pixel, resulted in total runtime of ~11.2 seconds per iteration. Optimized RTK based iterative FDK presented in this paper took ~1.3 seconds per iteration.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122253646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Throughput Image Alignment for Connectomics using Frugal Snap Judgments","authors":"Tim Kaler, Brian Wheatman, Sarah Wooders","doi":"10.1109/HPEC43674.2020.9286243","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286243","url":null,"abstract":"The accuracy and computational efficiency of image alignment directly affects the advancement of connectomics, a field which seeks to understand the structure of the brain through electron microscopy. We introduce the algorithms Quilter and Stacker that are designed to perform 2D and 3D alignment respectively on petabyte-scale data sets from connectomics. Quilter and Stacker are efficient, scalable, and can run on hardware ranging from a researcher's laptop to a large computing cluster. On a single 18-core cloud machine each algorithm achieves throughputs of more than 1 TB/hr; when combined the algorithms produce an end-to-end alignment pipeline that processes data at a rate of 0.82 TB/hr - an over 10x improvement from previous systems. This efficiency comes from both traditional optimizations and from the use of “Frugal Snap Judgments” to judiciously exploit performance-accuracy trade-offs. A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was evaluated using three datasets whose size ranged from 550 GB to 38 TB. The full alignment pipeline achieved a throughput of 0.6-0.8 TB/hr and 1.4-1.5 TB/hr on an 18-core and 112-core shared-memory multicore, respectively. On a supercomputing cluster with 200 nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr. We introduce the algorithms Quilter and Stacker that are designed to perform 2D and 3D alignment respectively on petabyte-scale data sets from connectomics. Quilter and Stacker are efficient, scalable, and can run on hardware ranging from a researcher's laptop to a large computing cluster. On a single 18-core cloud machine each algorithm achieves throughputs of more than 1 TB/hr; when combined the algorithms produce an end-to-end alignment pipeline that processes data at a rate of 0.82 TB/hr - an over 10x improvement from previous systems. This efficiency comes from both traditional optimizations and from the use of “Frugal Snap Judgments” to judiciously exploit performance-accuracy trade-offs. A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was evaluated using three datasets whose size ranged from 550 GB to 38 TB. The full alignment pipeline achieved a throughput of 0.6-0.8 TB/hr and 1.4-1.5 TB/hr on an 18-core and 112-core shared-memory multicore, respectively. On a supercomputing cluster with 200 nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr. A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was evaluated using three datasets whose size ranged from 550 GB to 38 TB. The full alignment pipeline achieved a throughput of 0.6-0.8 TB/hr and 1.4-1.5 TB/hr on an 18-core and 112-core shared-memory multicore, respectively. 
On a supercomputing cluster with 200 nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128209177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
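For a sense of scale, the reported throughputs imply roughly the following end-to-end times for the largest (38 TB) dataset; this is simple arithmetic on the abstract's own numbers, not additional measurements:

    # Rough end-to-end times implied by the reported throughputs for the 38 TB dataset.
    dataset_tb = 38
    throughputs = [
        ("18-core multicore", 0.7),    # midpoint of 0.6-0.8 TB/hr
        ("112-core multicore", 1.45),  # midpoint of 1.4-1.5 TB/hr
        ("200-node cluster", 21.4),
    ]
    for label, tb_per_hr in throughputs:
        print(f"{label}: ~{dataset_tb / tb_per_hr:.1f} hours")
    # -> roughly 54, 26, and 1.8 hours, respectively
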
A Scalable Architecture for CNN Accelerators Leveraging High-Performance Memories
Maarten Hattink, G. D. Guglielmo, L. Carloni, K. Bergman
2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: 10.1109/HPEC43674.2020.9286162

Abstract: As FPGA-based accelerators become ubiquitous and more powerful, the demand for integration with High-Performance Memory (HPM) grows. Although HPMs offer much greater bandwidth than standard DDR4 DRAM, they introduce new design challenges such as increased latency and a higher bandwidth mismatch between memory and FPGA cores. This paper presents a scalable architecture for convolutional neural network accelerators conceived specifically to address these challenges and make full use of the memory's high bandwidth. The accelerator, which was designed using high-level synthesis, is highly configurable. The intrinsic parallelism of its architecture allows near-perfect scaling up to saturating the available memory bandwidth.

{"title":"Evaluating SEU Resilience of CNNs with Fault Injection","authors":"Evan T. Kain, Tyler M. Lovelly, A. George","doi":"10.1109/HPEC43674.2020.9286168","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286168","url":null,"abstract":"Convolutional neural networks (CNNs) are quickly growing as a solution for advanced image processing in many mission-critical high-performance and embedded computing systems ranging from supercomputers and data centers to aircraft and spacecraft. However, the systems running CNNs are increasingly susceptible to single-event upsets (SEUs) which are bit flips that result from charged particle strikes. To better understand how to mitigate the effects of SEUs on CNNs, the behavior of CNNs when exposed to SEUs must be better understood. Software fault-injection tools allow us to emulate SEUs to analyze the effects of various CNN architectures and input data features on overall resilience. Fault injection on three combinations of CNNs and datasets yielded insights into their behavior. When focusing on a threshold of 1% error in classification accuracy, more complex CNNs tended to be less resilient to SEUs, and easier classification tasks on well-clustered input data were more resilient to SEUs. Overall, the number of bits flipped to reach this threshold ranged from 20 to 3,790 bits. Results demonstrate that CNNs are highly resilient to SEUs, but the complexity of the CNN and difficulty of the classification task will decrease that resilience.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133298381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arithmetic and Boolean Secret Sharing MPC on FPGAs in the Data Center
Rushi Patel, Pierre-Francois W. Wolfe, Robert Munafo, Mayank Varia, Martin C. Herbordt
2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: 10.1109/HPEC43674.2020.9286159

Abstract: Multi-Party Computation (MPC) is an important technique for enabling computation over confidential data from several sources. The public cloud provides a unique opportunity to run MPC in a low-latency environment, and the adoption of Field Programmable Gate Array (FPGA) hardware allows both MPC acceleration and access to the low-latency, high-bandwidth communication networks that substantially improve the performance of MPC applications. In this work, we show how designing arithmetic and Boolean Multi-Party Computation gates for FPGAs in the cloud improves on current MPC offerings and eases their use in applications such as machine learning. We build our FPGA MPC on the Secret Sharing MPC scheme first designed by Araki et al. [1] and compare it with approaches that use Garbled Circuits for MPC. We show that Secret Sharing MPC makes better use of cloud resources, specifically FPGA acceleration, than Garbled Circuits and uses at least 10x fewer compute resources than the original CPU-based design.

Chip-to-chip Optical Data Communications using Polarization Division Multiplexing
D. Ivanovich, Chenfeng Zhao, Xuan Zhang, R. Chamberlain, A. Deliwala, V. Gruev
2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: 10.1109/HPEC43674.2020.9286227

Abstract: Short-distance optical communication is challenging in significant part because effective systems are expensive to construct. We describe an optical data communication system that is designed to operate over very short distances (neighboring chips on a board) and is compatible with traditional CMOS fabrication, substantially decreasing the cost to build relative to previous approaches. Polarization division multiplexing is exploited to increase the achievable data rates.

{"title":"Post Quantum Cryptography(PQC) - An overview: (Invited Paper)","authors":"M. Kumar, P. Pattnaik","doi":"10.1109/HPEC43674.2020.9286147","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286147","url":null,"abstract":"We discuss the Post Quantum Cryptography algorithms for key establishment under consideration by NIST for standardization. Three of these, Crystals- Kyber, Classic McEliece and Supersingular Isogeny based Key Encapsulation (SIKE), are representatives of the three classes of hard problems underlying the security of almost all 69 candidate algorithms accepted by NIST for consideration in round 1 of evaluation. For each algorithm, we briefly describe the hard problem underlying the algorithm's cryptographic strength, the algebraic structure i.e., the groups or finite fields, underlying the computations, the basic computations performed in these algorithms, the algorithm itself, and the performance considerations for efficient implementation of the basic algorithm on conventional many-core processors. For Crystals- Kyber and SIKE, we will discuss the potential solutions to improve their performance on many-core processors.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114874666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels","authors":"A. Cabrera, R. Chamberlain","doi":"10.1109/HPEC43674.2020.9286221","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286221","url":null,"abstract":"The use of FPGAs in heterogeneous systems are valuable because they can be used to architect custom hardware to accelerate a particular application or domain. However, they are notoriously difficult to program. The development of high level synthesis tools like OpenCL make FPGA development more accessible, but not without its own challenges. The synthesized hardware comes from a description that is semantically closer to the application, which leaves the underlying hardware implementation unclear. Moreover, the interaction of the hardware tuning knobs exposed using a higher level specification increases the challenge of finding the most performant hardware configuration. In this work, we address these aforementioned challenges by describing how to approach the design space, using both information from the literature as well as by describing a methodology to better visualize the resulting hardware from the high level specification. Finally, we present an empirical evaluation of the impact of vectorizing data types as a tunable knob and its interaction among other coarse-grained hardware knobs.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127650867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}