2017 IEEE High Performance Extreme Computing Conference (HPEC): Latest Publications

Investigating TI KeyStone II and quad-core ARM Cortex-A53 architectures for on-board space processing
B. Schwaller, B. Ramesh, A. George
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091094
Abstract: Future space missions require reliable architectures with higher performance and lower power consumption. Exploring new architectures worthy of undergoing the expensive and time-consuming process of radiation hardening is critical for this endeavor. Two such architectures are the Texas Instruments KeyStone II octal-core processor and the ARM® Cortex®-A53 (ARMv8) quad-core CPU. DSPs have been proven in prior space applications, and the KeyStone II, with its eight high-performance DSP cores, is under consideration for potential hardening for space. Meanwhile, a radiation-hardened quad-core ARM Cortex-A53 CPU is under development at Boeing under the NASA/AFRL High-Performance Spaceflight Computing initiative. In this paper, we optimize and evaluate the performance of batched 1D-FFTs, 2D-FFTs, and the Complex Ambiguity Function (CAF). We developed a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs. Our results for batched 1D-FFTs show that the performance per watt of the KeyStone II is 4.5 times better than that of the ARM Cortex-A53. For CAF, our results show that the KeyStone II is 1.7 times better.
Citations: 10
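The batched 1D-FFT workload benchmarked above can be sketched in a few lines. The batch size, transform length, and FLOP model below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Illustrative batched 1D-FFT workload: one independent transform per row.
batch, n = 64, 1024
rng = np.random.default_rng(0)
signals = rng.standard_normal((batch, n)) + 1j * rng.standard_normal((batch, n))
spectra = np.fft.fft(signals, axis=1)  # 64 independent 1024-point FFTs

# The paper's figure of merit is performance per watt; with the standard
# 5*n*log2(n) FLOP model per complex FFT, it would be estimated as:
flops = 5 * n * np.log2(n) * batch
# perf_per_watt = (flops / elapsed_seconds) / measured_watts
```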
Towards numerical benchmark for half-precision floating point arithmetic
P. Luszczek, J. Kurzak, I. Yamazaki, J. Dongarra
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091031
Abstract: With the NVIDIA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers, also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating-point standards and existing HPC benchmarks. The discussion will focus on the performance and numerical-stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms.
Citations: 11
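A toy experiment illustrates the kind of numerical-stability issue such a benchmark must capture: with FP16's 11-bit significand, a naive running sum of ones saturates at 2048. This is a generic IEEE half-precision property, not an example from the talk.

```python
import numpy as np

def running_sum(values, dtype):
    """Accumulate a sum entirely in the given floating-point precision."""
    total = dtype(0)
    for v in values:
        total = dtype(total + dtype(v))
    return float(total)

# In FP16 the spacing between representable numbers at 2048 is 2, so
# 2048 + 1 rounds back to 2048 and the sum stops growing.
fp16_total = running_sum([1.0] * 4096, np.float16)
fp64_total = running_sum([1.0] * 4096, np.float64)
```

Here `fp16_total` stalls at 2048.0 while `fp64_total` reaches the exact 4096.0.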
Application of convolutional neural networks on Intel® Xeon® processor with integrated FPGA
Philip Colangelo, Enno Lübbers, Randy Huang, M. Margala, Kevin Nealis
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091025
Abstract: Intel®'s Xeon® processor with integrated FPGA is a new research platform that provides all the capabilities of a Broadwell Xeon processor with the added functionality of an Arria 10 FPGA in the same package. In this paper, we present an implementation on this platform to showcase the ability and effectiveness of utilizing both hardware architectures to accelerate a convolution-based neural network (CNN). We choose a network topology that uses binary weights and low-precision activation data to take advantage of the customizable fabric provided by the FPGA. Further, compared to standard multiply-accumulate CNNs, binary-weighted networks (BWNs) reduce the amount of computation by eliminating the need for multiplication, with little to no degradation in classification accuracy. Coupling Intel's Open Programmable Acceleration Engine (OPAE) with Caffe provides a robust framework that served as the foundation for our application. Because the convolution primitives take the most computation in our network, we offload the feature and weight data to a customized binary convolution accelerator loaded in the FPGA. Employing the low-latency Quick Path Interconnect (QPI) that bridges the Broadwell Xeon processor and Arria 10 FPGA, we can carry out fine-grained offloads while avoiding bandwidth bottlenecks. An initial proof-of-concept design that utilizes only a portion of the FPGA core logic demonstrates that using the Xeon processor and FPGA together improves throughput by 2× on some layers and by 1.3× overall.
Citations: 13
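The multiplication-free arithmetic behind binary-weighted layers can be sketched as follows; this is a generic illustration of the idea, not the paper's accelerator design.

```python
import numpy as np

def binary_dot(x, w_sign):
    """Dot product against {-1, +1} weights using only additions and
    subtractions: add activations where the weight is +1, subtract
    where it is -1. No multiplications are needed."""
    return x[w_sign > 0].sum() - x[w_sign < 0].sum()

x = np.array([0.5, -1.0, 2.0, 0.25])  # example activations
w = np.array([1, -1, 1, -1])          # binary weights
```

A usage check: `binary_dot(x, w)` matches `np.dot(x, w)` exactly, since replacing each multiply-by-±1 with an add or subtract is algebraically identical.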
WCET analysis of the shared data cache in integrated CPU-GPU architectures
Y. Huangfu, Wei Zhang
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091059
Abstract: By taking advantage of both the CPU and GPU, as well as the shared DRAM and cache, the integrated CPU-GPU architecture has the potential to boost performance for a variety of applications, including real-time applications. However, before it can be applied to hard real-time and safety-critical applications, the time-predictability of the integrated CPU-GPU architecture needs to be studied and improved. In this work, we study the shared data Last-Level Cache (LLC) in the integrated CPU-GPU architecture and propose an access-interval-based method to improve the time-predictability of the LLC. The results show that the proposed technique can effectively improve the accuracy of the miss-rate estimation in the LLC. We also find that the improved LLC miss-rate estimates can be used to further improve the WCET estimates of GPU kernels running on such an architecture.
Citations: 1
Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi
A. Haidar, Heike Jagode, A. YarKhan, Phil Vaccaro, S. Tomov, J. Dongarra
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091085
Abstract: The emergence of power efficiency as a primary constraint in processor and system designs poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, in particular for peta- and exascale systems. Understanding and improving the energy efficiency of numerical simulation therefore becomes crucial. We present a detailed study of controlling power usage, exploring how different power caps affect the performance of numerical algorithms with different computational intensities, and determining the impact on and correlation with the performance of scientific applications. Our analysis is performed using a set of representative kernels as well as many widely used scientific benchmarks. We quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap toward energy-efficient computing algorithms.
Citations: 16
Fast linear algebra-based triangle counting with KokkosKernels
Michael M. Wolf, Mehmet Deveci, Jonathan W. Berry, S. Hammond, S. Rajamanickam
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091043
Abstract: Triangle counting serves as a key building block for a set of important graph algorithms in network science. In this paper, we address the IEEE HPEC Static Graph Challenge problem of triangle counting, focusing on obtaining the best parallel performance on a single multicore node. Our implementation uses a linear algebra-based approach to triangle counting that has grown out of work related to our miniTri data analytics miniapplication [1] and our efforts to pose graph algorithms in the language of linear algebra. We leverage KokkosKernels to implement this approach efficiently on multicore architectures. Our performance results are competitive with the fastest known graph traversal-based approaches and are significantly faster than the Graph Challenge reference implementations: up to 670,000 times faster than the C++ reference and 10,000 times faster than the Python reference on a single Intel Haswell node.
Citations: 61
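A dense NumPy sketch of the linear-algebra formulation, using the common lower-triangular variant of the identity (the paper's KokkosKernels implementation operates on sparse matrices with parallel kernels; this only shows the underlying math):

```python
import numpy as np

def triangle_count(A):
    """Count triangles of an undirected graph from its adjacency matrix.

    With L = tril(A), every triangle {a < b < c} is counted exactly once:
    (L @ L)[c, a] counts wedges c > b > a, and the elementwise product
    with L keeps only the wedges closed by an edge (c, a).
    """
    L = np.tril(A)
    return int(((L @ L) * L).sum())

# 4-clique adjacency matrix: contains C(4,3) = 4 triangles.
K4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
```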
Triangle counting for scale-free graphs at scale in distributed memory
R. Pearce
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091051
Abstract: Triangle counting has long been a challenge problem for sparse graphs containing high-degree "hub" vertices that exist in many real-world scenarios. These high-degree vertices create a quadratic number of wedges, or 2-edge paths, which for brute-force algorithms require closure checking, or wedge checks. Our work-in-progress builds on existing heuristics for pruning the number of wedge checks by ordering based on degree and other simple metrics. Such heuristics can dramatically reduce the number of required wedge checks for exact triangle counting for both real and synthetic scale-free graphs. Our triangle counting algorithm is implemented using HavoqGT, an asynchronous vertex-centric graph analytics framework for distributed memory. We present a brief experimental evaluation on two large real scale-free graphs, a 128B-edge web graph and a 1.4B-edge Twitter follower graph, and a weak-scaling study on synthetic Graph500 RMAT graphs up to 274.9 billion edges.
Citations: 57
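The degree-ordering heuristic described above can be sketched as a minimal serial version (HavoqGT's distributed, asynchronous implementation is far more involved):

```python
from collections import defaultdict
from itertools import combinations

def count_triangles(edges):
    """Count triangles, generating wedges only at each edge's lower-ranked
    endpoint (rank = degree, ties broken by vertex id). High-degree "hub"
    vertices rank last, so they generate few outgoing wedges to check."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda v: (deg[v], v)
    out = defaultdict(set)  # edges oriented from lower to higher rank
    for u, v in edges:
        lo, hi = (u, v) if rank(u) < rank(v) else (v, u)
        out[lo].add(hi)
    triangles = 0
    for u in list(out):
        for v, w in combinations(out[u], 2):
            # Wedge check: is the closing edge between v and w present?
            if w in out[v] or v in out[w]:
                triangles += 1
    return triangles
```

On a star graph the hub generates no wedges at all, which is exactly the pruning effect the heuristic targets.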
Sparse matrix assembly on the GPU through multiplication patterns
Rhaleb Zayer, M. Steinberger, H. Seidel
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091057
Abstract: The numerical treatment of variational problems gives rise to large sparse matrices, which are typically assembled by coalescing elementary contributions. As the explicit matrix form is required by numerical solvers, the assembly step can be a potential bottleneck, especially in implicit and time-dependent settings where considerable updates are needed. On standard HPC platforms, this process can be vectorized by taking advantage of additional mesh-querying data structures. However, on graphics hardware, vectorization is inhibited by limited memory resources. In this paper, we propose a lean unstructured mesh representation, which allows casting the assembly problem as a sparse matrix-matrix multiplication. We demonstrate how the global graph connectivity of the assembled matrix can be captured through basic linear algebra operations and show how local interactions between nodes/degrees of freedom within an element can be encoded by means of a concise representation, action maps. These ideas not only reduce the memory storage requirements but also cut down on the bulk of data that needs to be moved from global storage to the compute units, which is crucial on parallel computing hardware, and in particular on the GPU. Furthermore, we analyze the effect of mesh memory layout on the assembly performance.
Citations: 9
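The idea of casting assembly as sparse matrix products can be sketched with SciPy. The paper's GPU version uses a lean mesh representation and action maps; the element list, local matrices, and helper name below are illustrative only.

```python
import numpy as np
import scipy.sparse as sp

def assemble(n_dofs, elements, local_mats):
    """Assemble a global matrix as P.T @ blockdiag(K_e) @ P, where P
    gathers global degrees of freedom into element-local slots."""
    rows, cols, offset = [], [], 0
    for elem in elements:
        for i, g in enumerate(elem):
            rows.append(offset + i)
            cols.append(g)
        offset += len(elem)
    P = sp.csr_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(offset, n_dofs))
    K = sp.block_diag([sp.csr_matrix(k) for k in local_mats], format="csr")
    return (P.T @ K @ P).tocsr()

# Two 1D linear elements sharing node 1 yield the classic tridiagonal
# stiffness matrix: shared contributions coalesce at the middle node.
k_e = [[1.0, -1.0], [-1.0, 1.0]]
A = assemble(3, [(0, 1), (1, 2)], [k_e, k_e])
```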
Ultra-high fidelity radio frequency propagation modeling using distributed high performance graphical processing units: A simulator for multi-element non-stationary antenna systems
Mark D. Barnell, Nathan Stokes, Jason Steeger, Jessie Grabowski
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091082
Abstract: A new distributed, high-performance graphical processing framework that simulates complex radio frequency (RF) propagation has been developed and demonstrated. The approach uses an advanced computer architecture and an intensive multi-core system to enable high-performance data analysis at the fidelity necessary to design and develop modern sensor systems. This widely applicable simulation and modeling technology aids in the design and development of state-of-the-art systems with complex waveforms and more advanced downstream exploitation techniques, e.g., systems with arbitrary RF waveforms, higher RF bandwidths, and increasing resolution. Recent breakthroughs in computing hardware, software, systems, and applications have enabled these concepts to be tested and demonstrated in a large variety of environments and early in the design cycle. Improvements in simulation accuracy and simulation timescales have been made that immediately increase the value to the end user. A near-analytic RF propagation model increased the computational need by orders of magnitude, and also increased the required numerical precision. New general-purpose graphics processing units (GPGPUs) provided the capability to simulate the propagation effects and model them with the necessary information dependence and floating-point mathematics where performance matters. The relative performance improvement between the baseline MATLAB® parallelized simulation and the equivalent GPU-based simulation, using 12 NVIDIA Tesla K20m GPUs on the Offspring High-Performance Computer (HPC) with the AirWASP© framework, reduced simulation and modeling time from 16.5 days to less than 1 day.
Citations: 0
A cloud-based brain connectivity analysis tool
L. Brattain, Mihnea Bulugioiu, Adam Brewster, Mark Hernandez, Heejin Choi, T. Ku, Kwanghun Chung, V. Gadepally
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091080
Abstract: With advances in high-throughput brain imaging at the cellular and sub-cellular level, there is growing demand for platforms that can support high-performance, large-scale brain data processing and analysis. In this paper, we present a novel pipeline that combines Accumulo, D4M, geohashing, and parallel programming to manage large-scale neuron connectivity graphs in a cloud environment. Our brain connectivity graph is represented using vertices (fiber start/end nodes), edges (fiber tracks), and the 3D coordinates of the fiber tracks. For optimal performance, we take the hybrid approach of storing vertices and edges in Accumulo and saving the fiber track 3D coordinates in flat files. Accumulo database operations offer low latency on sparse queries, while flat files offer high throughput for storing, querying, and analyzing bulk data. We evaluated our pipeline using 250 gigabytes of mouse neuron connectivity data. Benchmarking experiments on retrieving vertices and edges from Accumulo demonstrate that we can achieve a 1-2 order of magnitude speedup in retrieval time compared to the same operation on traditional flat files. The implementation of graph analytics such as Breadth-First Search using Accumulo and D4M offers consistently good performance regardless of data size and density, and thus is scalable to very large datasets. Indexing of neuron subvolumes is simple and logical with geohashing-based binary-tree encoding. This hybrid data management backend drives an interactive web-based 3D graphical user interface, where users can examine the 3D connectivity map in a Google Maps-like viewer. Our pipeline is scalable and extensible to other data modalities.
Citations: 1
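The geohashing-based binary-tree encoding of 3D coordinates can be illustrated with a Morton (bit-interleaving) code, a common way to realize such an octree index; this sketch shows the general technique, not necessarily the pipeline's exact encoding.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of three integer coordinates into one code.
    Spatially nearby points share long code prefixes, so the code acts
    as a binary-tree (octree) index over 3D subvolumes."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code
```

Range queries over a subvolume then reduce to scans over a contiguous range of codes, which maps naturally onto Accumulo's sorted-key storage model.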