2017 IEEE High Performance Extreme Computing Conference (HPEC): Latest Publications

Investigating TI KeyStone II and quad-core ARM Cortex-A53 architectures for on-board space processing
B. Schwaller, B. Ramesh, A. George
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091094
Abstract: Future space missions require reliable architectures with higher performance and lower power consumption. Exploring new architectures worthy of undergoing the expensive and time-consuming process of radiation hardening is critical for this endeavor. Two such architectures are the Texas Instruments KeyStone II octal-core processor and the ARM® Cortex®-A53 (ARMv8) quad-core CPU. DSPs have been proven in prior space applications, and the KeyStone II, with its eight high-performance DSP cores, is under consideration for potential hardening for space. Meanwhile, a radiation-hardened quad-core ARM Cortex-A53 CPU is under development at Boeing under the NASA/AFRL High-Performance Spaceflight Computing initiative. In this paper, we optimize and evaluate the performance of batched 1D-FFTs, 2D-FFTs, and the Complex Ambiguity Function (CAF). We developed a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs. Our results for batched 1D-FFTs show that the performance per watt of the KeyStone II is 4.5 times better than that of the ARM Cortex-A53. For CAF, our results show that the KeyStone II is 1.7 times better.
Citations: 10
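The batched 1D-FFT workload benchmarked above can be sketched in a few lines. The batch size, transform length, and FLOP model below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Illustrative batched 1D-FFT workload: one independent transform per row.
batch, n = 64, 1024
rng = np.random.default_rng(0)
signals = rng.standard_normal((batch, n)) + 1j * rng.standard_normal((batch, n))
spectra = np.fft.fft(signals, axis=1)  # 64 independent 1024-point FFTs

# The paper's figure of merit is performance per watt; with the standard
# 5*n*log2(n) FLOP model per complex FFT, it would be estimated as:
flops = 5 * n * np.log2(n) * batch
# perf_per_watt = (flops / elapsed_seconds) / measured_watts
```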
Towards numerical benchmark for half-precision floating point arithmetic
P. Luszczek, J. Kurzak, I. Yamazaki, J. Dongarra
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091031
Abstract: With the NVIDIA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers, also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating-point standards and existing HPC benchmarks. The discussion will focus on the performance and numerical-stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms.
Citations: 11
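A toy experiment illustrates the kind of numerical-stability issue such a benchmark must capture: with FP16's 11-bit significand, a naive running sum of ones saturates at 2048. This is a generic IEEE half-precision property, not an example from the talk.

```python
import numpy as np

def running_sum(values, dtype):
    """Accumulate a sum entirely in the given floating-point precision."""
    total = dtype(0)
    for v in values:
        total = dtype(total + dtype(v))
    return float(total)

# In FP16 the spacing between representable numbers at 2048 is 2, so
# 2048 + 1 rounds back to 2048 and the sum stops growing.
fp16_total = running_sum([1.0] * 4096, np.float16)
fp64_total = running_sum([1.0] * 4096, np.float64)
```

Here `fp16_total` stalls at 2048.0 while `fp64_total` reaches the exact 4096.0.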
Application of convolutional neural networks on Intel® Xeon® processor with integrated FPGA
Philip Colangelo, Enno Lübbers, Randy Huang, M. Margala, Kevin Nealis
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091025
Abstract: Intel®'s Xeon® processor with integrated FPGA is a new research platform that provides all the capabilities of a Broadwell Xeon processor with the added functionality of an Arria 10 FPGA in the same package. In this paper, we present an implementation on this platform to showcase the ability and effectiveness of utilizing both hardware architectures to accelerate a convolution-based neural network (CNN). We choose a network topology that uses binary weights and low-precision activation data to take advantage of the customizable fabric provided by the FPGA. Further, compared to standard multiply-accumulate CNNs, binary-weighted networks (BWNs) reduce the amount of computation by eliminating the need for multiplication, with little to no degradation in classification accuracy. Coupling Intel's Open Programmable Acceleration Engine (OPAE) with Caffe provides a robust framework that served as the foundation for our application. Because the convolution primitives take the most computation in our network, we offload the feature and weight data to a customized binary convolution accelerator loaded in the FPGA. Employing the low-latency Quick Path Interconnect (QPI) that bridges the Broadwell Xeon processor and Arria 10 FPGA, we can carry out fine-grained offloads while avoiding bandwidth bottlenecks. An initial proof-of-concept design that utilizes only a portion of the FPGA core logic demonstrates that using the Xeon processor and FPGA together improves throughput by 2× on some layers and by 1.3× overall.
Citations: 13
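The multiplication-free arithmetic behind binary-weighted layers can be sketched as follows; this is a generic illustration of the idea, not the paper's accelerator design.

```python
import numpy as np

def binary_dot(x, w_sign):
    """Dot product against {-1, +1} weights using only additions and
    subtractions: add activations where the weight is +1, subtract
    where it is -1. No multiplications are needed."""
    return x[w_sign > 0].sum() - x[w_sign < 0].sum()

x = np.array([0.5, -1.0, 2.0, 0.25])  # example activations
w = np.array([1, -1, 1, -1])          # binary weights
```

A usage check: `binary_dot(x, w)` matches `np.dot(x, w)` exactly, since replacing each multiply-by-±1 with an add or subtract is algebraically identical.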
WCET analysis of the shared data cache in integrated CPU-GPU architectures
Y. Huangfu, Wei Zhang
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091059
Abstract: By taking advantage of both the CPU and GPU, as well as the shared DRAM and cache, the integrated CPU-GPU architecture has the potential to boost performance for a variety of applications, including real-time applications. However, before it can be applied to hard real-time and safety-critical applications, the time-predictability of the integrated CPU-GPU architecture needs to be studied and improved. In this work, we study the shared data Last-Level Cache (LLC) in the integrated CPU-GPU architecture and propose an access-interval-based method to improve the time-predictability of the LLC. The results show that the proposed technique can effectively improve the accuracy of the miss-rate estimation in the LLC. We also find that the improved LLC miss-rate estimates can be used to further improve the WCET estimates of GPU kernels running on such an architecture.
Citations: 1
Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi
A. Haidar, Heike Jagode, A. YarKhan, Phil Vaccaro, S. Tomov, J. Dongarra
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091085
Abstract: The emergence of power efficiency as a primary constraint in processor and system designs poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, in particular for peta- and exascale systems. Understanding and improving the energy efficiency of numerical simulation therefore becomes crucial. We present a detailed study of controlling power usage, exploring how different power caps affect the performance of numerical algorithms with different computational intensities, and determining the impact on and correlation with the performance of scientific applications. Our analysis is performed using a set of representative kernels as well as many widely used scientific benchmarks. We quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap toward energy-efficient computing algorithms.
Citations: 16
Fast linear algebra-based triangle counting with KokkosKernels
Michael M. Wolf, Mehmet Deveci, Jonathan W. Berry, S. Hammond, S. Rajamanickam
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091043
Abstract: Triangle counting serves as a key building block for a set of important graph algorithms in network science. In this paper, we address the IEEE HPEC Static Graph Challenge problem of triangle counting, focusing on obtaining the best parallel performance on a single multicore node. Our implementation uses a linear algebra-based approach to triangle counting that has grown out of work related to our miniTri data analytics miniapplication [1] and our efforts to pose graph algorithms in the language of linear algebra. We leverage KokkosKernels to implement this approach efficiently on multicore architectures. Our performance results are competitive with the fastest known graph traversal-based approaches and are significantly faster than the Graph Challenge reference implementations: up to 670,000 times faster than the C++ reference and 10,000 times faster than the Python reference on a single Intel Haswell node.
Citations: 61
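A dense NumPy sketch of the linear-algebra formulation, using the common lower-triangular variant of the identity (the paper's KokkosKernels implementation operates on sparse matrices with parallel kernels; this only shows the underlying math):

```python
import numpy as np

def triangle_count(A):
    """Count triangles of an undirected graph from its adjacency matrix.

    With L = tril(A), every triangle {a < b < c} is counted exactly once:
    (L @ L)[c, a] counts wedges c > b > a, and the elementwise product
    with L keeps only the wedges closed by an edge (c, a).
    """
    L = np.tril(A)
    return int(((L @ L) * L).sum())

# 4-clique adjacency matrix: contains C(4,3) = 4 triangles.
K4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
```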
Triangle counting for scale-free graphs at scale in distributed memory
R. Pearce
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091051
Abstract: Triangle counting has long been a challenge problem for sparse graphs containing high-degree "hub" vertices that exist in many real-world scenarios. These high-degree vertices create a quadratic number of wedges, or 2-edge paths, which for brute-force algorithms require closure checking, or wedge checks. Our work-in-progress builds on existing heuristics for pruning the number of wedge checks by ordering based on degree and other simple metrics. Such heuristics can dramatically reduce the number of required wedge checks for exact triangle counting for both real and synthetic scale-free graphs. Our triangle counting algorithm is implemented using HavoqGT, an asynchronous vertex-centric graph analytics framework for distributed memory. We present a brief experimental evaluation on two large real scale-free graphs, a 128B-edge web graph and a 1.4B-edge Twitter follower graph, and a weak-scaling study on synthetic Graph500 RMAT graphs up to 274.9 billion edges.
Citations: 57
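The degree-ordering heuristic described above can be sketched as a minimal serial version (HavoqGT's distributed, asynchronous implementation is far more involved):

```python
from collections import defaultdict
from itertools import combinations

def count_triangles(edges):
    """Count triangles, generating wedges only at each edge's lower-ranked
    endpoint (rank = degree, ties broken by vertex id). High-degree "hub"
    vertices rank last, so they generate few outgoing wedges to check."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda v: (deg[v], v)
    out = defaultdict(set)  # edges oriented from lower to higher rank
    for u, v in edges:
        lo, hi = (u, v) if rank(u) < rank(v) else (v, u)
        out[lo].add(hi)
    triangles = 0
    for u in list(out):
        for v, w in combinations(out[u], 2):
            # Wedge check: is the closing edge between v and w present?
            if w in out[v] or v in out[w]:
                triangles += 1
    return triangles
```

On a star graph the hub generates no wedges at all, which is exactly the pruning effect the heuristic targets.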
Sparse matrix assembly on the GPU through multiplication patterns
Rhaleb Zayer, M. Steinberger, H. Seidel
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091057
Abstract: The numerical treatment of variational problems gives rise to large sparse matrices, which are typically assembled by coalescing elementary contributions. As the explicit matrix form is required by numerical solvers, the assembly step can be a potential bottleneck, especially in implicit and time-dependent settings where considerable updates are needed. On standard HPC platforms, this process can be vectorized by taking advantage of additional mesh-querying data structures. However, on graphics hardware, vectorization is inhibited by limited memory resources. In this paper, we propose a lean unstructured mesh representation, which allows casting the assembly problem as a sparse matrix-matrix multiplication. We demonstrate how the global graph connectivity of the assembled matrix can be captured through basic linear algebra operations and show how local interactions between nodes/degrees of freedom within an element can be encoded by means of a concise representation, action maps. These ideas not only reduce the memory storage requirements but also cut down on the bulk of data that needs to be moved from global storage to the compute units, which is crucial on parallel computing hardware, and in particular on the GPU. Furthermore, we analyze the effect of mesh memory layout on the assembly performance.
Citations: 9
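The idea of casting assembly as sparse matrix products can be sketched with SciPy. The paper's GPU version uses a lean mesh representation and action maps; the element list, local matrices, and helper name below are illustrative only.

```python
import numpy as np
import scipy.sparse as sp

def assemble(n_dofs, elements, local_mats):
    """Assemble a global matrix as P.T @ blockdiag(K_e) @ P, where P
    gathers global degrees of freedom into element-local slots."""
    rows, cols, offset = [], [], 0
    for elem in elements:
        for i, g in enumerate(elem):
            rows.append(offset + i)
            cols.append(g)
        offset += len(elem)
    P = sp.csr_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(offset, n_dofs))
    K = sp.block_diag([sp.csr_matrix(k) for k in local_mats], format="csr")
    return (P.T @ K @ P).tocsr()

# Two 1D linear elements sharing node 1 yield the classic tridiagonal
# stiffness matrix: shared contributions coalesce at the middle node.
k_e = [[1.0, -1.0], [-1.0, 1.0]]
A = assemble(3, [(0, 1), (1, 2)], [k_e, k_e])
```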
Ultra-high fidelity radio frequency propagation modeling using distributed high performance graphical processing units: A simulator for multi-element non-stationary antenna systems
Mark D. Barnell, Nathan Stokes, Jason Steeger, Jessie Grabowski
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091082
Abstract: A new distributed, high-performance graphical processing framework that simulates complex radio frequency (RF) propagation has been developed and demonstrated. The approach uses an advanced computer architecture and an intensive multi-core system to enable high-performance data analysis at the fidelity necessary to design and develop modern sensor systems. This widely applicable simulation and modeling technology aids in the design and development of state-of-the-art systems with complex waveforms and more advanced downstream exploitation techniques, e.g., systems with arbitrary RF waveforms, higher RF bandwidths, and increasing resolution. Recent breakthroughs in computing hardware, software, systems, and applications have enabled these concepts to be tested and demonstrated in a large variety of environments and early in the design cycle. Improvements in simulation accuracy and simulation timescales have been made that immediately increase the value to the end user. A near-analytic RF propagation model increased the computational need by orders of magnitude, and also increased the required numerical precision. New general-purpose graphics processing units (GPGPUs) provided the capability to simulate the propagation effects and model them with the necessary information dependence and floating-point mathematics where performance matters. The relative performance improvement between the baseline MATLAB® parallelized simulation and the equivalent GPU-based simulation, using 12 NVIDIA Tesla K20m GPUs on the Offspring High-Performance Computer (HPC) with the AirWASP© framework, reduced simulation and modeling time from 16.5 days to less than 1 day.
Citations: 0
A cloud-based brain connectivity analysis tool
L. Brattain, Mihnea Bulugioiu, Adam Brewster, Mark Hernandez, Heejin Choi, T. Ku, Kwanghun Chung, V. Gadepally
2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091080
Abstract: With advances in high-throughput brain imaging at the cellular and sub-cellular level, there is growing demand for platforms that can support high-performance, large-scale brain data processing and analysis. In this paper, we present a novel pipeline that combines Accumulo, D4M, geohashing, and parallel programming to manage large-scale neuron connectivity graphs in a cloud environment. Our brain connectivity graph is represented using vertices (fiber start/end nodes), edges (fiber tracks), and the 3D coordinates of the fiber tracks. For optimal performance, we take the hybrid approach of storing vertices and edges in Accumulo and saving the fiber track 3D coordinates in flat files. Accumulo database operations offer low latency on sparse queries, while flat files offer high throughput for storing, querying, and analyzing bulk data. We evaluated our pipeline using 250 gigabytes of mouse neuron connectivity data. Benchmarking experiments on retrieving vertices and edges from Accumulo demonstrate that we can achieve a 1-2 order of magnitude speedup in retrieval time compared to the same operation on traditional flat files. The implementation of graph analytics such as Breadth-First Search using Accumulo and D4M offers consistently good performance regardless of data size and density, and thus is scalable to very large datasets. Indexing of neuron subvolumes is simple and logical with geohashing-based binary-tree encoding. This hybrid data management backend drives an interactive web-based 3D graphical user interface, where users can examine the 3D connectivity map in a Google Maps-like viewer. Our pipeline is scalable and extensible to other data modalities.
Citations: 1
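The geohashing-based binary-tree encoding of 3D coordinates can be illustrated with a Morton (bit-interleaving) code, a common way to realize such an octree index; this sketch shows the general technique, not necessarily the pipeline's exact encoding.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of three integer coordinates into one code.
    Spatially nearby points share long code prefixes, so the code acts
    as a binary-tree (octree) index over 3D subvolumes."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code
```

Range queries over a subvolume then reduce to scans over a contiguous range of codes, which maps naturally onto Accumulo's sorted-key storage model.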