Latest Publications: 2019 IEEE High Performance Extreme Computing Conference (HPEC)

Embedded Processor-In-Memory Architecture for Accelerating Arithmetic Operations
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916496
Richard Muri, P. Fortier
Abstract: A processor-in-memory (PIM) computer architecture is any design that performs some subset of logical operations in the same location as memory. The traditional model of computing involves a processor loading data from memory to perform operations, with a bus connecting the processor and memory. While this technique works well in many situations, a growing gap between memory performance and processor performance has led some researchers to develop alternative architectures. This paper details the implementation of a PIM architecture in a soft core microcontroller used to accelerate applications limited by register file size. Using an Artix-7 FPGA, an ATmega103 microcontroller soft core is modified to include a PIM core as an accelerator. The sample application of AES encryption provides a comparison between the baseline processor and the PIM enhanced machine. AES encryption using the modified microcontroller requires 38% fewer clock cycles without relying on application specific improvements, at the expense of increased program memory size and FPGA fabric utilization.
Citations: 3
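The argument for PIM is that round trips over the processor-memory bus dominate the cost of simple arithmetic on large operands. The toy model below illustrates that trade-off only; the cycle costs are hypothetical illustration values, not figures from the paper or the ATmega103 design.

```python
# Toy cost model contrasting a conventional load/compute/store loop with a
# processor-in-memory (PIM) style operation that avoids bus transfers.
# The cycle costs are hypothetical, chosen only to illustrate the idea.

BUS_TRANSFER_CYCLES = 2   # hypothetical cost to move one word over the bus
ALU_OP_CYCLES = 1         # hypothetical cost of one arithmetic operation

def conventional_cycles(num_words: int) -> int:
    """Load each operand word over the bus, operate, and store the result back."""
    loads = num_words * BUS_TRANSFER_CYCLES
    ops = num_words * ALU_OP_CYCLES
    stores = num_words * BUS_TRANSFER_CYCLES
    return loads + ops + stores

def pim_cycles(num_words: int) -> int:
    """Operate directly where the data lives; no bus round trips."""
    return num_words * ALU_OP_CYCLES

if __name__ == "__main__":
    n = 256  # e.g., state for many AES blocks held in memory
    print("conventional:", conventional_cycles(n), "cycles")
    print("PIM-style:   ", pim_cycles(n), "cycles")
```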
Improving Scheduling for Irregular Applications with Logarithmic Radix Binning
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916333
James Fox, Alok Tripathy, Oded Green
Abstract: Effective scheduling and load balancing of applications on massively multi-threading systems remains challenging despite decades of research, especially for irregular and data dependent problems where the execution control path is unknown until run-time. One of the most widely used load-balancing schemes used for data dependent problems is a parallel prefix sum (PPS) array over the expected amount of work per task, followed by a partitioning of tasks to threads. While sufficient for many systems, it is not ideal for massively multithreaded systems with SIMD/SIMT execution, such as GPUs. More fine-grained load-balancing is needed to effectively utilize SIMD/SIMT units. In this paper we introduce Logarithmic Radix Binning (LRB) as a more suitable alternative to parallel prefix summation for load-balancing on such systems. We show that LRB has better scalability than PPS for high thread counts on Intel's Knight's Landing processor and comparable scalability on NVIDIA Volta GPUs. On the application side, we show how LRB improves the performance of PageRank up to 1.75X using the branch-avoiding model. We also show how to better load-balance segmented sort and improve performance on the GPU.
Citations: 4
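The core idea of logarithmic radix binning is to group tasks whose work estimates fall within a factor of two of each other, so each bin can be scheduled with a uniform granularity (for example, one warp per task within a bin). A minimal sketch of the binning step follows; it illustrates the concept on the CPU and is not the authors' GPU implementation.

```python
import math
from collections import defaultdict

def logarithmic_radix_binning(work_per_task):
    """Group task indices into bins keyed by ceil(log2(work)).

    Tasks whose work differs by at most 2x land in the same bin, so each bin
    can be dispatched with a single, uniform scheduling granularity.
    """
    bins = defaultdict(list)
    for task_id, work in enumerate(work_per_task):
        key = 0 if work <= 1 else math.ceil(math.log2(work))
        bins[key].append(task_id)
    return dict(bins)

if __name__ == "__main__":
    # e.g., per-vertex degrees of a skewed graph
    degrees = [1, 2, 3, 7, 8, 120, 130, 4000]
    for k, tasks in sorted(logarithmic_radix_binning(degrees).items()):
        print(f"bin 2^{k}: tasks {tasks}")
```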
Cyber Baselining: Statistical properties of cyber time series and the search for stability
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916350
A. Schulz, Ethan Aubin, P. Trepagnier, A. Wollaber
Abstract: Many predictive cyber analytics assume, implicitly or explicitly, that the underlying statistical processes they treat have simple properties. Often statistics predicated on Wiener processes are used, but even if not, assumptions on statistical stationarity, ergodicity, and memorylessness are often present. We present here empirical observations of several common network time series, and demonstrate that these assumptions are false; the series are non-stationary, non-ergodic, and possess complicated correlation structures. We compute several statistical tests, borrowed from other disciplines, for the evaluation of network time series. We discuss the implications of these results on the larger goal of constructing a meaningful cyber baseline of a network or host, intended to establish the bounds of "normal" behavior. For many common network observables used in defensive cyber operations, it may prove to be unrealistic to establish such a baseline, or detect significant deviations from it.
Citations: 0
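One standard test for the stationarity assumption the authors question is the Augmented Dickey-Fuller unit-root test. The sketch below applies it to a synthetic random walk standing in for a real observable such as bytes-per-minute on a link; the paper's specific test battery and data are not reproduced here.

```python
# Minimal sketch: test a (synthetic) network time series for stationarity
# with the Augmented Dickey-Fuller test from statsmodels.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
# A random walk is a classic non-stationary process.
series = np.cumsum(rng.normal(size=2000))

stat, pvalue, *_ = adfuller(series)
print(f"ADF statistic = {stat:.3f}, p-value = {pvalue:.3f}")
if pvalue < 0.05:
    print("Reject the unit-root null: the series looks stationary.")
else:
    print("Cannot reject the unit-root null: treat the series as non-stationary.")
```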
An Efficient and Composable Parallel Task Programming Library
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916447
Chun-Xun Lin, Tsung-Wei Huang, Guannan Guo, Martin D. F. Wong
Abstract: Composability is a key component to improve programmers' productivity in writing fast market-expanding applications such as parallel machine learning algorithms and big data analytics. These applications exhibit both regular and irregular compute patterns, and are often combined with other functions or libraries to compose a larger program. However, composable parallel processing has taken a back seat in many existing parallel programming libraries, making it difficult to achieve modularity in large-scale parallel programs. In this paper, we introduce a new parallel task programming library using composable tasking graphs. Our library efficiently supports task parallelism together with an intuitive task graph construction and flexible execution API set to enable reusable and composable task dependency graphs. Developers can quickly compose a large parallel program from small and modular parallel building blocks, and easily deploy the program on a multicore machine. We have evaluated our library on real-world applications. Experimental results showed our library can achieve comparable performance to Intel Threading Building Blocks with less coding effort.
Citations: 7
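To make "composable task graphs" concrete, the following is a small, illustrative Python sketch of the general pattern: a dependency graph whose entirety can be embedded as a single task inside a larger graph. It is not the authors' library or its API; the class and method names are invented for illustration.

```python
# Illustrative sketch of a composable task-dependency graph (hypothetical API).
from concurrent.futures import ThreadPoolExecutor

class TaskGraph:
    def __init__(self):
        self._tasks = {}   # name -> callable
        self._deps = {}    # name -> set of prerequisite names

    def add(self, name, fn, after=()):
        self._tasks[name] = fn
        self._deps[name] = set(after)
        return name

    def compose(self, name, subgraph, after=()):
        """Embed another TaskGraph as a single task of this graph."""
        return self.add(name, subgraph.run, after)

    def run(self, max_workers=4):
        done, pending = set(), dict(self._deps)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while pending:
                ready = [n for n, deps in pending.items() if deps <= done]
                if not ready:
                    raise RuntimeError("dependency cycle detected")
                # Run each wave of ready tasks in parallel, then advance.
                futures = {n: pool.submit(self._tasks[n]) for n in ready}
                for n, f in futures.items():
                    f.result()
                    done.add(n)
                    del pending[n]

if __name__ == "__main__":
    inner = TaskGraph()
    inner.add("load", lambda: print("load data"))
    inner.add("clean", lambda: print("clean data"), after=["load"])

    outer = TaskGraph()
    outer.compose("preprocess", inner)          # reuse the whole inner graph
    outer.add("train", lambda: print("train model"), after=["preprocess"])
    outer.run()
```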
Applying Neuromorphic Computing to Compressive Sensing
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916531
R. Scrofano, Douglas Enright, G. Valley
Abstract: As the computing community moves toward processing at the edge, there is a need for computing systems that are both high performance and power efficient. Neuromorphic computing systems have the potential to fill this need.
Citations: 0
Accelerating Sparse Deep Neural Networks on FPGAs
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916419
Sitao Huang, Carl Pearson, R. Nagi, Jinjun Xiong, Deming Chen, Wen-mei W. Hwu
Abstract: Deep neural networks (DNNs) have been widely adopted in many domains, including computer vision, natural language processing, and medical care. Recent research reveals that sparsity in DNN parameters can be exploited to reduce inference computational complexity and improve network quality. However, sparsity also introduces irregularity and extra complexity in data processing, which make the accelerator design challenging. This work presents the design and implementation of a highly flexible sparse DNN inference accelerator on FPGA. Our proposed inference engine can be easily configured to be used in both mobile computing and high-performance computing scenarios. Evaluation shows our proposed inference engine effectively accelerates sparse DNNs and outperforms CPU solution by up to 4.7× in terms of energy efficiency.
Citations: 17
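The arithmetic kernel such an accelerator speeds up is a sparse-weight layer: a sparse-matrix/dense-vector product followed by an activation. The sketch below shows that computation in software with a CSR-stored weight matrix; it illustrates the workload only and is not the paper's FPGA design.

```python
# Minimal sketch of a sparse DNN layer: y = ReLU(W @ x + b), W stored in CSR.
import numpy as np
from scipy.sparse import random as sparse_random

def sparse_layer(weights_csr, bias, x):
    """Apply one sparse fully connected layer with ReLU activation."""
    return np.maximum(weights_csr @ x + bias, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 90% of the weights pruned to zero (10% density).
    W = sparse_random(256, 512, density=0.10, format="csr", random_state=0)
    b = rng.normal(size=256)
    x = rng.normal(size=512)
    y = sparse_layer(W, b, x)
    print("output shape:", y.shape, "nonzero activations:", int((y > 0).sum()))
```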
QxSQA: GPGPU-Accelerated Simulated Quantum Annealer within a Non-Linear Optimization and Boltzmann Sampling Framework
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916450
Dan Padilha, Serge Weinstock, Mark Hodson
Abstract: We introduce QxSQA, a GPGPU-Accelerated Simulated Quantum Annealer based on Path-Integral Monte Carlo (PIMC). QxSQA is tuned for finding low-energy solutions to integer, non-linear optimization problems of up to 2^14 (16,384) binary variables with quadratic interactions on a single GPU instance. Experimental results demonstrate QxSQA can solve Maximum Clique test problems of 8,100 binary variables with planted solutions in under one minute, with linear scaling against key optimization parameters on other large-scale problems. Through the PIMC formulation, QxSQA also functions as an accurate sampler of Boltzmann distributions for machine learning applications. Experimental characterization of Boltzmann sampling results for a reinforcement learning problem showed good convergence performance at useful scales. Our implementation integrates as a solver within our QxBranch developer platform, positioning developers to efficiently develop applications using QxSQA, and then test the same application code on a quantum annealer or universal quantum computer hardware platform such as those from D-Wave Systems, IBM, or Rigetti Computing.
Citations: 2
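For readers unfamiliar with the problem class, the sketch below anneals an Ising energy with single-spin Metropolis updates. It is only the classical skeleton: QxSQA's path-integral Monte Carlo additionally introduces Trotter replicas coupled through a transverse field, which is omitted here, and the temperature schedule shown is an arbitrary example.

```python
# Heavily simplified sketch: classical simulated annealing of an Ising energy.
import numpy as np

def anneal_ising(J, h, sweeps=2000, t_hot=5.0, t_cold=0.05, seed=0):
    """Minimize E(s) = s^T J s + h^T s over s in {-1, +1}^n."""
    rng = np.random.default_rng(seed)
    n = len(h)
    s = rng.choice([-1, 1], size=n)
    for T in np.geomspace(t_hot, t_cold, sweeps):
        for i in rng.permutation(n):
            # Energy change from flipping spin i (J symmetric, zero diagonal).
            delta = -2 * s[i] * (2 * J[i] @ s + h[i])
            if delta <= 0 or rng.random() < np.exp(-delta / T):
                s[i] = -s[i]
    return s, s @ J @ s + h @ s

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 32
    J = rng.normal(scale=0.5, size=(n, n))
    J = np.triu(J, 1); J = J + J.T          # symmetric couplings, zero diagonal
    h = rng.normal(size=n)
    spins, energy = anneal_ising(J, h)
    print("final energy:", round(float(energy), 3))
```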
Hardware IP Classification through Weighted Characteristics
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916225
Brendan McGeehan, Flora Smith, Thao Le, Hunter Nauman, Jia Di
Abstract: Today's business model for hardware designs frequently incorporates third-party Intellectual Property (IP) mainly due to economic motivations. However, allowing third-party involvement also increases the possibility of malicious attacks, such as hardware Trojan insertion, which is a particularly dangerous security threat because functional testing can often leave the Trojan undetected. This research provides an improvement on a Trojan detection method and tool known as Structural Checking which analyzes Register-Transfer Level (RTL) soft IPs. Given an unknown IP, the tool will break down the design and label ports and signals with assets. Analyzing the asset patterns reveals how the IP is structured and provides information about its overall functionality. The tool incorporates a library of known designs referred to as the Golden Reference Library (GRL). All entries in the library, grouped into known-clean and known-infested, are analyzed in the same manner. A weighted percent match for each library entry against the unknown IP is calculated. A report is generated detailing all mismatched locations where users need to take a closer look. Due to the structural variability of soft IP designs, it is vital to provide the best possible weighting to best match the unknown IP to the most similar library entry. This paper provides a statistical approach to finding the best weights to optimize the tool's matching algorithm.
Citations: 1
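The weighted percent match itself is a simple computation: each asset category contributes its weight to the score when the unknown IP and a Golden Reference Library entry agree. The sketch below illustrates that calculation; the asset names and weights are hypothetical and are not the ones used by Structural Checking.

```python
# Illustrative weighted percent-match between an unknown IP's asset pattern
# and one Golden Reference Library (GRL) entry. Asset names/weights are made up.

def weighted_match(unknown_assets, reference_assets, weights):
    """Return the weighted fraction of asset categories that agree."""
    total = sum(weights.values())
    matched = sum(
        w for asset, w in weights.items()
        if unknown_assets.get(asset) == reference_assets.get(asset)
    )
    return matched / total

if __name__ == "__main__":
    weights = {"clock": 1.0, "reset": 1.0, "data_in": 2.5, "data_out": 2.5, "control": 1.5}
    unknown = {"clock": "clk", "reset": "rst_n", "data_in": "bus32", "data_out": "bus32", "control": "fsm"}
    grl_entry = {"clock": "clk", "reset": "rst_n", "data_in": "bus32", "data_out": "bus16", "control": "fsm"}
    score = weighted_match(unknown, grl_entry, weights)
    print(f"weighted percent match: {100 * score:.1f}%")
```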
H-INDEX: Hash-Indexing for Parallel Triangle Counting on GPUs
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916492
Santosh Pandey, X. Li, A. Buluç, Jiejun Xu, Hang Liu
Abstract: Triangle counting is a graph algorithm that calculates the number of triangles involving each vertex in a graph. Briefly, a triangle encompasses three vertices from a graph, where every vertex possesses at least one incident edge to the other two vertices from the triangle. Consequently, list intersection, which identifies the incident edges, becomes the core algorithm for triangle counting. Meanwhile, attracted by the enormous parallel computing potential of Graphics Processing Units (GPUs), numerous efforts have been devoted to deploying triangle counting algorithms on GPUs. While state-of-the-art intersection algorithms, such as merge-path and binary-search, perform well on traditional multi-core CPU systems, deploying them on massively parallel GPUs turns out to be challenging. In particular, the merge-path based approach experiences the hardship of evenly distributing the workload across vast GPU threads and irregular memory accesses. The binary-search based approach often suffers from the potential problem of high time complexity. Furthermore, both approaches require sorted neighbor lists from the input graphs, which involves nontrivial preprocessing overhead. To this end, we introduce H-INDEX, a hash-indexing assisted triangle counting algorithm that overcomes all the aforementioned shortcomings. Notably, H-INDEX achieves a 141.399 billion TEPS computing rate on a Protein K-mer V2a graph with 64 GPUs. To the best of our knowledge, this is the first work that advances triangle counting beyond the 100 billion TEPS rate.
Citations: 28
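The sketch below shows the hash-indexed list-intersection idea on the CPU: for each edge, common neighbors are counted by probing a hash index built over an adjacency list, so neighbor lists do not need to be sorted. The paper's GPU kernel follows the same principle with hash tables built in fast on-chip memory; that parallel machinery is not reproduced here.

```python
# Minimal CPU sketch of hash-indexed list intersection for triangle counting.
from collections import defaultdict

def triangle_count(edges):
    adj = defaultdict(set)      # each set acts as a per-vertex hash index
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    triangles = 0
    for u, v in edges:
        # Iterate the smaller neighbor list and probe the larger one's hash index.
        small, large = (adj[u], adj[v]) if len(adj[u]) < len(adj[v]) else (adj[v], adj[u])
        triangles += sum(1 for w in small if w in large)
    # Each triangle is counted once per of its three (undirected) edges.
    return triangles // 3

if __name__ == "__main__":
    # A 4-clique contains exactly 4 triangles.
    clique = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print("triangles:", triangle_count(clique))
```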
Performance of Training Sparse Deep Neural Networks on GPUs
2019 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date: 2019-09-01 DOI: 10.1109/HPEC.2019.8916506
Jianzong Wang, Zhangcheng Huang, Lingwei Kong, Jing Xiao, Pengyu Wang, Lu Zhang, Chao Li
Abstract: Deep neural networks have revolutionized the field of machine learning by dramatically improving the state-of-the-art in various domains. The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to store and train them quickly. Over the past few decades, researchers have explored the prospect of sparsifying DNNs before, during, and after training by pruning edges from the underlying topology. After this operation, the generated neural network is known as a sparse neural network. More recent work has demonstrated the remarkable result that certain sparse DNNs can train to the same precision as dense DNNs at lower runtime and storage cost. Existing methods ease the situation in which high demand for computational resources severely hinders the deployment of large-scale DNNs on resource-constrained devices, so that DNNs can be trained at a faster speed and lower cost. In this work, we propose a Fine-tune Structured Sparsity Learning (FSSL) method to regularize the structures of DNNs and accelerate their training. FSSL can: (1) learn a compact structure from a large sparse DNN to reduce computation cost; and (2) obtain a hardware-friendly structure to accelerate DNN evaluation efficiently. Experimental results on training time and compression rate show superior performance and efficiency compared to the MATLAB example code. These speedups are about twice those of non-structured sparsity.
Citations: 10
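Structured sparsity learning is commonly realized with a group-lasso penalty that pushes whole weight groups (for example, output channels) to zero so hardware can skip them. The sketch below adds such a penalty to a PyTorch training loss; it illustrates the general SSL idea under that assumption, not necessarily the exact FSSL procedure of this paper.

```python
# Sketch of a group-lasso (structured sparsity) penalty over conv output channels.
import torch
import torch.nn as nn

def group_lasso_penalty(conv: nn.Conv2d) -> torch.Tensor:
    """Sum of L2 norms over output-channel weight groups of a conv layer."""
    # weight shape: (out_channels, in_channels, kH, kW); one group per output channel.
    return conv.weight.flatten(1).norm(dim=1).sum()

if __name__ == "__main__":
    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 8, 3, padding=1))
    x = torch.randn(4, 3, 32, 32)
    target = torch.randn(4, 8, 32, 32)
    task_loss = nn.functional.mse_loss(model(x), target)
    reg = sum(group_lasso_penalty(m) for m in model.modules() if isinstance(m, nn.Conv2d))
    loss = task_loss + 1e-3 * reg   # 1e-3 is an arbitrary example regularization weight
    loss.backward()
    print("total loss:", float(loss))
```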