2017 IEEE High Performance Extreme Computing Conference (HPEC): Latest Publications

Optimized task graph mapping on a many-core neuromorphic supercomputer

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-14 DOI: 10.1109/HPEC.2017.8091066

Indar Sugiarto, Pedro B. Campos, Nizar Dahir, G. Tempesti, S. Furber

Citations: 6

Dynamic trace-based sampling algorithm for memory usage tracking of enterprise applications

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091061

Houssem Daoud, Naser Ezzati-Jivan, M. Dagenais

Citations: 3

Scalable static and dynamic community detection using Grappolo

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091047

M. Halappanavar, Hao Lu, A. Kalyanaraman, Antonino Tumeo

Abstract: Graph clustering, popularly known as community detection, is a fundamental kernel for several applications of relevance to the Defense Advanced Research Projects Agency's (DARPA) Hierarchical Identify Verify Exploit (HIVE) Program. Clusters or communities represent natural divisions within a network that are densely connected within a cluster and sparsely connected to the rest of the network. The need to compute clustering on large scale data necessitates the development of efficient algorithms that can exploit modern architectures that are fundamentally parallel in nature. However, due to their irregular and inherently sequential nature, many of the current algorithms for community detection are challenging to parallelize. In response to the HIVE Graph Challenge, we present several parallelization heuristics for fast community detection using the Louvain method as the serial template. We implement all the heuristics in a software library called Grappolo. Using the inputs from the HIVE Challenge, we demonstrate superior performance and high quality solutions based on four parallelization heuristics. We use Grappolo on static graphs as the first step towards community detection on streaming graphs.

Citations: 32
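The Louvain method named in the abstract above greedily moves vertices between communities to increase modularity, a score that compares the number of edges inside each community with the number expected at random. The sketch below only computes that score for a fixed partition; it is not Grappolo's code, and the function name, the toy graph, and the labels are hypothetical illustrations.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to a community label.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # summed degree of the vertices in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (internal edge fraction) - (degree fraction)^2
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical toy input: two triangles joined by a single bridge edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
TWO_TRIANGLES = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
ONE_CLUSTER = {v: "a" for v in range(6)}

print(modularity(EDGES, TWO_TRIANGLES))  # 5/14, the natural split scores higher
print(modularity(EDGES, ONE_CLUSTER))    # 0.0, a single community scores zero
```

Each vertex move in the serial Louvain method depends on the community state left by the previous move, which is one way to see why the abstract calls such algorithms inherently sequential.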

OSCAR: Optimizing SCrAtchpad reuse for graph processing

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091070

Shreyas G. Singapura, Ajitesh Srivastava, R. Kannan, V. Prasanna

Abstract: Recently, architectures with scratchpad memory are gaining popularity. These architectures consist of low bandwidth, large capacity DRAM and high bandwidth, user addressable small capacity scratchpad. Existing algorithms must be redesigned to take advantage of the high bandwidth while overcoming the constraint on capacity of scratchpad. In this paper, we propose an optimized edge-centric graph processing algorithm for scratchpad based architectures. Our key contribution is significant reduction in (slower) DRAM accesses through intelligent reuse of scratchpad data. We trade off reduction in DRAM accesses for slightly higher scratchpad accesses. However, due to the much higher bandwidth of scratchpad, the total memory access cost (DRAM + scratchpad) is significantly reduced. We validate our analysis with experiments on real world graphs using a simulator which mimics the scratchpad based architecture using Single Source Shortest Path (SSSP) and Breadth First Search (BFS). Our experimental results demonstrate 1.7× to 2.7× reduction in DRAM accesses leading to an improvement of 1.4× to 2× in total memory (DRAM + scratchpad) accesses.

Citations: 9
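For readers unfamiliar with the term, "edge-centric" processing streams over the edge list on every iteration instead of walking per-vertex adjacency lists. The sketch below shows that access pattern for SSSP as a plain Bellman-Ford-style loop. It does not model a scratchpad or any data reuse, so it should not be read as the OSCAR algorithm; the function name and the toy graph are hypothetical.

```python
import math


def edge_centric_sssp(num_vertices, edges, source):
    """Single-source shortest paths by repeatedly streaming the edge list.

    edges: list of (src, dst, weight) triples; no negative-weight cycles assumed.
    Returns a list of distances, math.inf for unreachable vertices.
    """
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):
        changed = False
        for src, dst, weight in edges:  # one sequential pass over all edges
            if dist[src] + weight < dist[dst]:
                dist[dst] = dist[src] + weight
                changed = True
        if not changed:  # stop early once a full pass relaxes nothing
            break
    return dist


# Hypothetical toy graph with four vertices.
TOY_EDGES = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0)]
print(edge_centric_sssp(4, TOY_EDGES, 0))  # [0.0, 3.0, 1.0, 8.0]
```

The edge reads are sequential, but the reads and writes of `dist` are scattered across vertices; those vertex accesses are the kind of traffic that a small, fast memory could plausibly absorb.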

Algorithm and hardware co-optimized solution for large SpMV problems

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091096

Fazle Sadi, L. Pileggi, F. Franchetti

Abstract: Sparse Matrix-Vector multiplication (SpMV) is a fundamental kernel for many scientific and engineering applications. However, SpMV performance and efficiency are poor on commercial off-the-shelf (COTS) architectures, especially when the data size exceeds on-chip memory or last level cache (LLC). In this work we present an algorithm co-optimized hardware accelerator for large SpMV problems. We start with exploring the basic difference in data transfer characteristics for various SpMV algorithms. We propose an algorithm that requires the least amount of data transfer while ensuring main memory streaming for all accesses. However, the proposed algorithm requires an efficient multi-way merge, which is difficult to achieve with COTS architectures. Hence, we propose a hardware accelerator model that includes an Application Specific Integrated Circuit (ASIC) for the multi-way merge operation. The proposed accelerator incorporates state of the art 3D stacked High Bandwidth Memory (HBM) in order to demonstrate the proposed algorithm's capability coupled with the latest technologies. Simulation results using standard benchmarks show improvements of over 100× against COTS architectures with commercial libraries for both energy efficiency and performance.

Citations: 9
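One way a multi-way merge can arise in SpMV is sketched below: the matrix is split into column blocks, each block produces a row-sorted list of partial results, and the lists are merged by row index so that the output vector is written in order. This is an assumption about the general shape of such an algorithm, not the authors' implementation; the function names, the block layout, and the toy matrix are hypothetical, and `heapq.merge` stands in for the merge hardware described in the abstract.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def partial_products(block, x):
    """Multiply one column block by x.

    block: list of (row, col, value) triples sorted by row.
    Returns (row, partial_sum) pairs, still sorted by row.
    """
    return [(row, sum(val * x[col] for _, col, val in group))
            for row, group in groupby(block, key=itemgetter(0))]


def spmv_by_merge(blocks, x, num_rows):
    """Compute y = A @ x where A is given as column blocks of sorted triples."""
    partials = [partial_products(block, x) for block in blocks]
    y = [0.0] * num_rows
    # Multi-way merge: pairs arrive in row order, so y is updated sequentially.
    for row, value in heapq.merge(*partials, key=itemgetter(0)):
        y[row] += value
    return y


# Hypothetical 3 x 4 matrix [[1, 0, 2, 0], [0, 3, 0, 0], [4, 0, 0, 5]],
# split into a block for columns 0-1 and a block for columns 2-3.
BLOCKS = [
    [(0, 0, 1.0), (1, 1, 3.0), (2, 0, 4.0)],
    [(0, 2, 2.0), (2, 3, 5.0)],
]
X = [1.0, 2.0, 3.0, 4.0]
print(spmv_by_merge(BLOCKS, X, 3))  # [7.0, 6.0, 24.0]
```

Within a block only a slice of `x` is touched, and the merge step consumes and produces sorted streams, which fits the abstract's emphasis on streaming all main memory accesses.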

Distributed workflows for modeling experimental data

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091071

V. Lynch, Jose Borreguero Calvo, E. Deelman, Rafael Ferreira da Silva, Monojoy Goswami, Yawei Hui, E. Lingerfelt, J. Vetter

Citations: 1

Parallel k-truss decomposition on multicore systems

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091052

H. Kabir, Kamesh Madduri

Citations: 33

Performance challenges for heterogeneous distributed tensor decompositions

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091023

Thomas B. Rolinger, T. Simon, Christopher D. Krieger

Abstract: Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.

Citations: 3

Integrating productivity-oriented programming languages with high-performance data structures

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091068

Rohit Varkey Thankachan, Eric R. Hein, B. Swenson, James P. Fairbanks

Citations: 3

Out of memory SVD solver for big data

2017 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091029

A. Haidar, K. Kabir, Diana Fayad, S. Tomov, J. Dongarra

Abstract: Many applications, from data compression to numerical weather prediction and information retrieval, need to compute large dense singular value decompositions (SVD). When the problems are too large to fit into the computer's main memory, specialized out-of-core algorithms that use disk storage are required. A typical example is when trying to analyze a large data set through tools like MATLAB or Octave, but the data is just too large to be loaded. To overcome this, we designed a class of out-of-memory (OOM) algorithms to reduce communication as well as overlap it with computation. Of particular interest are OOM algorithms for matrices of size m × n, where m >> n or m << n, e.g., corresponding to cases of too many variables, or too many observations. To design OOM SVDs, we first study the communications cost for the SVD techniques as well as for the QR/LQ factorization followed by SVD. We present the theoretical analysis about the data movement cost and strategies to design OOM SVD algorithms. We show performance results for multicore architecture that illustrate our theoretical findings and match our performance models. Moreover, our experimental results show the feasibility and superiority of the OOM SVD.

Citations: 13
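The "QR factorization followed by SVD" route mentioned in the abstract above rests on a standard identity for tall matrices (m >> n), written out here as a short derivation. The symbols Q, R, U_R, Σ, and V are the usual textbook ones and are not taken from the paper itself.

```latex
\begin{align*}
A &= QR, && Q \in \mathbb{R}^{m \times n},\; Q^{\top} Q = I_n,\; R \in \mathbb{R}^{n \times n} \\
R &= U_R \Sigma V^{\top} && \text{(SVD of the small } n \times n \text{ factor)} \\
A &= (Q U_R)\, \Sigma\, V^{\top}, && (Q U_R)^{\top} (Q U_R) = U_R^{\top} Q^{\top} Q\, U_R = I_n
\end{align*}
```

The singular values of A are therefore those of R, and only the n × n factor has to go through a dense SVD; the large matrix is touched only by the QR step. For m << n the same argument applies to the LQ factorization A = LQ with an m × m factor L.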