Indar Sugiarto, Pedro B. Campos, Nizar Dahir, G. Tempesti, S. Furber
{"title":"Optimized task graph mapping on a many-core neuromorphic supercomputer","authors":"Indar Sugiarto, Pedro B. Campos, Nizar Dahir, G. Tempesti, S. Furber","doi":"10.1109/HPEC.2017.8091066","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091066","url":null,"abstract":"This paper presents an approach for improving the overall performance of a general purpose application running as a task graph on a many-core neuromorphic supercomputer. Our task graph framework is based on graceful degradation and amelioration paradigms that strive to achieve high reliability and performance by incorporating fault tolerance and task spawning features. The optimization is applied on an instance of the task graph by performing a soft load balancing on the data traffic between nodes in the graph. We implemented the framework and its optimization on SpiNNaker, a many-core neuromorphic platform containing a million ARM9 processing cores. We evaluate our method using several static mapping examples, where some of them were generated using an evolutionary algorithm. The experiment demonstrates that a performance improvement of up to 8.2% can be achieved when implementing our algorithm on a fully-utilized SpiNNaker communication infrastructure.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114246554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic trace-based sampling algorithm for memory usage tracking of enterprise applications","authors":"Houssem Daoud, Naser Ezzati-Jivan, M. Dagenais","doi":"10.1109/HPEC.2017.8091061","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091061","url":null,"abstract":"Excessive memory usage in software applications has become a frequent issue. A high degree of parallelism and the monitoring difficulty for the developer can quickly lead to memory shortage, or can increase the duration of garbage collection cycles. There are several solutions introduced to monitor memory usage in software. However, they are neither efficient nor scalable. In this paper, we propose a dynamic tracing-based sampling algorithm to collect and analyse run time information and metrics for memory usage. It is implemented as a kernel module which gathers memory usage data from operating system structures only when a predefined condition is set or a threshold is passed. The thresholds and conditions are preset but can be changed dynamically, based on the application behavior. We tested our solutions to monitor several applications and our evaluation results show that the proposed method generates compact trace data and reduces the time needed for the analysis, without losing precision.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121098833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Halappanavar, Hao Lu, A. Kalyanaraman, Antonino Tumeo
{"title":"Scalable static and dynamic community detection using Grappolo","authors":"M. Halappanavar, Hao Lu, A. Kalyanaraman, Antonino Tumeo","doi":"10.1109/HPEC.2017.8091047","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091047","url":null,"abstract":"Graph clustering, popularly known as community detection, is a fundamental kernel for several applications of relevance to the Defense Advanced Research Projects Agency's (DARPA) Hierarchical Identify Verify Exploit (HIVE) Program. Clusters or communities represent natural divisions within a network that are densely connected within a cluster and sparsely connected to the rest of the network. The need to compute clustering on large scale data necessitates the development of efficient algorithms that can exploit modern architectures that are fundamentally parallel in nature. However, due to their irregular and inherently sequential nature, many of the current algorithms for community detection are challenging to parallelize. In response to the HIVE Graph Challenge, we present several parallelization heuristics for fast community detection using the Louvain method as the serial template. We implement all the heuristics in a software library called Grappolo. Using the inputs from the HIVE Challenge, we demonstrate superior performance and high quality solutions based on four parallelization heuristics. We use Grappolo on static graphs as the first step towards community detection on streaming graphs.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"169 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122563181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shreyas G. Singapura, Ajitesh Srivastava, R. Kannan, V. Prasanna
{"title":"OSCAR: Optimizing SCrAtchpad reuse for graph processing","authors":"Shreyas G. Singapura, Ajitesh Srivastava, R. Kannan, V. Prasanna","doi":"10.1109/HPEC.2017.8091070","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091070","url":null,"abstract":"Recently, architectures with scratchpad memory are gaining popularity. These architectures consist of low bandwidth, large capacity DRAM and high bandwidth, user addressable small capacity scratchpad. Existing algorithms must be redesigned to take advantage of the high bandwidth while overcoming the constraint on capacity of scratchpad. In this paper, we propose an optimized edge-centric graph processing algorithm for scratchpad based architectures. Our key contribution is significant reduction in (slower) DRAM accesses through intelligent reuse of scratchpad data. We trade off reduction in DRAM accesses for slightly higher scratchpad accesses. However, due to the much higher bandwidth of scratchpad, the total memory access cost (DRAM + scratchpad) is significantly reduced. We validate our analysis with experiments on real world graphs using a simulator which mimics the scratchpad based architecture using Single Source Shortest Path (SSSP) and Breadth First Search (BFS). Our experimental results demonstrate 1.7× to 2.7× reduction in DRAM accesses leading to an improvement of 1.4× to 2× in total memory (DRAM + scratchpad) accesses.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124294374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithm and hardware co-optimized solution for large SpMV problems","authors":"Fazle Sadi, L. Pileggi, F. Franchetti","doi":"10.1109/HPEC.2017.8091096","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091096","url":null,"abstract":"Sparse Matrix-Vector multiplication (SpMV) is a fundamental kernel for many scientific and engineering applications. However, SpMV performance and efficiency are poor on commercial off-the-shelf (COTS) architectures, especially when the data size exceeds on-chip memory or last level cache (LLC). In this work we present an algorithm co-optimized hardware accelerator for large SpMV problems. We start with exploring the basic difference in data transfer characteristics for various SpMV algorithms. We propose an algorithm that requires the least amount of data transfer while ensuring main memory streaming for all accesses. However, the proposed algorithm requires an efficient multi-way merge, which is difficult to achieve with COTS architectures. Hence, we propose a hardware accelerator model that includes an Application Specific Integrated Circuit (ASIC) for the multi-way merge operation. The proposed accelerator incorporates state of the art 3D stacked High Bandwidth Memory (HBM) in order to demonstrate the proposed algorithm's capability coupled with the latest technologies. Simulation results using standard benchmarks show improvements of over 100× against COTS architectures with commercial libraries for both energy efficiency and performance.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131385640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Lynch, Jose Borreguero Calvo, E. Deelman, Rafael Ferreira da Silva, Monojoy Goswami, Yawei Hui, E. Lingerfelt, J. Vetter
{"title":"Distributed workflows for modeling experimental data","authors":"V. Lynch, Jose Borreguero Calvo, E. Deelman, Rafael Ferreira da Silva, Monojoy Goswami, Yawei Hui, E. Lingerfelt, J. Vetter","doi":"10.1109/HPEC.2017.8091071","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091071","url":null,"abstract":"Modeling helps explain the fundamental physics hidden behind experimental data. In the case of material modeling, running one simulation rarely results in output that reproduces the experimental data. Often one or more of the force field parameters are not precisely known and must be optimized for the output to match that of the experiment. Since the simulations require high performance computing (HPC) resources and there are usually many simulations to run, a workflow is very useful to prevent errors and assure that the simulations are identical except for the parameters that need to be varied. The use of HPC implies distributed workflows, but the optimization and steps to compare the simulation results and experimental data are done on a local workstation. We will present results from force field refinement of data collected at the Spallation Neutron Source using Kepler, Pegasus, and BEAM workflows and discuss what we have learned from using these workflows.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133462399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel k-truss decomposition on multicore systems","authors":"H. Kabir, Kamesh Madduri","doi":"10.1109/HPEC.2017.8091052","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091052","url":null,"abstract":"We discuss our submission to the HPEC 2017 Static Graph Challenge on k-truss decomposition and triangle counting. Our results use an algorithm called PKT (Parallel k-truss) designed for multicore systems. We are able to process almost all Graph Challenge datasets in under a minute on a 24-core server with 128 GB memory. For a synthetic Graph500 graph with 17 million vertices and 523 million edges, triangle counting takes 16 seconds and truss decomposition takes 29 minutes on the 24-core server.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117271440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thomas B. Rolinger, T. Simon, Christopher D. Krieger
{"title":"Performance challenges for heterogeneous distributed tensor decompositions","authors":"Thomas B. Rolinger, T. Simon, Christopher D. Krieger","doi":"10.1109/HPEC.2017.8091023","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091023","url":null,"abstract":"Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121922920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rohit Varkey Thankachan, Eric R. Hein, B. Swenson, James P. Fairbanks
{"title":"Integrating productivity-oriented programming languages with high-performance data structures","authors":"Rohit Varkey Thankachan, Eric R. Hein, B. Swenson, James P. Fairbanks","doi":"10.1109/HPEC.2017.8091068","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091068","url":null,"abstract":"This paper shows that Julia provides sufficient performance to bridge the performance gap between productivity-oriented languages and low-level languages for complex memory intensive computation tasks such as graph traversal. We provide performance guidelines for using complex low-level data structures in high productivity languages and present the first parallel integration on the productivity-oriented language side for graph analysis. Performance on the Graph500 benchmark demonstrates that the Julia implementation is competitive with the native C/OpenMP implementation.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"PP 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126755861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Haidar, K. Kabir, Diana Fayad, S. Tomov, J. Dongarra
{"title":"Out of memory SVD solver for big data","authors":"A. Haidar, K. Kabir, Diana Fayad, S. Tomov, J. Dongarra","doi":"10.1109/HPEC.2017.8091029","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091029","url":null,"abstract":"Many applications — from data compression to numerical weather prediction and information retrieval — need to compute large dense singular value decompositions (SVD). When the problems are too large to fit into the computer's main memory, specialized out-of-core algorithms that use disk storage are required. A typical example is when trying to analyze a large data set through tools like MATLAB or Octave, but the data is just too large to be loaded. To overcome this, we designed a class of out-of-memory (OOM) algorithms to reduce, as well as overlap communication with computation. Of particular interest are OOM algorithms for matrices of size m × n, where m >> n or m << n, e.g., corresponding to cases of too many variables, or too many observations. To design OOM SVDs, we first study the communications cost for the SVD techniques as well as for the QR/LQ factorization followed by SVD. We present the theoretical analysis about the data movement cost and strategies to design OOM SVD algorithms. We show performance results for multicore architecture that illustrate our theoretical findings and match our performance models. Moreover, our experimental results show the feasibility and superiority of the OOM SVD.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131105191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}