{"title":"A Performance Analysis of Vector Length Agnostic Code","authors":"Angela Pohl, Mirko Greese, Biagio Cosenza, B. Juurlink","doi":"10.1109/HPCS48598.2019.9188238","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188238","abstract":"Vector extensions are a popular means to exploit data parallelism in applications. Over recent years, the most commonly used extensions have been growing in vector length and number of vector instructions. However, code portability remains a problem across a compute continuum. Hence, vector length agnostic (VLA) architectures have been proposed for future generations of ARM and RISC-V processors. With these architectures, code is vectorized independently of the vector length of the target hardware platform, making it possible to tune software to a generic vector length. To understand the performance impact of VLA code compared to vector length specific code, we analyze the current capabilities of code generation for ARM’s SVE architecture. Our experiments show that VLA code reaches about 90% of the performance of vector length specific code, i.e. a 10% overhead is incurred due to global predication of instructions. Furthermore, we show that code performance does not increase proportionally with increasing vector lengths due to higher memory demands.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
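The global predication this abstract attributes the ~10% overhead to can be illustrated with a small emulation. This is a hypothetical Python sketch, not the paper's code: on ARM SVE the per-lane predicate is built in hardware by `whilelt`, and `vla_add` and `vl` are invented names.

```python
def vla_add(a, b, vl):
    """Emulate a vector-length-agnostic loop: process `vl` lanes per
    iteration and mask off inactive lanes in the final partial chunk,
    the way SVE's `whilelt` predicate does."""
    n = len(a)
    out = [0.0] * n
    i = 0
    while i < n:
        # predicate: lane j is active while i + j < n
        mask = [i + j < n for j in range(vl)]
        for j, active in enumerate(mask):
            if active:
                out[i + j] = a[i + j] + b[i + j]
        i += vl  # advance by the hardware vector length
    return out
```

The result is identical for any `vl`, which is the portability claim of VLA code; the overhead comes from recomputing the predicate on every iteration even when all lanes are active.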
{"title":"High Performance and Scalable Simulations of a Bio-inspired Computational Model","authors":"Sandra Gómez Canaval, V. Mitrana, M. Păun, Stanislav Vararuk","doi":"10.1109/HPCS48598.2019.9188187","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188187","abstract":"The Network of Polarized Evolutionary Processors (NPEP) is a rather new variant of the bio-inspired computing model called Network of Evolutionary Processors (NEP). This model, together with its variants, is able to provide theoretically feasible solutions to hard computational problems. NPEPE is a software engine that simulates NPEP; it is deployed on Giraph, an ultra-scalable platform based on the Bulk Synchronous Parallel (BSP) programming model. Rather surprisingly, the BSP model and the underlying architecture of NPEP have many points in common, and these similarities are shared by all variants in the NEP family. We take advantage of these similarities and propose an extension of NPEPE (named gNEP) that enables it to simulate any variant of the NEP family. The extended gNEP framework presents a twofold contribution. Firstly, a flexible architecture whose software components can be extended to include other NEP models (including the seminal NEP model and new ones). Secondly, a component that translates input configuration files, representing a problem instance and an algorithm based on different NEP variants, into suitable input files for the gNEP framework. In this work, we simulate an NPEP-based solution to the “3-colorability” problem and compare the results of a specific experiment using the NPEPE engine and gNEP. Moreover, we present several experiments that provide a preliminary study of the scalability with which gNEP deploys and executes problem instances requiring more intensive computation.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
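The BSP similarity the authors exploit can be sketched as a minimal superstep loop in the style of Giraph. This is an illustrative Python model, not gNEP code; `bsp_run` and `compute` are invented names.

```python
def bsp_run(vertices, compute, max_supersteps=100):
    """Minimal Bulk Synchronous Parallel loop in the style of Giraph:
    in each superstep every active vertex consumes its inbox and may
    emit messages; a global barrier separates supersteps, and a halted
    vertex is reactivated by incoming messages."""
    inbox = {v: [] for v in vertices}
    active = set(vertices)
    step = 0
    for step in range(max_supersteps):
        if not active:
            break
        outbox = {v: [] for v in vertices}
        for v in list(active):
            halt, msgs = compute(v, inbox[v], step)  # msgs: [(dest, payload)]
            for dest, payload in msgs:
                outbox[dest].append(payload)
            if halt:
                active.discard(v)
        inbox = outbox                      # barrier: deliver for next step
        active |= {v for v, m in inbox.items() if m}
    return step
```

In a gNEP-style mapping, each NEP processor would play the role of a vertex and the evolutionary/communication steps would alternate across supersteps.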
{"title":"Node-Level Optimization of a 3D Block-Based Multiresolution Compressible Flow Solver with Emphasis on Performance Portability","authors":"N. Hoppe, S. Adami, N. Adams, I. Pasichnyk, M. Allalen","doi":"10.1109/HPCS48598.2019.9188088","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188088","abstract":"Despite the enormous increase in computational power in recent decades, the numerical study of complex flows remains challenging. State-of-the-art techniques for simulating hyperbolic flows with discontinuities rely on computationally demanding nonlinear schemes, such as Riemann solvers with weighted essentially non-oscillatory (WENO) stencils and characteristic decomposition. To handle this complexity, the numerical load can be reduced via a multiresolution (MR) algorithm with local time stepping (LTS) running on modern high-performance computing (HPC) systems. Ultimately, the main challenge lies in efficient utilization of the available HPC hardware. In this work, we evaluate the performance improvement for a Message Passing Interface (MPI)-parallelized MR solver using single instruction multiple data (SIMD) optimizations. We present straightforward code modifications that allow for auto-vectorization by the compiler while maintaining the modularity of the code at comparable performance. We demonstrate performance improvements for representative Euler flow examples on both Intel Haswell and Intel Knights Landing Xeon Phi (KNL) clusters. The tests show single-core speedups of 1.7 (1.9) and average speedups of 1.4 (1.6) on Haswell (KNL).","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
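The kind of code modification the abstract describes can be shown in miniature. A hedged Python sketch (the actual solver is C++ and relies on the compiler's auto-vectorizer; both function names are invented): the same 3-point stencil written first as a per-element loop, then over whole shifted ranges, the unit-stride, branch-free shape that SIMD units handle well.

```python
def smooth_scalar(u):
    """Per-element loop: the shape that, in C/C++, compilers often
    struggle to auto-vectorize when the body grows branches."""
    v = list(u)
    for i in range(1, len(u) - 1):
        v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
    return v

def smooth_strided(u):
    """Same stencil over whole shifted ranges (unit-stride, branch-free):
    the restructuring that lets a compiler emit SIMD code."""
    v = list(u)
    v[1:-1] = [0.25 * a + 0.5 * b + 0.25 * c
               for a, b, c in zip(u[:-2], u[1:-1], u[2:])]
    return v
```

Both produce identical results; only the second form exposes the independence of iterations explicitly.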
{"title":"An Incremental Parallel PGAS-based Tree Search Algorithm","authors":"T. Carneiro, N. Melab","doi":"10.1109/HPCS48598.2019.9188106","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188106","abstract":"In this work, we show that the Chapel high-productivity language is suitable for the design and implementation of all aspects involved in the conception of parallel tree search algorithms for solving combinatorial problems. Initially, it is possible to hand-optimize the data structures involved in the search process in a way equivalent to C. As a consequence, the single-threaded search in Chapel is on average only 7% slower than its counterpart written in C. Whereas programming a multicore tree search in Chapel is equivalent to C-OpenMP in terms of performance and programmability, its productivity-aware features for distributed programming stand out. It is possible to incrementally conceive a distributed tree search algorithm starting from its multicore counterpart by adding a few lines of code. The distributed implementation performs load balancing among different compute nodes and also exploits all CPU cores of the system. Chapel presents an interesting tradeoff between programmability and performance despite its high-level features. The distributed tree search in Chapel is on average 16% slower and reaches up to 80% of the scalability achieved by its C-MPI+OpenMP counterpart.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
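The hand-optimized, stack-based search the abstract alludes to looks roughly like this. A hypothetical Python sketch using N-queens as a stand-in combinatorial problem (the paper's Chapel/C sources and benchmarks are not reproduced here): an iterative depth-first search over an explicit stack of bitmask-encoded nodes, the kind of compact node representation that can be tuned equally well in Chapel or C.

```python
def count_nqueens(n):
    """Iterative depth-first tree search with an explicit stack instead
    of recursion. Each node is (row, cols, d1, d2), where the three
    bitmasks record attacked columns and diagonals."""
    solutions = 0
    full = (1 << n) - 1
    stack = [(0, 0, 0, 0)]
    while stack:
        row, cols, d1, d2 = stack.pop()
        if row == n:
            solutions += 1
            continue
        free = full & ~(cols | d1 | d2)
        while free:
            bit = free & -free          # lowest free column
            free ^= bit
            stack.append((row + 1, cols | bit,
                          ((d1 | bit) << 1) & full, (d2 | bit) >> 1))
    return solutions
```

A multicore or distributed version would partition the initial stack among workers; the paper's point is that this step costs only a few extra lines in Chapel.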
{"title":"rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks","authors":"Ali Mohammed, Aurélien Cavelan, F. Ciorba","doi":"10.1109/HPCS48598.2019.9188153","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188153","abstract":"Parallel scientific applications that execute on high performance computing (HPC) systems often contain large and computationally-intensive parallel loops. The independent loop iterations of such applications represent independent tasks. Dynamic load balancing (DLB) is used to achieve a balanced execution of such applications. However, most of the self-scheduling-based techniques typically used to achieve DLB are not robust against component (e.g., processor, network) failures or perturbations that arise on large HPC systems. The self-scheduling-based techniques that tolerate failures and/or perturbations rely on fault- and/or perturbation-detection mechanisms to trigger the rescheduling of tasks scheduled onto failed and/or perturbed components. This work proposes a novel robust dynamic load balancing (rDLB) approach for the robust self-scheduling of scientific applications with independent tasks on HPC systems under failures and/or perturbations. rDLB proactively reschedules already allocated tasks and requires no detection of failures or perturbations. Moreover, rDLB is integrated into an MPI-based DLB library. An analytical model of rDLB shows that, for a fixed problem size, the fault-tolerance overhead decreases linearly with the number of processors. The experimental evaluation shows that applications using rDLB tolerate up to P-1 worker processor failures (P is the number of processors allocated to the application) and that their performance in the presence of perturbations improves by a factor of 7 compared to the case without rDLB. Moreover, the robustness of applications against perturbations (i.e., flexibility) is boosted by a factor of 30 using rDLB compared to the case without rDLB.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
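The core rDLB mechanism, proactive re-execution without failure detection, can be modeled in a few lines. This is a toy Python simulation with invented names (`rdlb_schedule`, string worker ids); the real implementation is an MPI-based DLB library.

```python
def rdlb_schedule(tasks, workers, failed):
    """Toy model of robust self-scheduling: workers pull tasks from a
    shared queue; once the queue is empty, idle workers proactively
    re-execute tasks that were claimed but never finished, so no
    failure-detection mechanism is needed. Tolerates up to P-1 worker
    failures, mirroring the paper's claim."""
    assert set(workers) - set(failed), "need at least one live worker"
    finished = set()
    queue = list(tasks)
    while len(finished) < len(tasks):
        for w in workers:                   # one round-robin "time step"
            if queue:
                t = queue.pop(0)            # normal self-scheduling
            else:
                pending = [t for t in tasks if t not in finished]
                if not pending:
                    break
                t = pending[0]              # proactive re-execution
            if w not in failed:             # failed workers claim tasks
                finished.add(t)             # but never complete them
    return finished
```

Even when three of four workers silently fail after claiming tasks, the surviving worker completes every task without ever being told who failed.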
{"title":"Optimizations in CuSNP Simulator for Spiking Neural P Systems on CUDA GPUs","authors":"Blaine Corwyn D. Aboy, Edward James A. Bariring, J. P. Carandang, F. Cabarle, R. T. Cruz, H. Adorna, Miguel A. Martínez-del-Amor","doi":"10.1109/HPCS48598.2019.9188174","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188174","abstract":"Spiking Neural P systems (in short, SNP systems) are computing models based on living neurons. SNP systems are non-deterministic and parallel; hence, a parallel processor such as a graphics processing unit (in short, GPU) is a natural candidate for their simulation. Matrix representations and algorithms were previously developed for simulating SNP systems. In this work, our two results extend previous work on simulating SNP systems on the GPU: (a) the number of neurons the simulator can handle is now arbitrary; (b) SNP systems are now represented in a dense instead of a sparse way. The impact of these extensions on the simulator’s time and space requirements is analysed. As expected, SNP systems with more neurons need more simulation time, although simulator performance can scale (i.e. perform better) with larger GPUs. The dense representation helps in the simulation of larger systems.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
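The matrix representation the abstract builds on, from prior work on simulating SNP systems, computes each step as C(k+1) = C(k) + s(k)·M. A plain-Python sketch under that assumption (the actual simulator evaluates this product on CUDA GPUs; `snp_step` is an invented name):

```python
def snp_step(config, spiking, M):
    """One simulation step in the matrix representation: config counts
    spikes per neuron, spiking flags which rules fire this step, and
    row r of the (dense) transition matrix M records rule r's effect:
    negative entries for spikes consumed, positive for spikes sent."""
    return [c + sum(s * M[r][j] for r, s in enumerate(spiking))
            for j, c in enumerate(config)]
```

With two neurons and one rule per neuron (each consuming one spike and sending one to the other neuron), firing rule 0 from configuration [2, 0] yields [1, 1]. The dense layout stores every M[r][j], including zeros, which is exactly the time/space trade-off the paper analyses.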
{"title":"The Impact of the AC922 Architecture on Performance of Deep Neural Network Training","authors":"P. Rosciszewski, Michał Iwański, P. Czarnul","doi":"10.1109/HPCS48598.2019.9188164","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188164","abstract":"Practical deep learning applications require more and more computing power. New computing architectures emerge, specifically designed for artificial intelligence applications, including the IBM Power System AC922. In this paper we evaluate an AC922 (8335-GTG) server equipped with 4 NVIDIA Volta V100 GPUs on selected deep neural network training applications, including four convolutional and one recurrent model. We report performance results depending on batch sizes and GPU selection and compare them with the results from another contemporary workstation based on the same set of GPUs – NVIDIA® DGX Station™. The results show that the AC922 performs better in all tested configurations, achieving improvements up to 10.3%. Profiling indicates that the improvement is due to the efficient I/O pipeline. The performance differences depend on the specific model, rather than on the model class (RNN/CNN). Both systems offer good scalability up to 4 GPUs. In certain cases there is a significant difference in performance depending on exactly which GPUs are used for computations.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
{"title":"Queue Waiting Time Prediction for Large-scale High-performance Computing System","authors":"Ju-Won Park","doi":"10.1109/HPCS48598.2019.9188119","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188119","abstract":"Traditionally, high-performance computing (HPC) systems have been extensively utilized in many science fields, including big data analysis and machine learning. Such large-scale HPC resources typically use queue management systems that prefer the space-sharing method to allocate resources. With space sharing, when resources are insufficient, jobs naturally incur a queue waiting time until resources become available. When a prediction of this waiting time is available, scheduler performance can be improved. To this end, we propose a method for predicting queue waiting times based on job logs from an HPC system in actual operation. The technique comprises three phases. The first phase pre-processes the data into constant time intervals. In the second phase, major features are selected through a factor analysis and clustering is conducted based on the selected features. In the third phase, the waiting time of the next job is predicted using a sliding-window method over the clustered jobs.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
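The third phase's sliding-window prediction can be sketched as follows. This is a hypothetical Python illustration (class and parameter names are invented, and the cluster labels are assumed to come from the factor-analysis and clustering phases):

```python
from collections import defaultdict, deque

class WaitTimePredictor:
    """Predict the next job's queue waiting time as the mean over a
    sliding window of the most recently observed jobs in the same
    cluster; unseen clusters fall back to zero."""
    def __init__(self, window=5):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, cluster, wait_time):
        self.history[cluster].append(wait_time)  # old entries fall out

    def predict(self, cluster):
        h = self.history[cluster]
        return sum(h) / len(h) if h else 0.0
```

The window makes the estimate track recent system load rather than the full history, which is the usual motivation for sliding-window predictors on job logs.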
{"title":"Comparing Neuromorphic Systems by Solving Sudoku Problems","authors":"Christoph Ostrau, Christian Klarhorst, Michael Thies, U. Rückert","doi":"10.1109/HPCS48598.2019.9188207","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188207","abstract":"In the field of neuromorphic computing several hardware accelerators for spiking neural networks have been introduced, but few studies actually compare different systems. These comparative studies reveal difficulties in porting an existing network to a specific system and in predicting its performance indicators. Finding a common network architecture that is suited for all target platforms and at the same time yields decent results is a major challenge. In this contribution, we show that a winner-takes-all inspired network structure can be employed to solve Sudoku puzzles on three diverse hardware accelerators. By exploring several network implementations, we measured the number of solved puzzles in a set of 100 assorted Sudokus, as well as time and energy to solution. Concerning the last two indicators, our measurements indicate that it can be beneficial to port a network to an analogue hardware system.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
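The winner-takes-all structure can be illustrated with a single rate-based group. This is a loose Python analogue, an assumption rather than the paper's networks (which are spiking and run on neuromorphic hardware, with one WTA group per Sudoku cell so that exactly one digit wins):

```python
def wta(drive, inhibition=0.5, steps=50):
    """Crude rate-based winner-takes-all group: each unit is driven by
    its input and inhibited by the summed activity of the others; after
    a few steps only the most strongly driven unit stays active.
    Returns the index of the winning unit."""
    x = list(drive)
    for _ in range(steps):
        total = sum(x)
        x = [max(0.0, xi + di - inhibition * (total - xi))
             for xi, di in zip(x, drive)]
    return x.index(max(x))
```

In a Sudoku network, the drive for each digit would come from row/column/box constraints, implemented as inhibitory connections between groups.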
{"title":"DP2: A Highly Parallel Range Join for Genome Analysis on Distributed Computing Platform","authors":"Aman Sinha, B. Lai","doi":"10.1109/HPCS48598.2019.9188222","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188222","abstract":"The rapid growth of genome data and the intense computation required pose great challenges for downstream genome analytics. Efficient parallel processing and distributed computing are two effective schemes to address the analysis of big data. Range join is a widely used, effective, yet time-consuming operation that finds the overlap between two different sets of genome features. The currently widely adopted BEDTools [6] pipeline uses a single-node binary-tree approach, while the distributed GenAp scheme fails to exploit the massive parallel computation of modern throughput processors, such as GPUs (Graphics Processing Units). This paper proposes a novel Distributed Parallel P-ary search (DP2) that applies P-ary analysis to enable high parallelism at the algorithmic level and extensively utilizes multiple GPUs at the system and architecture level. Efficient computation allocation is implemented to leverage distributed computing on clusters. The proposed framework integrates well with the current BEDTools [6] pipeline, and achieves an average 25x speedup for the actual range-join operation compared with the binary-tree approach of GenAp, and a 13x end-to-end (total execution time) speedup in comparison to ADAM.","journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)"},"publicationDate":"2019-07-01"}
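The range-join semantics at the heart of DP2 can be shown with a serial stand-in. A hedged Python sketch: sorting one feature set by start and binary-searching the candidate window is a sequential analogue of the paper's P-ary search, which instead splits each lookup across P GPU threads; `range_join` and the half-open interval convention are assumptions for illustration.

```python
from bisect import bisect_left

def range_join(features_a, features_b):
    """Range join on genome features: report every pair of half-open
    intervals (a, b) that overlap. B is sorted by start; for each a we
    binary-search the upper bound of candidate starts, then filter by
    end coordinate."""
    b_sorted = sorted(features_b)               # sort by (start, end)
    starts = [s for s, _ in b_sorted]
    out = []
    for a_start, a_end in features_a:
        hi = bisect_left(starts, a_end)         # every b with start < a_end
        for s, e in b_sorted[:hi]:
            if e > a_start:                     # half-open overlap test
                out.append(((a_start, a_end), (s, e)))
    return out
```

The candidate scan keeps the sketch short; the paper's contribution is replacing the per-query search with a P-way partitioned search executed by many GPU threads and distributing queries across cluster nodes.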