{"title":"OpenCL Performance Prediction using Architecture-Independent Features","authors":"Beau Johnston, G. Falzon, Josh Milthorpe","doi":"10.1109/HPCS.2018.00095","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00095","url":null,"abstract":"OpenCL is an attractive programming model for heterogeneous high-performance computing systems, with wide support from hardware vendors and significant performance portability. To support efficient scheduling on HPC systems it is necessary to perform accurate performance predictions for OpenCL workloads on varied compute devices, which is challenging due to diverse computation, communication and memory access characteristics that result in varying performance between devices. The Architecture Independent Workload Characterization (AIWC) tool can be used to characterize OpenCL kernels according to a set of architecture-independent features. This work presents a methodology where AIWC features are used to form a model capable of predicting accelerator execution times. We used this methodology to predict execution times for a set of 37 computational kernels running on 15 different devices representing a broad range of CPU, GPU and MIC architectures. The predictions are highly accurate, differing from the measured experimental run-times by an average of only 1.2%, and correspond to actual execution time mispredictions of 9 ps to 1 sec depending on problem size. A previously unencountered code can be instrumented once and the AIWC metrics embedded in the kernel, to allow performance prediction across the full range of modelled devices. The results suggest that this methodology supports correct selection of the most appropriate device for a previously unencountered code, which is highly relevant to the HPC scheduling setting.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124099753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Snow Depth Retrieval Algorithm from Radar Backscattering Measurements at L- and X- Band Using Multi-Incidence Angles","authors":"F. Mazeh, Bilal Hammoud, H. Ayad, F. Ndagijimana, G. Faour, M. Fadlallah, J. Jomaah","doi":"10.1109/HPCS.2018.00021","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00021","url":null,"abstract":"The objective of this work is to develop an algorithm to estimate snow thickness over ground from backscattering measurements at L- and X-band (1.5 and 10 GHz) using multi-incidence angles (0°, 10° and 30°). The return signal from the medium is due to the ground roughness, the snow volume, and the noise from the radar system. Surface and volume scattering effects are therefore modeled from physics forward models, and noise effects are modeled by including white Gaussian noise in the simulation. The inversion algorithm involves two steps. The first is to estimate snow density using the L-band co-polarized backscattering coefficient. The second is to estimate the snow depth from X-band co-polarized backscattering coefficients using dual incidence angles. For a noise variance of 0.02, all retrieved values have an error of less than 2% for a snow depth range of [50-300] cm.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124123541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The NAS Benchmark Kernels for Single and Multi-Tenant Cloud Instances with LXC/KVM","authors":"Anderson M. Maliszewski, Dalvan Griebler, C. Schepke, Alexander Ditter, D. Fey, L. G. Fernandes","doi":"10.1109/HPCS.2018.00066","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00066","url":null,"abstract":"Private IaaS clouds are an attractive environment for scientific workloads and applications. They provide advantages such as almost instantaneous availability of high-performance computing in a single node as well as in compute clusters, and easy access for researchers and for users who do not have access to conventional supercomputers. Furthermore, a cloud infrastructure provides elasticity and scalability, allowing researchers to manage software dependencies on the system without relying on third parties. However, one of the biggest challenges is to avoid significant performance degradation when migrating these applications from physical nodes to a cloud environment. Moreover, multi-tenant cloud instances remain under-investigated. In this paper, our goal is to perform a comparative performance evaluation of scientific applications with single- and multi-tenancy cloud instances using KVM and LXC virtualization technologies under private cloud conditions. All analyses and evaluations were carried out based on NAS Benchmark kernels to simulate different types of workloads. We applied statistical significance tests to highlight the differences. The results have shown that applications running on LXC-based cloud instances outperform KVM-based cloud instances in 93.75% of the single-tenant experiments. For multi-tenant instances, LXC outperforms KVM in 45% of the results, where the performance differences were not as significant as expected.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115807160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Examining Energy Efficiency of Vectorization Techniques Using a Gaussian Elimination","authors":"T. Jakobs, G. Rünger","doi":"10.1109/HPCS.2018.00054","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00054","url":null,"abstract":"Modern computer environments are limited by energy and power constraints during the execution of programs. These limits can be due to power lines, budgeting, ecology, battery life or many other reasons. To stay within these limits, hardware and software development strive to reduce the energy and power consumption of the execution of algorithms. This article investigates the capabilities and limitations of vectorization with respect to energy efficiency. Vectorization is a technique that exploits on-chip SIMD execution to increase the performance of programs. The capability of vectorization to reduce the energy consumption of programs has yet to be shown. As an application, the well-known Gaussian elimination algorithm is vectorized and investigated. Several implementations, including automatic and manual vectorization techniques, have been developed and their execution time and energy consumption have been measured and compared.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132014650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Prefetching on In-order Processors","authors":"Cristobal Ortega, Victor Garcia, Miquel Moretó, Marc Casas, Roxana Rusitoru","doi":"10.1109/HPCS.2018.00061","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00061","url":null,"abstract":"Low-power processors have attracted attention due to their energy-efficiency. A large market, such as the mobile one, relies on these processors for this very reason. Even High Performance Computing (HPC) systems are starting to consider low-power processors as a way to achieve exascale performance within 20MW; however, they must meet the right performance/Watt balance. Current low-power processors contain in-order cores, which cannot re-order instructions to avoid data dependency-induced stalls. Whilst this is useful to reduce the chip's total power consumption, it brings several challenges. Due to the evolving performance gap between memory and processor, memory is a significant bottleneck. In-order cores cannot re-order instructions and are memory latency bound, something data prefetching can help alleviate by ensuring data is readily available. In this work, we perform an exhaustive analysis of available data prefetching techniques in state-of-the-art in-order cores. We analyze 5 static prefetchers and 2 dynamic aggressiveness and destination mechanisms applied to 3 data prefetchers on a set of HPC mini- and proxy-applications, whilst running on in-order processors. We show that next-line prefetching, when throttled, can achieve nearly top performance with reasonable bandwidth consumption, whilst neighbor prefetchers were found to be best overall.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"48 33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132332905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis","authors":"K. Ibrahim, Samuel Williams, L. Oliker","doi":"10.1109/HPCS.2018.00065","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00065","url":null,"abstract":"The end of Dennard scaling signaled a shift in HPC supercomputer architectures from systems built from single-core processor architectures to systems built from multicore and eventually manycore architectures. This transition substantially complicated performance optimization and analysis as new programming models were created, new scaling methodologies deployed, and on-chip contention became a bottleneck to performance. Existing distributed memory performance models like logP and logGP were unable to capture this contention. The Roofline model was created to address this contention and its interplay with locality. However, to date, the Roofline model has focused on full-node concurrency. In this paper, we extend the Roofline model to capture the effects of concurrency on data locality and on-chip contention. We demonstrate the value of this new technique by evaluating the NAS parallel benchmarks on both multicore and manycore architectures under both strong- and weak-scaling regimes. In order to quantify the interplay between programming model and locality, we evaluate scaling under both the OpenMP and flat MPI programming models.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133622198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Probabilistic Networks of Polarized Evolutionary Processors","authors":"F. Arroyo, Sandra Gómez Canaval, V. Mitrana, M. Păun, José-Ramón Sánchez-Couso","doi":"10.1109/HPCS.2018.00123","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00123","url":null,"abstract":"The aim of this paper is to discuss two possible ways of introducing features based on probabilistic concepts and methods in networks of polarized evolutionary processors (NPEP). We associate probabilities with the rules in every node; together with the communication protocol, which is based on the compatibility between the polarization of each node and the data navigating through the network, this might facilitate the study of biological phenomena as well as software simulations or hardware implementations. The probabilities associated with rules may be defined a priori and fixed, or computed dynamically. Probabilities also appear when communicating data between nodes; these probabilities may be statically or dynamically defined. This note also proposes studying the impact of these characteristics to see how the new features reduce the gap between the formal model and its practical applicability. Introducing probabilities in NPEP is aimed at decreasing the exponential expansion of the number of strings which appear in the computations used to solve NP-problems in polynomial time. This decrease in the exponential expansion is achieved at the cost of certainty: the final result is reached only with some error probability.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134464102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interoperability Based Dynamic Data Mediation using Adaptive Multi-Agent Systems for Co-Simulation","authors":"Yassine Motie, Elhadi Belghache, A. Nketsa, J. Georgé","doi":"10.1109/HPCS.2018.00050","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00050","url":null,"abstract":"A co-simulation is the coupling of several simulation tools where each one handles part of a modular problem, allowing each designer to interact with the complex system while retaining their business expertise and continuing to use their own digital tools. For this co-simulation to work, the ability to exchange data between the tools in meaningful ways, known as interoperability, is required. This paper describes the design of such interoperability based on the FMI (Functional Mock-up Interface) standard and a dynamic data mediation using adaptive multi-agent systems for a co-simulation. It is currently being applied in neOCampus, the ambient campus of the University of Toulouse III - Paul Sabatier.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115410690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convolutional Neural Networks on Embedded Automotive Platforms: A Qualitative Comparison","authors":"Gianluca Brilli, P. Burgio, M. Bertogna","doi":"10.1109/HPCS.2018.00084","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00084","url":null,"abstract":"In the last decade, the rise of power-efficient, heterogeneous embedded platforms paved the way to the effective adoption of neural networks in several application domains. Especially, many-core accelerators (e.g., GPUs and FPGAs) are used to run Convolutional Neural Networks, e.g., in autonomous vehicles and Industry 4.0. At the same time, advanced research on neural networks is producing interesting results in computer vision applications, and NN packages for computer vision object detection and categorization such as YOLO, GoogleNet and AlexNet have reached an unprecedented level of accuracy and performance. With this work, we aim at validating the effectiveness and efficiency of the most recent networks on state-of-the-art embedded platforms, with commercial off-the-shelf Systems-on-Chip such as the NVIDIA Tegra X2 and Xilinx Ultrascale+. In our vision, this work will support the choice of the most appropriate CNN package and computing system, and at the same time tries to \"make some order\" in the field.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115674550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static Loop Parallelization Decision Using Template Metaprogramming","authors":"Alexis Pereda, D. Hill, C. Mazel, Bruno Bachelet","doi":"10.1109/HPCS.2018.00159","DOIUrl":"https://doi.org/10.1109/HPCS.2018.00159","url":null,"abstract":"This article proposes to use C++ template metaprogramming techniques to decide at compile-time which parts of a code sequence in a loop can be parallelized. The approach focuses on characterizing the way a variable is accessed in a loop (reading or writing): first to decide how the loop should be split to enable the parallelization analysis on each part, and then to decide whether the iterations inside each loop are independent so that they can be run in parallel. The conditions that enable the parallelization of a loop are first explained to justify the proposed decision algorithm. Then, a C++ library-based solution is presented that uses expression templates to gather the information necessary for the parallelization decision of a loop, and metaprograms to decide whether to parallelize the loop and generate parallel code.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124344900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}