{"title":"Configuring Graph Traversal Applications for GPUs: Analysis of Implementation Strategies and their Correlation with Graph Characteristics","authors":"F. Busato, N. Bombieri","doi":"10.1109/HPCS48598.2019.9188204","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188204","url":null,"abstract":"Implementing a graph traversal (GT) algorithm for GPUs is a very challenging task. It is a core primitive for many graph analysis applications, and its efficiency strongly impacts the overall application performance. Different strategies have been proposed to implement the GT algorithm by exploiting the GPU characteristics. Nevertheless, the efficiency of each of them strongly depends on the graph characteristics. This paper presents an analysis of the most important features of the parallel GT algorithm, which include frontier queue management, load balancing, duplicate removal, and synchronization during graph traversal iterations. It presents different techniques to implement each of these features on GPUs and compares their performance when applied to a very large and heterogeneous set of graphs. The results allow identifying, for each feature, the implementation technique that best addresses the graph characteristics. 
The paper finally shows how such a configuration analysis and setting allow traversing graphs with a throughput of up to 14,000 MTEPS on a single GPU device, with speedups ranging from 1.2x to 18.5x with respect to the best state-of-the-art parallel GT applications for GPUs.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"73 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126108310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PCS: A Productive Computational Science Platform","authors":"David Ojika, A. Gordon-Ross, H. Lam, Shinjae Yoo, Younggang Cui, Zhihua Dong, K. V. Dam, Seyong Lee, T. Kurth","doi":"10.1109/HPCS48598.2019.9188108","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188108","url":null,"abstract":"As modern supercomputers continue to be increasingly heterogeneous with diverse computational accelerators (graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc.), software becomes a critical design aspect. Exploiting this new computational power requires increased software design time and effort to make valuable scientific discoveries in the face of the complicated programming environments introduced by these accelerators. To address these challenges, we propose unifying multiple programming models into a single programming environment to facilitate large-scale, accelerator-aware, heterogeneous computing for next-generation scientific applications. This paper presents PCS, a productive computational science platform for cluster-scale heterogeneous computing. Focusing on FPGAs, we describe the key concepts of the PCS platform, differentiate PCS from the current state of the art, and propose a new multi-FPGA architecture for graph-centric workloads (e.g., deep learning), 
with discussions on ongoing work.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127123818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Reliability of Compute Environments on Amazon EC2 Spot Instances","authors":"Altino M. Sampaio, Jorge G. Barbosa","doi":"10.1109/HPCS48598.2019.9188116","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188116","url":null,"abstract":"Amazon Elastic Compute Cloud (EC2) gives access to resources in the form of virtual servers, also known as instances. EC2 Spot Instances (SIs) offer spare compute capacity at steep discounts compared to reliable, fixed-price on-demand instances. The drawback, however, is that the waiting time until requested spots become fulfilled can be incredibly high. In this paper, we propose a container migration-based solution to enhance the reliability of virtual cluster computing environments built on top of non-reserved EC2 pricing model instances. We compare the performance of our algorithm by executing different resource provisioning plans for running real-life workflow applications, constrained by user-defined deadline and budget Quality of Service (QoS) parameters. The results show that our solution successfully completes almost 98% of workflow applications and more than 99% of workflow tasks for on-demand- and spot block-based virtual compute environments. 
For SI-based virtual compute environments, our solution achieves similar results, completing more than 98% of workflow applications and over 99% of workflow tasks, even in a worst-case scenario.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127533251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Parameter Performance Modeling using Symbolic Regression","authors":"Sai P. Chenna, G. Stitt, H. Lam","doi":"10.1109/HPCS48598.2019.9188202","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188202","url":null,"abstract":"Performance modeling is becoming critically important due to the need for design-space exploration on emerging exascale architectures. Existing modeling and prediction approaches are either restricted by a limited number of parameters, or provide extreme tradeoffs between simulation performance and modeling accuracy that are not ideal for exascale simulations. At one extreme are low-level discrete-event simulators, which provide high accuracy, but are prohibitively slow for large-scale simulations. At the opposite extreme are abstract modeling approaches that are sufficiently fast, but tend to support a limited number of parameters, while also lacking accuracy due to machine-specific behaviors that deviate from anticipated models. In this paper, we improve upon existing abstract modeling approaches by leveraging symbolic regression to automatically discover an underlying multi-parameter model of the system and application that captures difficult-to-understand behaviors. 
For three High Performance Computing (HPC) applications running on Vulcan, we show that symbolic regression provided modeling accuracies that were 3.5x, 4.6x, and 6.2x better than analytical models developed using linear regression.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127290840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feedback-Based Resource Allocation for Batch Scheduling of Scientific Workflows","authors":"Carl Witt, Dennis Wagner, U. Leser","doi":"10.1109/HPCS48598.2019.9188055","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188055","url":null,"abstract":"A scientific workflow is a set of interdependent compute tasks orchestrating large-scale data analyses or in-silico experiments. Workflows often comprise thousands of tasks with heterogeneous resource requirements that need to be executed on distributed resources. Many workflow engines solve parallelization by submitting tasks to a batch scheduling system, which requires resource usage estimates that have to be provided by users. We investigate the possibility of improving upon inaccurate user estimates by incorporating an online feedback loop between workflow scheduling, resource usage prediction, and measurement. Our approach can learn resource usage of arbitrary type; in this paper, we demonstrate its effectiveness by predicting the peak memory usage of tasks, as it is an especially sensitive resource type that leads to task termination if underestimated and to decreased throughput if overestimated. We compare online versions of standard machine learning models for peak memory usage prediction and analyze their interactions with different workflow scheduling strategies. 
By means of extensive simulation experiments, we found that the proposed feedback mechanism improves resource utilization and execution times compared to typical user estimates.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125253992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Prediction for Power-Capped Applications based on Machine Learning Algorithms","authors":"Bo Wang, Jannis Klinkenberg, D. Ellsworth, C. Terboven, Matthias S. Müller","doi":"10.1109/HPCS48598.2019.9188144","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188144","url":null,"abstract":"Growing high performance computing (HPC) clusters are encountering a power wall due to limitations in the surrounding infrastructure. Maximizing a cluster’s performance in the presence of a limited power budget is an open problem with high relevance and requires a deep understanding of application performance and power draw. Hardware components with the same technical specification have distinct power efficiencies, and applications running on those components have diverse power profiles. Enforcing a power limit on individual components changes their performance characteristics. In this work, we investigate and quantify the power and performance characteristics of various applications. Further, we present a systematic methodology to collect corresponding monitoring data and apply machine learning (ML) techniques to predict the performance under particular power caps. The observed prediction error is under 3% in most cases, which is in the same range as the performance variation of application runs without a power cap.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114810503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy performances of a routing protocol based on fuzzy logic approach in an underwater wireless sensor networks","authors":"Hajar Bennouri, A. Berqia","doi":"10.1109/HPCS48598.2019.9188061","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188061","url":null,"abstract":"Underwater Wireless Sensor Networks (UWSNs) are one of the most promising topics in wireless communications. High transmission power and lengthy data packet transmissions consume a significant amount of energy due to the difficult communication conditions in this environment. Several methods have been used to solve or reduce this problem. In this article, we study the impact of using a fuzzy logic approach in a routing protocol to evaluate the energy performance of an underwater wireless sensor network. We implement the FLOVP (Fuzzy Logic Optimized Vector Protocol) routing protocol, an improved version of the Vector-Based Forwarding (VBF) routing protocol, in the Aqua-Sim simulator for underwater wireless sensor networks based on NS-2, and compare its energy consumption with that of the original VBF routing protocol.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"0 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114917029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost Reduction Bounds of Proactive Management Based on Request Prediction","authors":"R. Milocco, P. Minet, É. Renault, S. Boumerdassi","doi":"10.1109/HPCS48598.2019.9188199","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188199","url":null,"abstract":"Data Centers (DCs) need to manage their servers periodically to meet user demand efficiently. Since the cost of the energy employed to serve the user demand is lower when DC settings (e.g. the number of active servers) are made a priori (proactively), there is great interest in studying different proactive strategies based on predictions of requests. The amount of savings in energy cost that can be achieved depends not only on the selected proactive strategy but also on the statistics of the demand and the predictors used. Despite its importance, due to the complexity of the problem it is difficult to find studies that quantify the savings that can be obtained. The main contribution of this paper is to propose a generic methodology to quantify the possible cost reduction using proactive management based on predictions. Thus, using this method together with past data, it is possible to quantify the efficiency of different predictors as well as optimize proactive strategies. In this paper, the cost reduction is evaluated using both ARMA (Auto Regressive Moving Average) and LV (Last Value) predictors. 
We then apply this methodology to the Google dataset collected over a period of 29 days to evaluate the benefit that can be obtained with those two predictors in the considered DC.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127661621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximating Memory-bound Applications on Mobile GPUs","authors":"Daniel Maier, Nadjib Mammeri, Biagio Cosenza, B. Juurlink","doi":"10.1109/HPCS48598.2019.9188051","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188051","url":null,"abstract":"Approximate computing techniques are often used to improve the performance of applications that can tolerate some amount of impurity in the calculations or data. In the context of embedded and mobile systems, a broad number of applications have exploited approximation techniques to improve performance and overcome the limited capabilities of the hardware. On such systems, even small performance improvements can be sufficient to meet scheduling requirements such as hard real-time deadlines. We study the approximation of memory-bound applications on mobile GPUs using kernel perforation, an approximation technique that exploits the availability of fast GPU local memory to provide high performance with more accurate results. Using this approximation technique, we approximated six applications and evaluated them on two mobile GPU architectures with very different memory layouts: a Qualcomm Adreno 506 and an ARM Mali T860 MP2. Results show that, even when the local memory is not mapped to dedicated fast memory in hardware, kernel perforation is still capable of a 1.25x speedup because of improved memory layout and caching effects. 
Mobile GPUs with local memory show a speedup of up to 1.38x.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"151 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132904254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thin-Threads: An Approach for History-Based Monte Carlo on GPUs","authors":"R. Bleile, P. Brantley, D. Richards, S. Dawson, M. S. McKinley, M. O’Brien, H. Childs","doi":"10.1109/HPCS48598.2019.9188080","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188080","url":null,"abstract":"Graphics processing units (GPUs) have become a core technology for modern supercomputers. Applications that once ran on supercomputers are being forced to make significant changes to their designs to utilize these new machines. This paper introduces the concept of Thin-Threads as a method for history-based Monte Carlo transport applications on GPUs. The key principles behind Thin-Threads are light memory usage and communication, and managing data race issues via atomics. We show that we can achieve a 10x speedup when moving from the traditional method to Thin-Threads on GPUs. Additionally, we demonstrate the viability of the Thin-Threads model at scale for GPU and CPU platforms.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127495885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}