{"title":"Dynamic Reconfiguration of Data Parallel Programs","authors":"Vinícius Dias, Wagner Meira Jr, D. Guedes","doi":"10.1109/SBAC-PAD.2016.32","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.32","url":null,"abstract":"Given the large amount of data from different sources that have become available to researchers in multiple fields, Data Science has emerged as a new paradigm for exploring and getting value from that data. In that context, new parallel processing environments with abstract programming interfaces, like Spark, were proposed to try to simplify the development of distributed programs. Although such solutions have become widely used, achieving the best performance with them is still not always straight-forward, despite the multiple run-time strategies they use. In this work we analyze some of the causes of performance degradation in such systems and, based on that analysis, we propose a tool to improve performance by dynamically adjusting data partitioning and parallelism degree in recurrent applications based on previous executions. Our results applying that methodology show consistent reductions in execution time for the applications considered, with gains of up to 50%.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"60 20","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134224791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallelization of a Simulated Annealing Approach for 0-1 Multidimensional Knapsack Problem Using GPGPU","authors":"Bianca de Almeida Dantas, E. Cáceres","doi":"10.1109/SBAC-PAD.2016.25","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.25","url":null,"abstract":"In the last decades, with the advances in multicore/manycore architectures, it became interesting to design algorithms which can take advantage of such architectures aiming the achievement of more efficient algorithms to solve difficult problems. A large number of real-world problems solved with the help of computer programs demand faster or better quality solutions. Some of these problems can be modeled as classical theoretical problems, such as the 0-1 multidimensional knapsack problem (0-1 MKP), known to belong to the NP-hard class of problems, for which we can not obtain an exact solution efficiently. This motivates the search for alternative strategies which can achieve good quality approximate solutions, like metaheuristics, and also different ways to enable their execution in reduced times, such as parallel algorithms which explore multicore/manycore architectures. 
In this work we describe a parallelization of a simulated annealing approach using GPGPU to solve 0-1 MKP and compare our results to previous works in order to prove the viability of its use.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134330441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Planning Your SQL-on-Hadoop Deployment Using a Low-Cost Simulation-Based Approach","authors":"Jun Liu, Bianny Bian, Samantika Sury","doi":"10.1109/SBAC-PAD.2016.31","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.31","url":null,"abstract":"The term \"SQL-on-Hadoop\" has recently gained significant traction [19]. Impala represents a new emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Impala was designed to close the gap of near real time data analytics on Hadoop stack and it has shown itself to be significantly more efficient than other SQL-on-Hadoop solutions [13]. However, it is not a trivial task to leverage Impala for handling queries with different business demands [12]. Improperly deploying an Impala cluster may not give you the expected performance you want. In this paper, we propose a novel Impala simulation framework to help IT professionals to understand its performance behavior. This would simplify the deployment planning work required to enable big data analytics on SQL-on-Hadoop systems. An Impala simulator models the behavior of a complete software stack and simulates the activities of cluster components such as storage, network, processors and memory. Moreover, the accuracy of the simulation remains high in response to both software configuration and hardware changes, and it reflects the expected scaling trend with low cost overhead and fast simulation speed. The Impala simulator has been validated against various S/W and H/W configurations, using the well-known TPC-DS benchmark [15], and the simulation results are valid and expected. 
A use case is provided to show how one would use the simulator to solve their performance and deployment issues.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129841443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Study of Power-Performance Modeling Using a Domain-Specific Language","authors":"M. Umar, J. Meredith, J. Vetter, K. Cameron","doi":"10.1109/SBAC-PAD.2016.19","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.19","url":null,"abstract":"Energy use is now a first-class design constraint in high-performance systems and applications. Improving our understanding of application energy consumption in diverse, heterogeneous systems will be essential to efficient operation. For example, power limits in large scale parallel and distributed systems will require optimizing performance under energy constraints. However, with increased levels of parallelism, complex memory hierarchies, hardware heterogeneity, and diverse programming models and interfaces, improving performance and energy efficiency simultaneously is exceedingly difficult. Our thesis is that estimating energy use, either a priori or as soon as possible at runtime, will be essential to future systems. Such estimates must adapt with changes in applications across hardware configurations. Existing approaches offer insight and detail, but typically are too cumbersome to enable adaptation at runtime or lack portability or accuracy. To overcome these limitations, we propose two energy estimation techniques which use the Aspen domain specific language for performance modeling: ACEE (Algorithmic and Categorical Energy Estimation), a combination of analytical and empirical modeling techniques embedded in a runtime framework that leverages Aspen, and AEEM (Aspen's Embedded Energy Modeling), a system level coarse-grained energy estimation technique that uses performance modeling from Aspen to generate energy estimations at runtime. 
This paper presents methodology of the models and examines their accuracy as well as their advantages and challenges in several use cases.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127243871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"REPP-H: Runtime Estimation of Power and Performance on Heterogeneous Data Centers","authors":"Rajiv Nishtala, X. Martorell, V. Petrucci, D. Mossé","doi":"10.1109/SBAC-PAD.2016.27","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.27","url":null,"abstract":"One of the main challenges in data center systems is operating under certain Quality of Service (QoS) while minimizing power consumption. Increasingly, data centers are adopting heterogeneous server architectures with different power-performance trade-offs. This requires careful understanding of the application behavior across multiple architectures at runtime so as to enable meeting specified power and performance requirements. In this work, we present and evaluate REPP-H (Runtime Estimation of Performance and Power on Heterogeneous data centers). REPP-H leverages hardware performance counters available on all major server architectures to ensure a highly responsive power capping mechanism and delivering a minimum performance in a single step. We experimentally show that REPP-H can successfully estimate power and performance of several single-threaded and multiprogrammed workloads. 
The average errors on ARM, AMD and Intel architectures are, respectively, 7.1%, 9.0%, 7.1% when predicting performance, and 6.0%, 6.5%, 8.1% when predicting power on those heterogeneous servers.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115123348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speeding Up Stencil Computations with Kernel Convolution","authors":"G. C. Januario, Bryan S. Rosenburg, Yoonho Park, M. Perrone, J. Moreira, T. Carvalho","doi":"10.1109/SBAC-PAD.2016.18","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.18","url":null,"abstract":"A technique to speed up stencil computation is introduced. Computation and data reuse schemes are developed for its application to 1- and 3-dimensional stencils. The approach traverses the data domain fewer times than a state-of-the-art, straightforward iterative stencil implementation would. Performance results are shown for a variety of platforms, exemplifying how it can be straightforwardly applied with existing techniques and frameworks. The technique, named Aggregate Stencil-Loop Iteration (ASLI), works by applying a stencil obtained by the original stencil operator convolved with itself one or more times. This more complex operator creates new opportunities for in-register data reuse and increases the FLOPs-to-load ratio. The total number of FLOPs decreases for 1D but increases for 2D and 3D star-shaped stencils. In both scenarios, speed-up relative to the state-of-the-art is achieved. ASLI is relatively easy to implement and works synergistically with existing methods to optimize stencil computations.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125822008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Value Reuse Potential in ARM Architectures","authors":"R. C. D. Moura, Giovane O. Torres, M. Pilla, L. Pilla, Amarildo T. da Costa, F. França","doi":"10.1109/SBAC-PAD.2016.30","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.30","url":null,"abstract":"Code execution in modern superscalar processors is inherently redundant. Many instructions execute repeatedly with the same inputs, producing the same outputs, thus wasting resources in the process. Value reuse techniques memorize previous executions of instructions, blocks or traces which may be reused if they appear again with the same input contexts. Although trace reuse techniques show great potential for both performance and energy consumption improvement, they have not been studied yet in one of the most widely available computer architectures - the ARM architecture. In this paper, the main issues with reusing traces in instruction sets with conditional execution are revisited. Afterwards, the reuse potential in the benchmark suite MiBench is analyzed varying (i) how traces are generated, and (ii) the size of reuse tables. Our results show that a memoization table of 32 KiB allows to reuse 18.36% of the total instructions on average.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"318 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132319623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling Matters: Area-Oriented Heuristic for Resource Management","authors":"Jeremy Benson, Trilce Estrada, A. Rosenberg, M. Taufer","doi":"10.1109/SBAC-PAD.2016.35","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.35","url":null,"abstract":"Parallel and distributed systems that provide compute resources on demand are convenient, cost-effective, and becoming increasingly common. Boosting workload performance in such environments through scheduling has been of great interest, as users and providers aim to increase parallelism and reduce execution times. For modern data centers, leaving a smaller carbon footprint while maintaining high performance and low cost is becoming the next big challenge. With this in mind, we analyze the relative impacts on resource utilization of three well-motivated platform-oblivious scheduling heuristics. We simulate over 50,000 DAG workflow executions and measure performance, cost, and resource utilization under the three scheduling heuristics. Our results provide insights to better enable high-performance execution of workflows and advanced capacity planning while increasing resource utilization and reducing costs.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125751298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces","authors":"Travis Johnston, Connor Zanin, M. Taufer","doi":"10.1109/SBAC-PAD.2016.12","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.12","url":null,"abstract":"The number and diversity of tunable parameters in applications makes predicting settings that achieve optimal performance challenging. Complicating matters is the fact that resources are increasingly shared among computational tasks (for example, in cloud environments). Choosing any setting that yields near-optimal performance runs the risk of overusing shared resources. Building accurate models that capture the complicated interplay of parameters is crucial in order to maximize performance with minimal resource impact. Traditional techniques tend to fall short when modeling performance. One reason is that performance surfaces are often irregular but most traditional techniques are designed to produce smooth models. In this paper we introduce a hybrid modeling technique that combines the strengths of surrogate-based modeling (SBM) and k nearest-neighbor regression (kNN) into a single method called HYPPO. The hybrid method is a piecewise polynomial model composed of many small, local models. 
We demonstrate that HYPPO significantly improves overall prediction accuracy compared with SBM and kNN.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115192340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Insertion of Copy Annotation in Data-Parallel Programs","authors":"G. Mendonca, B. Guimarães, P. Alves, Fernando Magno Quintão Pereira, M. Pereira, G. Araújo","doi":"10.1109/SBAC-PAD.2016.13","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2016.13","url":null,"abstract":"Directive-based programming models, such as OpenACC and OpenMP arise today as promising techniques to support the development of parallel applications. These systems allow developers to convert a sequential program into a parallel one with minimum human intervention. However, inserting pragmas into production code is a difficult and error-prone task, often requiring familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This paper provides one fundamental component in the solution of this problem. We introduce a static program analysis that infers the bounds of memory regions referenced in source code. Such bounds allow us to automatically insert data-transfer primitives, which are needed when the parallelized code is meant to be executed in an accelerator device, such as a GPU. To validate our ideas, we have applied them onto Polybench, using two different architectures: Nvidia and Qualcomm-based. We have successfully analyzed 98% of all the memory accesses in Polybench. 
This result has enabled us to insert automatic annotations into those benchmarks leading to speedups of over 100x.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114802426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}