{"title":"Optimizing MapReduce for GPUs with effective shared memory usage","authors":"Linchuan Chen, G. Agrawal","doi":"10.1145/2287076.2287109","DOIUrl":"https://doi.org/10.1145/2287076.2287109","url":null,"abstract":"Accelerators and heterogeneous architectures in general, and GPUs in particular, have recently emerged as major players in high performance computing. For many classes of applications, MapReduce has emerged as the framework for easing parallel programming and improving programmer productivity. There have already been several efforts on implementing MapReduce on GPUs.\u0000 In this paper, we propose a new implementation of MapReduce for GPUs, which is very effective in utilizing shared memory, a small programmable cache on modern GPUs. The main idea is to use a reduction-based method to execute a MapReduce application. The reduction-based method allows us to carry out reductions in shared memory. To support a general and efficient implementation, we support the following features: a memory hierarchy for maintaining the reduction object, a multi-group scheme in shared memory to trade-off space requirements and locking overheads, a general and efficient data structure for the reduction object, and an efficient swapping mechanism.\u0000 We have evaluated our framework with seven commonly used MapReduce applications and compared it with the sequential implementations, MapCG, a recent MapReduce implementation on GPUs, and Ji et al.'s work, a recent MapReduce implementation that utilizes shared memory in a different way. The main observations from our experimental results are as follows. For four of the seven applications that can be considered as reduction-intensive applications, our framework has a speedup of between 5 and 200 over MapCG (for large datasets). Similarly, we achieved a speedup of between 2 and 60 over Ji et al.'s work.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130469380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging renewable energy in data centers: present and future","authors":"R. Bianchini","doi":"10.1145/2287076.2287101","DOIUrl":"https://doi.org/10.1145/2287076.2287101","url":null,"abstract":"Interest has been growing in powering data centers (at least partially) with renewable or \"green\" sources of energy, such as solar or wind. However, it is challenging to use these sources because, unlike the \"brown\" (carbon-intensive) energy drawn from the electrical grid, they are not always available. In this keynote talk, I will first discuss the tradeoffs involved in leveraging green energy today and the prospects for the future. I will then discuss the main research challenges and questions involved in managing the use of green energy in data centers. Next, I will describe some of the software and hardware that researchers are building to explore these challenges and questions. Specifically, I will overview systems that match a data center's computational workload to the green energy supply. I will also describe Parasol, the solar-powered micro-data center we have just built at Rutgers University. Finally, I will discuss some potential avenues for future research on this topic.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114249151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving application-centric performance targets via consolidation on multicores: myth or reality?","authors":"L. Chen, Danilo Ansaloni, E. Smirni, A. Yokokawa, Walter Binder","doi":"10.1145/2287076.2287083","DOIUrl":"https://doi.org/10.1145/2287076.2287083","url":null,"abstract":"Consolidation of multiple applications with diverse and changing resource requirements is common in multicore systems as hardware resources are abundant and opportunities for better system usage are plenty. Can we maximize resource usage in such a system while respecting individual application performance targets or is it an oxymoron to simultaneously meet such conflicting measures? In this work we provide a solution to the above difficult problem by constructing a queueing-theory based tool that we use to accurately predict application scalability on multicores and that can also provide the optimal consolidation suggestions to maximize system resource usage while meeting simultaneously application performance targets. The proposed methodology is light-weight and relies on capturing application resource demands using standard tools, via nonintrusive low-level measurements. We evaluate our approach on an IBM Power7 system using the DaCapo and SPECjvm benchmark suites where each benchmark exhibits different patterns of parallelism. From 900 different consolidations of application instances, our tool accurately predicts the average iteration time of allocated applications with an average error below 10%.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121207656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAM: a topology aware minimum cost flow based resource manager for MapReduce applications in the cloud","authors":"Min Li, Dinesh Subhraveti, A. Butt, Aleksandr Khasymski, P. Sarkar","doi":"10.1145/2287076.2287110","DOIUrl":"https://doi.org/10.1145/2287076.2287110","url":null,"abstract":"MapReduce has emerged as a prevailing distributed computation paradigm for enterprise and large-scale data-intensive computing. The model is also increasingly used in the massively-parallel cloud environment, where MapReduce jobs are run on a set of virtual machines (VMs) on pay-as-needed basis. However, MapReduce jobs suffer from performance degradation when running in the cloud due to inefficient resource allocation. In particular, the MapReduce model is designed for and leverages information from the native clusters to operate efficiently, whereas the cloud presents a virtual cluster topology overlying or hiding actual network information. This results in two placement anomalies: loss of data locality and loss of job locality, where jobs are placed physically away from their data or other associated jobs, adversely affecting their performance.\u0000 In this paper we propose, CAM, a cloud platform that provides an innovative resource scheduler particularly designed for hosting MapReduce applications in the cloud. CAM reconciles both data and VM resource allocation with a variety of competing constraints, such as storage utilization, changing CPU load and network link capacities. CAM uses a flow-network-based algorithm that is able to optimize MapReduce performance under the specified constraints -- not only by initial placement, but by readjusting through VM and data migration as well. Additionally, our platform exposes, otherwise hidden, lower-level topology information to the MapReduce job scheduler so that it makes optimal task assignments. Evaluation of CAM using both micro-benchmarks and simulations on a 23 VM cluster shows that compared to a state-of-the-art resource allocator, our system reduces network traffic and average MapReduce job execution time by a factor of 3 and 8.6, respectively.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121278203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring cross-layer power management for PGAS applications on the SCC platform","authors":"Marc Gamell, I. Rodero, M. Parashar, R. Muralidhar","doi":"10.1145/2287076.2287113","DOIUrl":"https://doi.org/10.1145/2287076.2287113","url":null,"abstract":"High-performance parallel computing architectures are increasingly based on multi-core processors. While current commercially available processors are at 8 and 16 cores, technological and power constraints are limiting the performance growth of the cores and are resulting in architectures with much higher core counts, such as the experimental many-core Intel Single-chip Cloud Computer (SCC) platform. These trends are presenting new sets of challenges to HPC applications including programming complexity and the need for extreme energy efficiency.\u0000 In this paper, we first investigate the power behavior of scientific Partitioned Global Address Space (PGAS) application kernels on the SCC platform, and explore opportunities and challenges for power management within the PGAS framework. Results obtained via empirical evaluation of Unified Parallel C (UPC) applications on the SCC platform under different constraints, show that, for specific operations, the potential for energy savings in PGAS is large; and power/performance trade-offs can be effectively managed using a cross-layer approach. We investigate cross-layer power management using PGAS language extensions and runtime mechanisms that manipulate power/performance tradeoffs. Specifically, we present the design, implementation and evaluation of such a middleware for application-aware cross-layer power management of UPC applications on the SCC platform. Finally, based on our observations, we provide a set of insights that can be used to support similar power management for PGAS applications on other many-core platforms.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114850245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpeQuloS: a QoS service for BoT applications using best effort distributed computing infrastructures","authors":"Simon Delamare, G. Fedak, Derrick Kondo, O. Lodygensky","doi":"10.1145/2287076.2287106","DOIUrl":"https://doi.org/10.1145/2287076.2287106","url":null,"abstract":"Exploitation of Best Effort Distributed Computing Infrastructures (BE-DCIs) allow operators to maximize the utilization of the infrastructures, and users to access the unused resources at relatively low cost. Because providers do not guarantee that the computing resources remain available to the user during the entire execution of their applications, they offer a diminished Quality of Service (QoS) compared to traditional infrastructures. Profiling the execution of Bag-of-Tasks (BoT) applications on several kinds of BE-DCIs demonstrates that their task completion rate drops near the end of the execution.\u0000 In this paper, we present the SpeQuloS framework which enhances the QoS of BoT applications executed on BE-DCIs by reducing the execution time, improving its stability, and reporting to users a predicted completion time. SpeQuloS monitors the execution of the BoT on the BE-DCIs, and dynamically supplies fast and reliable Cloud resources when the critical part of the BoT is executed. We present the design and development of the service and several strategies to decide when and how Cloud resources should be provisioned. Performance evaluation using simulations shows that SpeQuloS fulfill its objectives. It speeds-up the execution of BoTs, in the best cases by a factor greater than 2, while offloading less than 2.5% of the workload to the Cloud. We report on preliminary results after a complex deployment as part of the European Desktop Grid Infrastructure.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125462366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PonD: dynamic creation of HTC pool on demand using a decentralized resource discovery system","authors":"Kyungyong Lee, D. Wolinsky, R. Figueiredo","doi":"10.1145/2287076.2287105","DOIUrl":"https://doi.org/10.1145/2287076.2287105","url":null,"abstract":"High Throughput Computing (HTC) platforms aggregate heterogeneous resources to provide vast amounts of computing power over a long period of time. Typical HTC systems, such as Condor and BOINC, rely on central managers for resource discovery and scheduling. While this approach simplifies deployment, it requires careful system configuration and management to ensure high availability and scalability. In this paper, we present a novel approach that integrates a self-organizing P2P overlay for scalable and timely discovery of resources with unmodified client/server job scheduling middleware in order to create HTC virtual resource Pools on Demand (PonD). This approach decouples resource discovery and scheduling from job execution/monitoring - a job submission dynamically generates an HTC platform based upon resources discovered through match-making from a large \"sea\" of resources in the P2P overlay and forms a \"PonD\" capable of leveraging unmodified HTC middleware for job execution and monitoring. We show that job scheduling time of our approach scales with O(log N), where N is the number of resources in a pool, through first-order analytical models and large-scale simulation results. To verify the practicality of PonD, we have implemented a prototype using Condor (called C-PonD), a structured P2P overlay, and a PonD creation module. Experimental results with the prototype in two WAN environments (PlanetLab and the FutureGrid cloud computing testbed) demonstrates the utility of C-PonD as a HTC approach without relying on a central repository for maintaining all resource information. Though the prototype is based on Condor, the decoupled nature of the system components - decentralized resource discovery, PonD creation, job execution/monitoring - is generally applicable to other grid computing middleware systems.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131175773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a common model for pilot-jobs","authors":"André Luckow, M. Santcroos, Ole Weidner, André Merzky, Sharath Maddineni, S. Jha","doi":"10.1145/2287076.2287094","DOIUrl":"https://doi.org/10.1145/2287076.2287094","url":null,"abstract":"Pilot-Jobs have become one of the most successful abstractions in distributed computing. In spite of extensive uptake, there does not exist a well defined, unifying conceptual model of pilot-jobs which can be used to define, compare and contrast different implementations. This presents a barrier to extensibility and interoperability. This paper is an attempt to, (i) provide a minimal but complete model (P*) of pilot-jobs, (ii) establish the generality of the P* Model by mapping various existing and well known pilot-jobs frameworks such as Condor and DIANE to P*, (iii) demonstrate the interoperable and concurrent usage of distinct pilot-job frameworks on different production distributed cyberinfrastructures via the use of an extensible API for the P* Model (Pilot-API).","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"84 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115053417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A virtual memory based runtime to support multi-tenancy in clusters with GPUs","authors":"M. Becchi, Kittisak Sajjapongse, I. Graves, A. Procter, Vignesh T. Ravi, S. Chakradhar","doi":"10.1145/2287076.2287090","DOIUrl":"https://doi.org/10.1145/2287076.2287090","url":null,"abstract":"Graphics Processing Units (GPUs) are increasingly becoming part of HPC clusters. Nevertheless, cloud computing services and resource management frameworks targeting heterogeneous clusters including GPUs are still in their infancy. Further, GPU software stacks (e.g., CUDA driver and runtime) currently provide very limited support to concurrency.\u0000 In this paper, we propose a runtime system that provides abstraction and sharing of GPUs, while allowing isolation of concurrent applications. A central component of our runtime is a memory manager that provides a virtual memory abstraction to the applications. Our runtime is flexible in terms of scheduling policies, and allows dynamic (as opposed to programmer-defined) binding of applications to GPUs. In addition, our framework supports dynamic load balancing, dynamic upgrade and downgrade of GPUs, and is resilient to their failures. Our runtime can be deployed in combination with VM-based cloud computing services to allow virtualization of heterogeneous clusters, or in combination with HPC cluster resource managers to form an integrated resource management infrastructure for heterogeneous clusters. Experiments conducted on a three-node cluster show that our GPU sharing scheme allows up to a 28% and a 50% performance improvement over serialized execution on short- and long-running jobs, respectively. Further, dynamic inter-node load balancing leads to an additional 18-20% performance benefit.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131147488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locality-aware dynamic VM reconfiguration on MapReduce clouds","authors":"Jongse Park, DaeWoo Lee, Bokyeong Kim, Jaehyuk Huh, S. Maeng","doi":"10.1145/2287076.2287082","DOIUrl":"https://doi.org/10.1145/2287076.2287082","url":null,"abstract":"Cloud computing based on system virtualization, has been expanding its services to distributed data-intensive platforms such as MapReduce and Hadoop. Such a distributed platform on clouds runs in a virtual cluster consisting of a number of virtual machines. In the virtual cluster, demands on computing resources for each node may fluctuate, due to data locality and task behavior. However, current cloud services use a static cluster configuration, fixing or manually adjusting the computing capability of each virtual machine (VM). The fixed homogeneous VM configuration may not adapt to changing resource demands in individual nodes.\u0000 In this paper, we propose a dynamic VM reconfiguration technique for data-intensive computing on clouds, called Dynamic Resource Reconfiguration (DRR). DRR can adjust the computing capability of individual VMs to maximize the utilization of resources. Among several factors causing resource imbalance in the Hadoop platforms, this paper focuses on data locality. Although assigning tasks on the nodes containing their input data can improve the overall performance of a job significantly, the fixed computing capability of each node may not allow such locality-aware scheduling. DRR dynamically increases or decreases the computing capability of each node to enhance locality-aware task scheduling. We evaluate the potential performance improvement of DRR on a 100-node cluster, and its detailed behavior on a small scale cluster with constrained network bandwidth. On the 100-node cluster, DRR can improve the throughput of Hadoop jobs by 15% on average, and 41% on the private cluster with the constrained network connection.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"237 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116531596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}