Title: Energy-Aware Scheduling in Virtualized Datacenters
Authors: Íñigo Goiri, F. Julià, Ramon Nou, J. L. Berral, Jordi Guitart, J. Torres
DOI: 10.1109/CLUSTER.2010.15
Abstract: The reduction of energy consumption in large-scale datacenters is being accomplished through an extensive use of virtualization, which enables the consolidation of multiple workloads onto a smaller number of machines. Nevertheless, virtualization also incurs additional overheads (e.g., virtual machine creation and migration) that can influence which consolidated configuration is best, and thus must be taken into account. In this paper, we present a dynamic job scheduling policy for power-aware resource allocation in a virtualized datacenter. Our policy tries to consolidate workloads from separate machines onto a smaller number of nodes, while providing the hardware resources needed to preserve the quality of service of each job. This allows turning off the spare servers, thus reducing the overall datacenter power consumption. As a novelty, this policy incorporates all the virtualization overheads in the decision process. In addition, our policy is prepared to consider other important datacenter parameters, such as reliability or dynamic SLA enforcement, in a synergistic way with power consumption. The policy is evaluated against common policies in a simulated environment that accurately models HPC job execution in a virtualized datacenter, including power consumption modeling, and obtains a 15% reduction in power consumption with respect to typical policies.
Title: Asynchronous Algorithms in MapReduce
Authors: Karthik Kambatla, Naresh Rapolu, S. Jagannathan, A. Grama
DOI: 10.1109/CLUSTER.2010.30
Abstract: Asynchronous algorithms have been demonstrated to improve scalability of a variety of applications in parallel environments. Their distributed adaptations have received relatively less attention, particularly in the context of conventional execution environments and associated overheads. One such framework, MapReduce, has emerged as a commonly used programming framework for large-scale distributed environments. While the MapReduce programming model has proved to be effective for data-parallel applications, significant questions relating to its performance and application scope remain unresolved. The strict synchronization between map and reduce phases limits the expression of asynchrony and hence does not readily support asynchronous algorithms. This paper investigates the notion of partial synchronizations in iterative MapReduce applications to overcome global synchronization overheads. The proposed approach applies a locality-enhancing partition on the computation. Map tasks execute local computations with (relatively) frequent local synchronizations and less frequent global synchronizations. This approach yields significant performance gains in distributed environments, even though the resulting algorithms have higher serial operation counts. We demonstrate these performance gains on asynchronous algorithms for diverse applications, including PageRank, shortest path, and k-means. We make the following specific contributions in this paper: (i) we motivate the need to extend MapReduce with constructs for asynchrony, (ii) we propose an API to facilitate partial synchronizations combined with eager scheduling and locality-enhancing techniques, and (iii) we demonstrate performance improvements from our proposed extensions through a variety of applications from different domains.
{"title":"Efficient Parallel Subgraph Counting Using G-Tries","authors":"P. Ribeiro, Fernando M A Silva, Luís M. B. Lopes","doi":"10.1109/CLUSTER.2010.27","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.27","url":null,"abstract":"Finding and counting the occurrences of a collection of subgraphs within another larger network is a computationally hard problem, closely related to graph isomorphism. The subgraph count is by itself a very powerful characterization of a network and it is crucial for other important network measurements. G-tries are a specialized data-structure designed to store and search for subgraphs. By taking advantage of subgraph common substructure, g-tries can provide considerable speedups over previously used methods. In this paper we present a parallel algorithm based precisely on g-tries that is able to efficiently find and count subgraphs. The algorithm relies on randomized receiver-initiated dynamic load balancing and is able to stop its computation at any given time, efficiently store its search position, divide what is left to compute in two halfs, and resume from where it left. We apply our algorithm to several representative real complex networks from various domains and examine its scalability. We obtain an almost linear speedup up to 128 processors, thus allowing us to reach previously unfeasible limits. We showcase the multidisciplinary potential of the algorithm by also applying it to network motif discovery.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129715175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: CDRM: A Cost-Effective Dynamic Replication Management Scheme for Cloud Storage Cluster
Authors: Q. Wei, B. Veeravalli, Bozhao Gong, Lingfang Zeng, D. Feng
DOI: 10.1109/CLUSTER.2010.24
Abstract: Data replication has been widely used as a means of increasing the data availability of large-scale cloud storage systems where failures are normal. Aiming to provide cost-effective availability and to improve the performance and load balancing of cloud storage, this paper presents a cost-effective dynamic replication management scheme referred to as CDRM. A novel model is proposed to capture the relationship between availability and replica number. CDRM leverages this model to calculate and maintain the minimal replica number for a given availability requirement. Replica placement is based on the capacity and blocking probability of data nodes. By adjusting replica number and location according to workload changes and node capacity, CDRM can dynamically redistribute workloads among data nodes in the heterogeneous cloud. We implemented CDRM in the Hadoop Distributed File System (HDFS), and experimental results conclusively demonstrate that CDRM is cost-effective and outperforms the default replication management of HDFS in terms of performance and load balancing for large-scale cloud storage.
{"title":"Replication-Based Highly Available Metadata Management for Cluster File Systems","authors":"Zhuan Chen, Jin Xiong, Dan Meng","doi":"10.1109/CLUSTER.2010.34","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.34","url":null,"abstract":"In cluster file systems, the metadata management is critical to the whole system. Past researches mainly focus on journaling which alone is not enough to provide high-available metadata service. Some others try to use replication, but the extra latency accompanied is a main problem. To guarantee both availability and efficiency, we propose a mechanism for building highly available metadata servers based on replication, which integrates Paxos algorithm effectively into metadata service. The Packed Multi-Paxos is proposed to reduce the latency brought by replication, which is self-adaptive and can make the replication to achieve high throughput under heavy client load and low latency under light client load. By designing efficient architecture and coordination mechanism, all replica server nodes simultaneously provide metadata read-access service. This high-available mechanism could decrease the impact of server failures and there is no interruption of service. The performance results show that the latency caused by replication and redundancy is well under control, and the performance of metadata read operation gains improvement.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130956537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability","authors":"Bo Li, Zhigang Huo, P. Zhang, Dan Meng","doi":"10.1109/CLUSTER.2010.19","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.19","url":null,"abstract":"As one of the most important enabling technologies of cloud computing, virtualization brings to HPC good manageability, online system maintenance, performance isolation and fault isolation. Furthermore, previous study on VMM-bypass I/O that virtualizes OS-bypass networks (e.g. InfiniBand) relieved the worry of performance degradation coming along with virtualization. In this paper, we address the scalability challenges imposed upon OS-bypass networks under virtualized environments. The eXtended Reliable Connection (XRC) transport, proposed in modern high-speed interconnection networks to address the scalability problem in large scale applications, would not work in virtualized environments. To solve the problem, we propose VM-proof XRC design to eliminate the scalability gap between virtualized and native environments. Prototype evaluation shows that the virtualization of modern high-speed interconnection networks could get the same raw performance and scalability as in native non-virtualized environment with our VM-proof XRC design. The connection memory scalability shows a potential of 16 times improvement on virtualized clusters composed of 16-core nodes.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130681357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cluster versus GPU implementation of an Orthogonal Target Detection Algorithm for Remotely Sensed Hyperspectral Images","authors":"Abel Paz, A. Plaza","doi":"10.1109/CLUSTER.2010.28","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.28","url":null,"abstract":"Remotely sensed hyperspectral imaging instruments provide high-dimensional data containing rich information in both the spatial and the spectral domain. In many surveillance applications, detecting objects (targets) is a very important task. In particular, algorithms for detecting (moving or static) targets, or targets that could expand their size (such as propagating fires) often require timely responses for swift decisions that depend upon high computing performance of algorithm analysis. In this paper, we develop parallel versions of a target detection algorithm based on orthogonal subspace projections. The parallel implementations are tested in two types of parallel computing architectures: a massively parallel cluster of computers called Thunderhead and available at NASA’s Goddard Space Flight Center in Maryland, and a commodity graphics processing unit (GPU) of NVidia GeForce GTX 275 type. While the cluster-based implementation reveals itself as appealing for information extraction from remote sensing data already transmitted to Earth, the GPU implementation allows us to perform near real-time anomaly detection in hyperspectral scenes, with speedups over 50x with regards to a highly optimized serial version. The proposed parallel algorithms are quantitatively evaluated using hyperspectral data collected by the NASA’s Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) system over the World Trade Center (WTC) in New York, five days after the attacks that collapsed the two main towers in the WTC complex.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127680753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Tasks Reallocation in a Dedicated Grid Environment","authors":"Y. Caniou, G. Charrier, F. Desprez","doi":"10.1109/CLUSTER.2010.39","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.39","url":null,"abstract":"In this paper, we study the impact of tasks reallocation onto a multi-cluster environment where clusters are heterogeneous and use different resources management policies. In this context, we propose a reallocation mechanism that migrates waiting jobs from one cluster to another. We performed simulations using real traces to study benefits of reallocations. We compared two algorithms providing the reallocation mechanism, each with several heuristics to schedule jobs. Results show that in some cases it is possible to obtain a substantial gain on the average job response time (more than a factor of two). In the other cases, the reallocation mechanism is beneficial most of the time, making of great interest the implementation of a reallocation mechanism in a Grid framework.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115777171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Optimization Techniques at the I/O Forwarding Layer
Authors: Kazuki Ohta, D. Kimpe, Jason Cope, K. Iskra, R. Ross, Y. Ishikawa
DOI: 10.1109/CLUSTER.2010.36
Abstract: I/O is the critical bottleneck for data-intensive scientific applications on HPC systems and leadership-class machines. Applications running on these systems may encounter bottlenecks because the I/O systems cannot handle the overwhelming intensity and volume of I/O requests. Applications and systems use I/O forwarding to aggregate and delegate I/O requests to storage systems. In this paper, we present two optimization techniques at the I/O forwarding layer to further reduce I/O bottlenecks on leadership-class computing systems. The first optimization pipelines data transfers so that I/O requests overlap at the network and file system layers. The second optimization merges I/O requests and schedules I/O request delegation to the back-end parallel file systems. We implemented these optimizations in the I/O Forwarding Scalability Layer and evaluated them on the T2K Open Supercomputer at the University of Tokyo and the Surveyor Blue Gene/P system at the Argonne Leadership Computing Facility. On both systems, the optimizations improved application I/O throughput, but highlighted additional areas of I/O contention at the I/O forwarding layer that we plan to address.
{"title":"Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with uDAPL","authors":"Jasjit Singh, Yogeshwar Sonawane","doi":"10.1109/CLUSTER.2010.22","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.22","url":null,"abstract":"With an ever increasing demand for computing power, number of nodes to be deployed in a cluster based supercomputer is increasing. Limited hardware resources such as Endpoints (equivalent to Queue Pairs) on a Host Channel Adapter (HCA) of a high speed interconnect limit the scalability of a parallel application based on MPI that sets up reliable connections between every process pair using endpoints, prior to communication. In this paper, we propose a novel approach of multiplexing hardware endpoints (hweps) to extend scalability. (a) We discuss critical design issues with the multiplexing technique that differentiates a hwep from its software counterpart (swep) and enables sharing of hwep by multiple sweps. (b) We introduce the concept of Virtual Identifier (VID) which ensures that the connection between hardware endpoints is strictly one-to-one. (c) We also present static mapping scheme that offsets the overheads incurred due to multiplexing. User Direct Access Programming Library (uDAPL) defines a single set of APIs for all RDMA capable transports. We have incorporated the proposed multiplexing technique as a part of uDAPL implementation. Using this approach, we are able to scale MPI applications beyond the limit imposed by HCA and with no visible performance degradation.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116232936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}