{"title":"Characterization and Comparison of Cloud versus Grid Workloads","authors":"S. Di, Derrick Kondo, W. Cirne","doi":"10.1109/CLUSTER.2012.35","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.35","url":null,"abstract":"A new era of Cloud Computing has emerged, but the characteristics of Cloud load in data centers is not perfectly clear. Yet this characterization is critical for the design of novel Cloud job and resource management systems. In this paper, we comprehensively characterize the job/task load and host load in a real-world production data center at Google Inc. We use a detailed trace of over 25 million tasks across over 12,500 hosts. We study the differences between a Google data center and other Grid/HPC systems, from the perspective of both work load (w.r.t. jobs and tasks) and host load (w.r.t. machines). In particular, we study the job length, job submission frequency, and the resource utilization of jobs in the different systems, and also investigate valuable statistics of machine's maximum load, queue state and relative usage levels, with different job priorities and resource attributes. We find that the Google data center exhibits finer resource allocation with respect to CPU and memory than that of Grid/HPC systems. Google jobs are always submitted with much higher frequency and they are much shorter than Grid jobs. As such, Google host load exhibits higher variance and noise.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"25 4-5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120980968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Job Scheduling Design for Visualization Services Using GPU Clusters","authors":"Wei-Hsien Hsu, Chun-Fu Wang, K. Ma, Hongfeng Yu, Jacqueline H. Chen","doi":"10.1109/CLUSTER.2012.63","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.63","url":null,"abstract":"Modern large-scale heterogeneous computers incorporating GPUs offer impressive processing capabilities. It is desirable to fully utilize such systems for serving multiple users concurrently to visualize large data at interactive rates. However, as the disparity between data transfer speed and compute speed continues to increase in heterogeneous systems, data locality becomes crucial for performance. We present a new job scheduling design to support multi-user exploration of large data in a heterogeneous computing environment, achieving near optimal data locality and minimizing I/O overhead. The targeted application is a parallel visualization system which allows multiple users to render large volumetric data sets in both interactive mode and batch mode. We present a cost model to assess the performance of parallel volume rendering and quantify the efficiency of job scheduling. We have tested our job scheduling scheme on two heterogeneous systems with different configurations. The largest test volume data used in our study has over two billion grid points. The timing results demonstrate that our design effectively improves data locality for complex multi-user job scheduling problems, leading to better overall performance of the service.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116608581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Resource Utilization in MapReduce","authors":"Zhenhua Guo, G. Fox, Mo Zhou, Yang Ruan","doi":"10.1109/CLUSTER.2012.69","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.69","url":null,"abstract":"MapReduce has been adopted widely in both academia and industry to run large-scale data parallel applications. In MapReduce, each slave node hosts a number of task slots to which tasks can be assigned. So they limit the maximum number of tasks that can execute concurrently on each node. When all task slots of a node are not used, the resources “reserved” for idle slots are unutilized. To improve resource utilization, we propose resource stealing to enable running tasks to steal resources reserved for idle slots and give them back proportionally whenever new tasks are assigned. Resource stealing makes the otherwise wasted resources get fully utilized without interfering with normal job scheduling. MapReduce uses speculative execution to improve fault tolerance. Current Hadoop implementation decides whether to run speculative tasks based on the progress rates of running tasks, which does not take into consideration the absolute progress of each task. We propose Benefit Aware Speculative Execution which evaluates the potential benefit of speculative tasks and eliminates unnecessary runs. We implement the proposed algorithms in Hadoop, and our experiments show that our algorithms can significantly shorten job execution time and reduce the number of non-beneficial speculative tasks.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"330 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122743179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation and Optimization of Breadth-First Search on NUMA Cluster","authors":"Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv","doi":"10.1109/CLUSTER.2012.29","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.29","url":null,"abstract":"Graph is widely used in many areas. Breadth-First Search (BFS), a key subroutine for many graph analysis algorithms, has become the primary benchmark for Graph500 ranking. Due to the high communication cost of BFS, multi-socket nodes with large memory capacity (NUMA) are supposed to reduce network pressure. However, the longer latency to remote memory may cause problem if not treated well. In this work, we first demonstrate that simply spawning and binding one MPI process for each socket can achieve the best performance for MPI/OpenMP hybrid programmed BFS algorithm, resulting in 1.53X of performance on 16 nodes. Nevertheless, we notice that one MPI process per socket may exacerbate the communication cost. We propose to share some communication data structure among the processes inside the same node, to eliminate most of the intra-node communication. To fully utilize the network bandwidth, we make all the processes in a node to perform communication simultaneously. We further adjust the granularity of a key bitmap for better cache locality to speed up the computation. With all the optimizations for NUMA, communication and computation together, 2.44X of performance is achieved on 16 nodes, which is 39.2 Billion Traversed Edges per Second for an R-MAT graph of scale 32 (4 billion vertices and 64 billion edges).","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131281180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting Irregular Computations to Large CPU-GPU Clusters in the MADNESS Framework","authors":"Vlad Slavici, R. Varier, G. Cooperman, R. Harrison","doi":"10.1109/CLUSTER.2012.42","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.42","url":null,"abstract":"Graphics Processing Units (GPUs) are becoming the workhorse of scalable computations. MADNESS is a scientific framework used especially for computational chemistry. Most MADNESS applications use operators that involve many small tensor computations, resulting in a less regular organization of computations on GPUs. A single GPU kernel may have to multiply by hundreds of small square matrices (with fixed dimension ranging from 10 to 28). We demonstrate a scalable CPU-GPU implementation of the MADNESS framework over a 500-node partition on the Titan supercomputer. For this hybrid CPU-GPU implementation, we observe up to a 2.3-times speedup compared to an equivalent CPU-only implementation with 16 cores per node. For smaller matrices, we demonstrate a speedup of 2.2-times by using a custom CUDA kernel rather than a cuBLAS-based kernel.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123191088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ME2: Efficient Live Migration of Virtual Machine with Memory Exploration and Encoding","authors":"Yanqing Ma, Hongbo Wang, Jiankang Dong, Yangyang Li, Shiduan Cheng","doi":"10.1109/CLUSTER.2012.52","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.52","url":null,"abstract":"Live migration of virtual machine plays an important role in data center, which can successfully migrate virtual machine from one physical machine to another with only slight influence on upper workload. It can be used to facilitate hardware maintenance, load balancing, fault-tolerance and power-saving, especially in cloud computing data centers. Although the pre-copy is the prevailing approach, it cannot distinguish which memory page is used, resulting in transferring large amounts of useless memory pages. This paper presents a novel approach Memory Exploration and Encoding (ME2), which first identifies useful pages and then utilizes Run Length Encode algorithm to quickly encode memory, to efficiently decrease the total transferred data, total migration time and downtime. Experiments demonstrate that ME2 can significantly decrease 50.5% of total transferred data, 48.2% of total time and 47.6% of downtime on average compared with Xen's pre-copy algorithm.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124445385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"sEBP: Event Based Polling for Efficient I/O Virtualization","authors":"Kun Tian, Yaozu Dong, Xiang Mi, Haibing Guan","doi":"10.1109/CLUSTER.2012.50","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.50","url":null,"abstract":"Interrupt virtualization remains a key overhead source in high performance network virtualization (Single-root I/O virtualization or SR-IOV). SR-IOV can give close to line rate network bandwidth and good scalability in the 10 Gbps network environment, however the overhead of the interrupt virtualization in SR-IOV remains non-trivial, due to additional trap-and emulation overhead on the virtual interrupt controller, and high interrupt frequency brought by the high bandwidth network. In this paper we propose sEBP, an event-based polling model to eliminate the interrupts from the critical I/O paths in the virtual environment. A variety of system events are collected by sEBP, either at the guest kernel level or at the VMM level. Upon those events the NIC status is polled. The polling is lightweight, and plenty of system events fulfill the role of the interrupts. By removing the overhead of the interrupts, sEBP manages to achieve up to 59% performance improvement and 23% better scalability ratio.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"14 S3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113957665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Load Balancing Invocation Based on Application Characteristics","authors":"Harshitha Menon, Nikhil Jain, G. Zheng, L. Kalé","doi":"10.1109/CLUSTER.2012.61","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.61","url":null,"abstract":"Performance of applications executed on large parallel systems suffer due to load imbalance. Load balancing is required to scale such applications to large systems. However, performing load balancing incurs a cost which may not be known a priori. In addition, application characteristics may change due to its dynamic nature and the parallel system used for execution. As a result, deciding when to balance the load to obtain the best performance is challenging. Existing approaches put this burden on the users, who rely on educated guess and extrapolation techniques to decide on a reasonable load balancing period, which may not be feasible and efficient. In this paper, we propose the Meta-Balancer framework which relieves the application programmers of deciding when to balance load. By continuously monitoring the application characteristics and using a set of guiding principles, Meta-Balancer invokes load balancing on its own without any prior application knowledge. We demonstrate that Meta-Balancer improves or matches the best performance that can be obtained by fine tuning periodic load balancing. We also show that in some cases Meta-Balancer improves performance by 18% whereas periodic load balancing gives only a 1.5% benefit.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"49 15","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113974267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HerpRap: A Hybrid Array Architecture Providing Any Point-in-Time Data Tracking for Datacenter","authors":"Lingfang Zeng, D. Feng, Bo Mao, Jianxi Chen, Q. Wei, Wenguo Liu","doi":"10.1109/CLUSTER.2012.19","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.19","url":null,"abstract":"Both physical disk failure and logical errors such as software error, user abuse and virus attacks may cause data lose. The risk of logical errors is far greater than physical disk failure. Moreover, existing RAID solution cannot satisfy the reliability requirement in face of the logical errors in data centers. It is therefore becoming increasingly important for RAID-based storage systems to be able to recover data to any point-in-time when logical errors occur. We proposed a novel storage array architecture, Herp Rap, which is able to recover data from both physical disk failure and logical errors. We have implemented a prototype of Herp Rap and carried out extensive performance measurements using DBT-2 and file system benchmarks. Our experiments demonstrated that the proposed Herp Rap is able to track or recover data to any point-in-time quickly by tracing back the history of block logs. Moreover, Herp Rap outperforms existing HDD-based or SSD-based RAID5 with copy-on-write (COW) snapshot in terms of performance, energy efficiency, failure recovery ability and reliability.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129537307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"eco-IDC: Trade Delay for Energy Cost with Service Delay Guarantee for Internet Data Centers","authors":"Jianying Luo, Lei Rao, Xue Liu","doi":"10.1109/CLUSTER.2012.23","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.23","url":null,"abstract":"Cloud computing services are becoming integral part of people's daily life. These services are supported by Internet data centers (IDCs). As demand for cloud computing services soars, energy consumed by IDCs is skyrocketing. This paper studies an energy management problem - how to minimize energy cost for IDCs in deregulated electricity markets. While several existing works handle this problem by leveraging spatial diversity of electricity price, little has been done to address the temporal uncertainty in electricity price and arriving workload. This paper proposes a novel two-stage design and the eco-IDC (Energy Cost Optimization-IDC) algorithm to exploit temporal diversity of electricity price and dynamically schedule workload to execute on IDC servers through an input queue. Extensive evaluation experiments are performed to demonstrate that the proposed approach significantly reduces energy cost for IDCs, and guarantees a service delay bound for user requests.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124975711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}