{"title":"Dynamic adaptive scheduling for virtual machines","authors":"Chuliang Weng, Qian Liu, Lei Yu, Minglu Li","doi":"10.1145/1996130.1996163","DOIUrl":"https://doi.org/10.1145/1996130.1996163","url":null,"abstract":"With multi-core processors becoming popular, exploiting their computational potential becomes an urgent matter. The functionality of multiple standalone computer systems can be aggregated into a single hardware computer by virtualization, giving efficient usage of the hardware and decreased cost for power. Some principles of operating systems can be applied directly to virtual machine systems, however virtualization disrupts the basis of spinlock synchronization in the guest operating system, which results in performance degradation of concurrent workloads such as parallel programs or multi-threaded programs in virtual machines.\u0000 Eliminating this negative influence of virtualization on synchronization seems to be a non-trivial challenge, especially for concurrent workloads. In this work, we first demonstrate with parallel benchmarks that virtualization can cause long waiting times for spinlock synchronization in the guest operating system, resulting in performance degradation of parallel programs in the virtualized system. Then we propose an adaptive dynamic coscheduling approach to mitigate the performance degradation of concurrent workloads running in virtual machines, while keeping the performance of non-concurrent workloads. For this purpose, we build an adaptive scheduling framework with a series of algorithms to dynamically detect the occurrence of spinlocks with long waiting times, and determine and execute coscheduling of virtual CPUs on physical CPUs in the virtual machine monitor. We have implemented a prototype (ASMan) based on Xen and Linux. Experiments show that ASMan achieves better performance for concurrent workloads, while maintaining the performance for non-concurrent workloads. ASMan coscheduling depends directly on the dynamic behavior of virtual CPUs, unlike other approaches which depend on static properties of workloads and manual setting of rules. Therefore, ASMan achieves a better trade-off between coscheduling and non-coscheduling in the virtual machine monitor, and is an effective solution to this open issue.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115476881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithm-based recovery for iterative methods without checkpointing","authors":"Zizhong Chen","doi":"10.1145/1996130.1996142","DOIUrl":"https://doi.org/10.1145/1996130.1996142","url":null,"abstract":"In today's high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied to a wide range of applications, it often introduces a considerable overhead especially when computations reach petascale and beyond. In this paper, we show that, for many iterative methods, if the parallel data partitioning scheme satisfies certain conditions, the iterative methods themselves will maintain enough inherent redundant information for the accurate recovery of the lost data without checkpointing. We analyze the block row data partitioning scheme for sparse matrices and derive a sufficient condition for recovering the critical data without checkpointing. When this sufficient condition is satisfied, neither checkpoint nor roll-back is necessary for the recovery. Furthermore, the fault tolerance overhead (time) is zero if no actual failures occur during a program execution. Overhead is introduced only when an actual failure occurs. Experimental results demonstrate that, when it works, the proposed scheme introduces much less overhead than checkpointing on the current world's eighth-fastest supercomputer Kraken.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115491916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Capping the electricity cost of cloud-scale data centers with impacts on power markets","authors":"Yanwei Zhang, Yefu Wang, Xiaorui Wang","doi":"10.1145/1996130.1996170","DOIUrl":"https://doi.org/10.1145/1996130.1996170","url":null,"abstract":"In this paper, we propose a novel electricity cost capping algorithm that not only minimizes the electricity cost of operating cloud-scale data centers, but also enforces a cost budget on the monthly electricity bill. Our solution first explicitly models the impacts of power demands on electricity prices and the power consumption of cooling and networking in the minimization of electricity cost. In the second step, if the electricity cost exceeds a desired monthly budget due to unexpectedly high workloads, our solution guarantees the quality of service for premium customers and trades off the request throughput of ordinary customers. We formulate electricity cost capping as two related constrained optimization problems and propose an efficient algorithm based on mixed integer programming. Simulation results show that our solution outperforms the state-of-the-art solutions by having lower electricity costs and achieves desired cost capping with maximized request throughput.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122952169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InContext: simple parallelism for distributed applications","authors":"Sunghwan Yoo, Hyojeong Lee, C. Killian, Milind Kulkarni","doi":"10.1145/1996130.1996144","DOIUrl":"https://doi.org/10.1145/1996130.1996144","url":null,"abstract":"As networking services, such as DHTs, provide increasingly complex functionality, providing acceptable performance will require parallelizing their operations on individual nodes. Unfortunately, the event-driven style in which these applications have traditionally been written makes it difficult to reason about parallelism, and providing safe, efficient parallel implementations of distributed systems remains a challenge. In this paper, we introduce a declarative programming model based on contexts, which allows programmers to specify the sharing behavior of event handlers. Programs that adhere to the programming model can be safely parallelized according to an abstract execution model, with parallel behavior that is well-defined with respect to the expected sequential behavior. The declarative nature of the programming model allows conformance to be captured as a safety property that can be verified using a model checker.\u0000 We develop a prototype implementation of our abstract execution model and show that distributed applications written in our programming model can be automatically and efficiently parallelized. To recover additional parallelism, we present an optimization to the implementation based on state snapshots that permits more events to proceed in parallel. We evaluate our prototype implementation through several case studies and demonstrate significant speedup over optimized sequential implementations.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131148976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Live gang migration of virtual machines","authors":"Umesh Deshpande, Xiaoshuang Wang, Kartik Gopalan","doi":"10.1145/1996130.1996151","DOIUrl":"https://doi.org/10.1145/1996130.1996151","url":null,"abstract":"This paper addresses the problem of simultaneously migrating a group of co-located and live virtual machines (VMs), i.e, VMs executing on the same physical machine. We refer to such a mass simultaneous migration of active VMs as \"live gang migration\". Cluster administrators may often need to perform live gang migration for load balancing, system maintenance, or power savings. Application performance requirements may dictate that the total migration time, network traffic overhead, and service downtime, be kept minimal when migrating multiple VMs. State-of-the-art live migration techniques optimize the migration of a single VM. In this paper, we optimize the simultaneous live migration of multiple co-located VMs. We present the design, implementation, and evaluation of a de-duplication based approach to perform concurrent live migration of co-located VMs. Our approach transmits memory content that is identical across VMs only once during migration to significantly reduce both the total migration time and network traffic. Using the QEMU/KVM platform, we detail a proof-of-concept prototype implementation of two types of de-duplication strategies (at page level and sub-page level) and a differential compression approach to exploit content similarity across VMs. Evaluations over Gigabit Ethernet with various types of VM workloads demonstrate that our prototype for live gang migration can achieve significant reductions in both network traffic and total migration time.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123056742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancement of Xen's scheduler for MapReduce workloads","authors":"Hui Kang, Yao Chen, Jennifer L. Wong, R. Sion, Jason Wu","doi":"10.1145/1996130.1996164","DOIUrl":"https://doi.org/10.1145/1996130.1996164","url":null,"abstract":"As the trends move towards data outsourcing and cloud computing, the efficiency of distributed data centers increases in importance. Cloud-based services such as Amazon's EC2 rely on virtual machines (VMs) to host MapReduce clusters for large data processing. However, current VM scheduling does not provide adequate support for MapReduce workloads, resulting in degraded overall performance. For example, when multiple MapReduce clusters run on a single physical machine, the existing VMMscheduler does not guarantee fairness across clusters.\u0000 In this work, we present theMapReduce Group Scheduler (MRG). The MRG scheduler implements three mechanisms to improve the efficiency and fairness of the existing VMM scheduler. First, the characteristics of MapReduce workloads facilitate batching of I/O requests from VMs working on the same job, which reduces the number of context switches and brings other benefits. Second, because most MapReduce workloads incur a significant amount of I/O blocking events and the completion of a job depends on the progress of all nodes, we propose a two-level scheduling policy to achieve proportional fair sharing across both MapReduce clusters and individual VMs. Finally, the proposed MRG scheduler also operates on symmetric multi-processor (SMP) enabled platforms. The key to these improvements is to group the scheduling of VMs belonging to the same MapReduce cluster.\u0000 We have implemented the proposed scheduler by modifying the existing Xen hypervisor and evaluated the performance on Hadoop, an open source implementation of MapReduce. Our evaluations, using four representative MapReduce benchmarks, show that the proposed scheduler reduces context switch overhead and achieves increased proportional fairness across multiple MapReduce clusters, without penalizing the completion time of MapReduce jobs.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127044297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vrisha: using scaling properties of parallel programs for bug detection and localization","authors":"Bowen Zhou, Milind Kulkarni, S. Bagchi","doi":"10.1145/1996130.1996143","DOIUrl":"https://doi.org/10.1145/1996130.1996143","url":null,"abstract":"Detecting and isolating bugs that arise in parallel programs is a tedious and a challenging task. An especially subtle class of bugs are those that are scale-dependent: while small-scale test cases may not exhibit the bug, the bug arises in large-scale production runs, and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130673498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a profound analysis of bags-of-tasks in parallel systems and their performance impact","authors":"T. Minh, L. Wolters","doi":"10.1145/1996130.1996148","DOIUrl":"https://doi.org/10.1145/1996130.1996148","url":null,"abstract":"The Bag-of-Tasks (BoT) behaviour has recently drawn the attention of scheduling researchers [2, 36, 37] and seems to be very common in workloads of parallel systems (up to 70% of jobs [27]) and grids (up to 96% of the total CPU time is consumed by BoTs [11]). To enable a reliable evaluation of BoT-oriented scheduling algorithms, researchers require realistic workload models that take BoTs into account. Regrettably, very few such models are available in the liturature. To our best knowledge, there are only two studies on modeling that incorporate BoTs into their models to generate synthetic workloads for parallel systems [27] and grids [12]. However, these models only focus on fitting the marginal distributions and neglect several other statistical properties of BoTs such as periodicity, autocorrelation and cross-correlation among BoT attributes. We believe that these crucial characteristics deserve to be taken into account in modeling research. Therefore in this paper, we will focus on characterising the BoT behaviour to further improve researchers' understanding of this well-known behaviour in parallel system workloads. In addition, we also study how BoTs affect parallel system performance. Our experimental results indicate that the presence of BoTs leads to a considerable performance degradation, but it is interesting that a realistic association between job arrivals and job runtimes helps BoTs to improve the performance of parallel systems. Moreover, we also show the necessity of using workloads with BoTs in scheduling evaluation to obtain reliable results.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133938262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache injection for parallel applications","authors":"E. León, R. Riesen, Kurt B. Ferreira, A. Maccabe","doi":"10.1145/1996130.1996135","DOIUrl":"https://doi.org/10.1145/1996130.1996135","url":null,"abstract":"For two decades, the memory wall has affected many applications in their ability to benefit from improvements in processor speed. Cache injection addresses this disparity for I/O by writing data into a processor's cache directly from the I/O bus. This technique reduces data latency and, unlike data prefetching, improves memory bandwidth utilization. These improvements are significant for data-intensive applications whose performance is dominated by compulsory cache misses.\u0000 We present an empirical evaluation of three injection policies and their effect on the performance of two parallel applications and several collective micro-benchmarks. We demonstrate that the effectiveness of cache injection on performance is a function of the communication characteristics of applications, the injection policy, the target cache, and the severity of the memory wall. For example, we show that injecting message payloads to the L3 cache can improve the performance of network-bandwidth limited applications. In addition, we show that cache injection improves the performance of several collective operations, but not all-to-all operations (implementation dependent). Our study shows negligible pollution to the target caches.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128166641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tradeoffs Between Profit and Customer Satisfaction for Service Provisioning in the Cloud","authors":"Junliang Chen, Chen Wang, B. Zhou, Lei Sun, Young Choon Lee, Albert Y. Zomaya","doi":"10.1145/1996130.1996161","DOIUrl":"https://doi.org/10.1145/1996130.1996161","url":null,"abstract":"The recent cloud computing paradigm represents a trend of moving business applications to platforms run by parties located in different administrative domains. A cloud platform is often highly scalable and cost-effective through its pay-as-you-go pricing model. However, being shared by a large number of users, the running of applications in the platform faces higher performance uncertainty compared to a dedicated platform. Existing Service Level Agreements (SLAs) cannot sufficiently address the performance variation issue. In this paper, we use utility theory leveraged from economics and develop a new utility model for measuring customer satisfaction in the cloud. Based on the utility model, we design a mechanism to support utility-based SLAs in order to balance the performance of applications and the cost of running them. We consider an infrastructure-as-a-service type cloud platform (e.g., Amazon EC2), where a business service provider leases virtual machine (VM) instances with spot prices from the cloud and gains revenue by serving its customers. Particularly, we investigate the interaction of service profit and customer satisfaction. In addition, we present two scheduling algorithms that can effectively bid for different types of VM instances to make tradeoffs between profit and customer satisfaction. We conduct extensive simulations based on the performance data of different types of Amazon EC2 instances and their price history. Our experimental results demonstrate that the algorithms perform well across the metrics of profit, customer satisfaction and instance utilization.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"185 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116423061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}