Haibao Chen, Song Wu, S. Di, B. Zhou, Zhenjiang Xie, Hai Jin, Xuanhua Shi
{"title":"Communication-driven scheduling for virtual clusters in cloud","authors":"Haibao Chen, Song Wu, S. Di, B. Zhou, Zhenjiang Xie, Hai Jin, Xuanhua Shi","doi":"10.1145/2600212.2600714","DOIUrl":"https://doi.org/10.1145/2600212.2600714","url":null,"abstract":"Due to its high flexibility and cost-effectiveness, cloud computing is increasingly being explored as an alternative to local clusters by academic and commercial users. Recent research has already confirmed the feasibility of running tightly-coupled parallel applications with virtual clusters. However, such applications suffer from significant performance degradation, especially because overcommitment is common in the cloud. That is, the number of executable Virtual CPUs (VCPUs) is often larger than the number of available Physical CPUs (PCPUs) in the system. The performance degradation arises mainly because current Virtual Machine Monitors (VMMs) cannot co-schedule (or coordinate at the same time) the VCPUs that host parallel application threads/processes with synchronization requirements. We introduce a communication-driven scheduling approach for virtual clusters in this paper, which can effectively mitigate the performance degradation of tightly-coupled parallel applications running atop them in overcommitted situations. There are two key contributions. 1) We propose a communication-driven VM scheduling (CVS) algorithm, by which the involved VMM schedulers can autonomously schedule suitable VMs at runtime. 2) We integrate the CVS algorithm into the Xen VMM scheduler, and rigorously implement a prototype. 
We evaluate our design on a real cluster environment, and experiments show that our solution attains better performance for tightly-coupled parallel applications than the state-of-the-art approaches like Credit scheduler of Xen, balance scheduling, and hybrid scheduling.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115291439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"When paxos meets erasure code: reduce network and storage cost in state machine replication","authors":"Shuai Mu, Kang Chen, Yongwei Wu, Weimin Zheng","doi":"10.1145/2600212.2600218","DOIUrl":"https://doi.org/10.1145/2600212.2600218","url":null,"abstract":"Paxos-based state machine replication is a key technique to build highly reliable and available distributed services, such as lock servers, databases and other data storage systems. Paxos can tolerate any minority number of node crashes in an asynchronous network environment. Traditionally, Paxos is used to perform a full copy replication across all participants. However, full copy is expensive both in terms of network and storage cost, especially in wide-area settings with commodity hard drives. In this paper, we discussed the non-triviality and feasibility of combining erasure code into the Paxos protocol, and presented an improved protocol named RS-Paxos (Reed Solomon Paxos). To the best of our knowledge, we are the first to propose such a combination. Compared to Paxos, RS-Paxos requires a limitation on the number of possible failures. If the number of tolerated failures decreases by 1, RS-Paxos can save over 50% of network transmission and disk I/O. To demonstrate the benefits of our protocol, we designed and built a key-value store based on RS-Paxos, and evaluated it on EC2 with various settings. 
Experimental results show that RS-Paxos achieves up to a 2.5x improvement in write throughput and as much as a 30% reduction in latency, in common configurations.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116043830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CBL: exploiting community based locality for efficient content search in online social networks","authors":"Hanhua Chen, Fan Zhang, Hai Jin","doi":"10.1145/2600212.2600707","DOIUrl":"https://doi.org/10.1145/2600212.2600707","url":null,"abstract":"Retrieving relevant data for users in online social network (OSN) systems is a challenging problem. Cassandra, a storage system used by popular OSN systems, such as Facebook and Twitter, relies on a DHT-based scheme to randomly partition the personal data of users among servers across multiple data centers. Although DHT is highly scalable for hosting a large number of users (personal data), it leads to costly inter-server communications across data centers due to the complex interconnection and interaction among OSN users. In this paper, we explore how to retrieve the OSN content in a cost-effective way by retaining the simple and robust nature of OSNs. Our approach exploits a simple, yet powerful principle called Community-Based Locality (CBL), which posits that if a user has a one-hop neighbor within a particular community, it is very likely that the user has other one-hop neighbors inside the same community. We demonstrate the existence of community-based locality in diverse traces of popular OSN systems such as Facebook, Orkut, Flickr, YouTube, and LiveJournal. Based on the observation, we design a CBL-based algorithm to build the content index in OSN systems. By partitioning and indexing the relevant data of users within a community on the same server in the data center, the CBL-based index avoids a significant amount of inter-server communications during searching, making retrieving relevant data for a user in large-scale OSNs efficient. In addition, by using the CBL-based scheme we can provide much shorter query latency and balanced loads. We conduct comprehensive trace-driven simulations to evaluate the performance of the proposed scheme. 
Results show that our scheme significantly reduces the network traffic by 73% compared with existing schemes.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123936502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dong Dai, Yong Chen, D. Kimpe, R. Ross, Xuehai Zhou
{"title":"Domino: an incremental computing framework in cloud with eventual synchronization","authors":"Dong Dai, Yong Chen, D. Kimpe, R. Ross, Xuehai Zhou","doi":"10.1145/2600212.2600705","DOIUrl":"https://doi.org/10.1145/2600212.2600705","url":null,"abstract":"In recent years, more and more applications in the cloud have needed to process large-scale on-line data sets that evolve over time as entries are added or modified. Several programming frameworks, such as Percolator and Oolong, have been proposed for such incremental data processing and can achieve efficient updates with an event-driven abstraction. However, these frameworks are inherently asynchronous, leaving the heavy burden of managing synchronization to application developers. Such a limitation significantly restricts their usability. In this paper, we introduce a trigger-based incremental computing framework, called Domino, with a flexible synchronization mechanism and runtime optimizations to coordinate parallel triggers efficiently. With this new framework, both synchronous and asynchronous applications can be seamlessly developed. Use cases and current evaluation results confirm that the new Domino programming model delivers sufficient performance and is easy to use in large-scale distributed computing.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116214762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Prisacari, G. Rodríguez, P. Heidelberger, Dong Chen, C. Minkenberg, T. Hoefler
{"title":"Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks","authors":"B. Prisacari, G. Rodríguez, P. Heidelberger, Dong Chen, C. Minkenberg, T. Hoefler","doi":"10.1145/2600212.2600225","DOIUrl":"https://doi.org/10.1145/2600212.2600225","url":null,"abstract":"Dragonflies are a recent network design and one of the most promising topologies for the Exascale effort due to their scalability and cost. While being able to achieve very high throughput under random uniform all-to-all traffic, this type of network can experience significant performance degradation for other common high performance computing workloads such as stencil (multi-dimensional nearest neighbor) patterns. Often, the lack of peak performance is caused by an insufficient understanding of the interaction between the workload and the network, and an insufficient understanding of how application specific task-to-node mapping strategies can serve as optimization vehicles. To address these issues, we propose a theoretical performance analysis framework that takes as inputs a network specification and a traffic demand matrix characterizing an arbitrary workload and is able to predict where bottlenecks will occur in the network and what their impact will be on the effective sustainable injection bandwidth. 
We then focus our analysis on a specific high-interest communication pattern, the multi-dimensional Cartesian nearest neighbor exchange, and provide analytic bounds (owing to bottlenecks in the remote links of the Dragonfly) on its expected performance across a multitude of possible mapping strategies. Finally, using a comprehensive set of simulation results, we validate the correctness of the theoretical approach and in the process address some misconceptions regarding Dragonfly network behavior and evaluation (such as the choice of throughput maximization over workload completion time minimization as optimization objective) and the question of whether the standard notion of Dragonfly balance can be extended to workloads other than uniform random traffic.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128152482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rong Chen, X. Ding, Peng Wang, Haibo Chen, B. Zang, Haibing Guan
{"title":"Computation and communication efficient graph processing with distributed immutable view","authors":"Rong Chen, X. Ding, Peng Wang, Haibo Chen, B. Zang, Haibing Guan","doi":"10.1145/2600212.2600233","DOIUrl":"https://doi.org/10.1145/2600212.2600233","url":null,"abstract":"Cyclops is a new vertex-oriented graph-parallel framework for writing distributed graph analytics. Unlike existing distributed graph computation models, Cyclops retains simplicity and computation-efficiency by synchronously computing over a distributed immutable view, which grants a vertex read-only access to all its neighboring vertices. The view is provided via read-only replication of vertices for edges spanning machines during a graph cut. Cyclops follows a centralized computation model by assigning a master vertex to update and propagate the value to its replicas unidirectionally in each iteration, which can significantly reduce messages and avoid contention on replicas. Being aware of the pervasively available multicore-based clusters, Cyclops is further extended with a hierarchical processing model, which aggregates messages and replicas in a single multicore machine and transparently decomposes each worker into multiple threads on-demand for different stages of computation. We have implemented Cyclops based on an open-source Pregel clone called Hama. Our evaluation using a set of graph algorithms on an in-house multicore cluster shows that Cyclops outperforms Hama by 2.06X to 8.69X and by 5.95X to 23.04X using hash-based and Metis partitioning algorithms, respectively, due to the elimination of contention on messages and hierarchical optimization for the multicore-based clusters. 
Cyclops (written in Java) also has comparable performance with PowerGraph (written in C++) despite the language difference, due to the significantly lower number of messages and avoided contention.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133217930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Min Li, Liangzhao Zeng, S. Meng, Jian Tan, Li Zhang, A. Butt, Nicholas C. Fuller
{"title":"MRONLINE: MapReduce online performance tuning","authors":"Min Li, Liangzhao Zeng, S. Meng, Jian Tan, Li Zhang, A. Butt, Nicholas C. Fuller","doi":"10.1145/2600212.2600229","DOIUrl":"https://doi.org/10.1145/2600212.2600229","url":null,"abstract":"MapReduce job parameter tuning is a daunting and time-consuming task. The parameter configuration space is huge; there are more than 70 parameters that impact job performance. It is also difficult for users to determine suitable values for the parameters without first having a good understanding of the MapReduce application characteristics. Thus, it is a challenge to systematically explore the parameter space and select a near-optimal configuration. Extant offline tuning approaches are slow and inefficient as they entail multiple test runs and significant human effort. To this end, we propose an online performance tuning system, MRONLINE, that monitors a job's execution, tunes associated performance-tuning parameters based on collected statistics, and provides fine-grained control over parameter configuration. MRONLINE allows each task to have a different configuration, instead of having to use the same configuration for all tasks. Moreover, we design a gray-box based smart hill climbing algorithm that can efficiently converge to a near-optimal configuration with high probability. To improve the search quality and increase convergence speed, we also incorporate a set of MapReduce-specific tuning rules in MRONLINE. 
Our results using a real implementation on a representative 19-node cluster show that dynamic performance tuning can effectively improve MapReduce application performance by up to 30% compared to the default configuration used in YARN.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133872940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zbynek Falt, D. Bednárek, Martin Kruliš, J. Yaghob, F. Zavoral
{"title":"Bobolang: a language for parallel streaming applications","authors":"Zbynek Falt, D. Bednárek, Martin Kruliš, J. Yaghob, F. Zavoral","doi":"10.1145/2600212.2600711","DOIUrl":"https://doi.org/10.1145/2600212.2600711","url":null,"abstract":"At present, programmers may choose from a number of streaming languages. They cover various aspects of the development process of streaming applications; however, specification of complex or runtime-dependent parts of the applications still remains a great challenge. We have analysed a large number of requirements raised by the development of multiple data streaming parallel applications and proposed a novel language called Bobolang. It contains syntactic and semantic features which allow the programmer to naturally solve most of the problems that we encountered in the design of streaming applications. The language is used to specify the structure of the whole application as well as the inner structure of each operator. Thanks to the properties of the language, Bobolang can create an optimized evaluation plan which is capable of making the best use of the available hardware resources. The language has been employed in several practical problems and it has proven itself to be a very powerful tool for the development of data-intensive parallel applications.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125498126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nusrat S. Islam, Xiaoyi Lu, Md. Wasi-ur-Rahman, D. Panda
{"title":"SOR-HDFS: a SEDA-based approach to maximize overlapping in RDMA-enhanced HDFS","authors":"Nusrat S. Islam, Xiaoyi Lu, Md. Wasi-ur-Rahman, D. Panda","doi":"10.1145/2600212.2600715","DOIUrl":"https://doi.org/10.1145/2600212.2600715","url":null,"abstract":"In this paper, we propose SOR-HDFS, a SEDA (Staged Event-Driven Architecture)-based approach to improve the performance of the HDFS Write operation. This design not only incorporates RDMA-based communication over InfiniBand but also maximizes overlapping among different stages of data transfer and I/O. Performance evaluations show that the new design improves the aggregated write throughput of the Enhanced DFSIO benchmark in Intel HiBench by up to 64% and reduces the job execution time by 37% compared to IPoIB (IP over InfiniBand). Compared to the previous best RDMA-enhanced design [4], the improvements in throughput and execution time are 30% and 20%, respectively. Our design can also improve the performance of the HBase Put operation by up to 53% over IPoIB and 29% compared to the previous best RDMA-enhanced HDFS. To the best of our knowledge, this is the first design of SEDA-based HDFS in the literature.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126489834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Baker, Haiying Xu, J. Dennis, M. Levy, D. Nychka, S. Mickelson, Jim Edwards, M. Vertenstein, Al Wegener
{"title":"A methodology for evaluating the impact of data compression on climate simulation data","authors":"A. Baker, Haiying Xu, J. Dennis, M. Levy, D. Nychka, S. Mickelson, Jim Edwards, M. Vertenstein, Al Wegener","doi":"10.1145/2600212.2600217","DOIUrl":"https://doi.org/10.1145/2600212.2600217","url":null,"abstract":"High-resolution climate simulations require tremendous computing resources and can generate massive datasets. At present, preserving the data from these simulations consumes vast storage resources at institutions such as the National Center for Atmospheric Research (NCAR). The historical data generation trends are economically unsustainable, and storage resources are already beginning to limit science objectives. To mitigate this problem, we investigate the use of data compression techniques on climate simulation data from the Community Earth System Model. Ultimately, to convince climate scientists to compress their simulation data, we must be able to demonstrate that the reconstructed data reveals the same mean climate as the original data, and this paper is a first step toward that goal. To that end, we develop an approach for verifying the climate data and use it to evaluate several compression algorithms. 
We find that the diversity of the climate data requires the individual treatment of variables, and, in doing so, the reconstructed data can fall within the natural variability of the system, while achieving compression rates of up to 5:1.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131443385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}