{"title":"Improving MapReduce Performance in a Heterogeneous Cloud: A Measurement Study","authors":"Xu Zhao, Ling Liu, Qi Zhang, Xiaoshe Dong","doi":"10.1109/CLOUD.2014.61","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.61","url":null,"abstract":"Hybrid clouds, geo-distributed clouds, and continuous upgrades of computing, storage, and networking resources in the cloud have driven datacenters to evolve towards heterogeneous clusters. Unfortunately, most MapReduce implementations are designed for homogeneous computing environments and perform poorly in heterogeneous clusters. Although a fair amount of research effort has been dedicated to improving MapReduce performance, there is still a lack of in-depth understanding of the key factors that affect the performance of MapReduce jobs in heterogeneous clusters. In this paper, we present an extensive experimental study on two categories of factors: system configuration and task scheduling. Our measurement study shows that an in-depth understanding of these factors is critical for improving MapReduce performance in a heterogeneous environment. We conclude with five key findings: (1) Early shuffle, though effective for reducing the latency of MapReduce jobs, can impact the performance of map tasks and reduce tasks differently when running on different types of nodes. (2) The two phases in map tasks differ in their sensitivity to input block size, and the ratio of the sort phase under different block sizes differs across node types. (3) Scheduling map or reduce tasks dynamically with node capacity and workload awareness can further enhance the job performance and improve resource consumption efficiency. (4) Although random scheduling of reduce tasks works well in homogeneous clusters, it can significantly degrade the performance in heterogeneous clusters when the shuffled data size is large. (5) Phase-aware progress rate estimation and speculation strategies can provide substantial performance gains over the state-of-the-art speculation scheduler.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127771556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evolving Big Data Stream Classification with MapReduce","authors":"Ahsanul Haque, Brandon Parker, L. Khan, B. Thuraisingham","doi":"10.1109/CLOUD.2014.82","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.82","url":null,"abstract":"Big Data Stream mining has some inherent challenges which are not present in traditional data mining. Not only does a Big Data Stream receive a large volume of data continuously, but it may also have different types of features. Moreover, the concepts and features tend to evolve throughout the stream. Traditional data mining techniques are not sufficient to address these challenges. In our current work, we have designed a multi-tiered ensemble-based method, HSMiner, to address the aforementioned challenges and label instances in an evolving Big Data Stream. However, this method requires building a large number of AdaBoost ensembles for each of the numeric features after receiving each new data chunk, which is very costly. Thus, HSMiner may face scalability issues when classifying a Big Data Stream. To address this problem, we propose three approaches to build this large number of AdaBoost ensembles using MapReduce-based parallelism. We compare each of these approaches from different aspects of design. We also show empirically that these approaches help our base method achieve significant scalability and speedup.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"99 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132871433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Live Migration with Small IO Performance Penalty by Exploiting SAN in Parallel","authors":"Soramichi Akiyama, Takahiro Hirofuchi, Ryousei Takano, S. Honiden","doi":"10.1109/CLOUD.2014.16","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.16","url":null,"abstract":"Virtualization techniques greatly benefit cloud computing. Live migration enables a datacenter to dynamically replace virtual machines (VMs) without disrupting services running on them. Efficient live migration is the key to improving the energy efficiency and resource utilization of a datacenter through dynamic placement of VMs. Recent studies have achieved efficient live migration by deleting the page cache of the guest OS to shrink its memory size before a migration. However, these studies do not solve the problem of the IO performance penalty after a migration due to the loss of the page cache. We propose an advanced memory transfer mechanism for live migration, which skips transferring the page cache to shorten total migration time while restoring it transparently from the guest OS via the SAN to prevent an IO performance penalty. To start a migration, our mechanism collects the mapping information between the page cache and disk blocks. During a migration, the source host skips transferring the page cache but transfers other memory content, while the destination host transfers the same data as the page cache from the disk blocks via the SAN. Experiments with web server and database workloads showed that our mechanism reduced total migration time with a significantly small IO performance penalty.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131920546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FRESH: Fair and Efficient Slot Configuration and Scheduling for Hadoop Clusters","authors":"Jiayin Wang, Yi Yao, Ying Mao, B. Sheng, N. Mi","doi":"10.1109/CLOUD.2014.106","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.106","url":null,"abstract":"Hadoop is an emerging framework for parallel big data processing. While becoming popular, Hadoop is too complex for regular users to fully understand all the system parameters and tune them appropriately. Especially when processing a batch of jobs, the default Hadoop settings may cause inefficient resource utilization and unnecessarily prolong the execution time. This paper considers an extremely important setting, slot configuration, which by default is fixed and static. We propose an enhanced Hadoop system called FRESH which can derive the best slot setting, dynamically configure slots, and appropriately assign tasks to the available slots. The experimental results show that when serving a batch of MapReduce jobs, FRESH significantly improves the makespan as well as the fairness among jobs.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131601312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Dynamic Resource Allocation Strategies in Virtualized Data Centers","authors":"A. Wolke, Lukas Ziegler","doi":"10.1109/CLOUD.2014.52","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.52","url":null,"abstract":"Virtualization technology allows a dynamic allocation of VMs to servers. It reduces server demand and increases energy efficiency of data centers. Dynamic control strategies migrate VMs between servers depending on their actual workload, a concept that promises further improvements in VM allocation efficiency. In this paper we evaluate the applicability of DSAP in a deterministic environment. DSAP is a linear program that calculates VM allocations and live migrations based on workload patterns known a priori. Efficiency is evaluated by simulations as well as on an experimental test bed infrastructure. Results are compared against alternative control approaches that we studied in preliminary works. Our findings are that dynamic allocation can reduce server demand at a reasonable service quality. Countermeasures are required to keep the number of live migrations under control.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132290885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixed-Tenancy in the Wild - Applicability of Mixed-Tenancy for Real-World Enterprise SaaS-Applications","authors":"S. Ruehl, Malte Rupprecht, Bjorn Morr, Matthias Reinhardt, S. Verclas","doi":"10.1109/CLOUD.2014.119","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.119","url":null,"abstract":"Software-as-a-Service (SaaS) is a delivery model whose basic idea is to provide applications to the customer on demand over the Internet. SaaS thereby promotes multi-tenancy as a tool to exploit economies of scale. This means that a single application instance serves multiple customers. However, a major drawback of SaaS is the customers' hesitation to share infrastructure, application code, or data with other tenants. This is because one of the major threats of multi-tenancy is information disclosure due to a system malfunction, system error, or aggressive actions. So far the only approach in research to counteract this hesitation has been to enhance the isolation between tenants using the same instance. Our approach (presented in earlier work) tackles this hesitation differently. It allows customers to choose whether, and with whom, they want to share the application. The approach enables customers to define their constraints for individual application components and the underlying infrastructure. The contribution of this paper is an analysis of the real-world applicability of the mixed-tenancy approach. This is done experimentally by applying the mixed-tenancy approach to OpenERP, an open source enterprise resource planning system used in industry. The conclusion drawn from this experiment is that the mixed-tenancy approach is technically realizable for real-world cases. However, there are scenarios where the mixed-tenancy approach is not economically worthwhile for the operator.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130767523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MediaPaaS: A Cloud-Based Media Processing Platform for Elastic Live Broadcasting","authors":"Bin Cheng","doi":"10.1109/CLOUD.2014.100","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.100","url":null,"abstract":"Mobility is changing how people consume live media content. By staying always connected to the Internet from various mobile devices, people expect to have an enhanced TV viewing experience from anywhere on any device. Therefore, live broadcasting needs to be widely accessible and customizable, instead of being passive content only on TV. In this paper we present a cloud-based media processing platform, called MediaPaaS, for enabling elastic live broadcasting in the cloud. As an ecosystem-oriented solution for content providers, we outsource complex media processing from both content providers and terminal devices to the cloud. A distributed media processing model is proposed to enable dynamic pipeline composition and cross-pipeline task sharing in the cloud for flexible live content processing. Also, a prediction-based task scheduling algorithm is presented to minimize cloud resource usage without affecting the quality of streams. The MediaPaaS platform allows third-party application developers to extend its capability to enable certain customization for running live channels. To our knowledge, this paper is the first work to openly discuss the detailed design issues of a cloud-based platform for elastic live broadcasting.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"69 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114036961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introducing SSDs to the Hadoop MapReduce Framework","authors":"Sangwhan Moon, J. Lee, Yang-Suk Kee","doi":"10.1109/CLOUD.2014.45","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.45","url":null,"abstract":"Solid State Drive (SSD) cost-per-bit continues to decrease. Consequently, system architects increasingly consider replacing Hard Disk Drives (HDDs) with SSDs to accelerate Hadoop MapReduce processing. When attempting this, system architects usually realize that SSD characteristics and today's Hadoop framework exhibit mismatches that impede indiscriminate SSD integration. Hence, cost-effective SSD utilization has proved challenging within many Hadoop environments. This paper compares SSD performance to HDD performance within a Hadoop MapReduce framework. It identifies extensible best practices that can exploit SSD benefits within Hadoop frameworks when combined with high network bandwidth and increased parallel storage access. Terasort benchmark results demonstrate that SSDs presently deliver significant cost-effectiveness when they store intermediate Hadoop data, leaving HDDs to store Hadoop Distributed File System (HDFS) source data.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114066274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Progger: An Efficient, Tamper-Evident Kernel-Space Logger for Cloud Data Provenance Tracking","authors":"R. Ko, M. Will","doi":"10.1109/CLOUD.2014.121","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.121","url":null,"abstract":"Cloud data provenance, or \"what has happened to my data in the cloud\", is a critical data security component which addresses pressing data accountability and data governance issues in cloud computing systems. In this paper, we present Progger (Provenance Logger), a kernel-space logger which potentially empowers all cloud stakeholders to trace their data. Logging from the kernel space empowers security analysts to collect provenance from the lowest possible atomic data actions, and enables several higher-level tools to be built for effective end-to-end tracking of data provenance. Within the last few years, there has been an increasing number of proposed kernel-space provenance tools, but they faced several critical data security and integrity problems. Some of these prior tools' limitations include (1) the inability to provide log tamper-evidence and prevention of fake/manual entries, (2) accurate and granular timestamp synchronisation across several machines, (3) log space requirements and growth, and (4) the efficient logging of root usage of the system. Progger has resolved all these critical issues, and as such, provides high assurance of data security and data activity audit. With this in mind, the paper will discuss these elements of high-assurance cloud data provenance, describe the design of Progger and its efficiency, and present compelling results which pave the way for Progger being a foundation tool used for data activity tracking across all cloud systems.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121060500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Note on Verifiable Privacy-Preserving Tries","authors":"Zachary A. Kissel, Jie Wang","doi":"10.1109/CLOUD.2014.134","DOIUrl":"https://doi.org/10.1109/CLOUD.2014.134","url":null,"abstract":"We describe a security flaw in the construction of the privacy-preserving trie presented in an ICC'12 paper. The flaw allows a semi-honest-but-curious cloud to forge a verifiable dictionary entry with a set of documents that do not contain the keyword in the query. We then proceed to offer a fix.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"233 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126806787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}