Title: Skyport - Container-Based Execution Environment Management for Multi-cloud Scientific Workflows
Authors: Wolfgang Gerlach, Wei Tang, Kevin P. Keegan, Travis Harrison, Andreas Wilke, Jared Bischof, M. D'Souza, Scott Devoid, Daniel Murphy-Olson, N. Desai, Folker Meyer
Published in: 2014 5th International Workshop on Data-Intensive Computing in the Clouds, 2014-11-16
DOI: 10.1109/DataCloud.2014.6 (https://doi.org/10.1109/DataCloud.2014.6)
Abstract: Linux container technology has recently been gaining attention, as it promises to transform the way software is developed and deployed. Portability and ease of deployment make Linux containers an ideal technology for scientific workflow platforms. Skyport uses Docker containers to address the software deployment problems and resource utilization inefficiencies inherent in existing scientific workflow platforms. As an extension to AWE/Shock, our data analysis platform that provides scalable workflow execution environments for scientific data in the cloud, Skyport greatly reduces the complexity of providing the environment necessary to execute complex workflows.
Title: To Overlap or Not to Overlap: Optimizing Incremental MapReduce Computations for On-Demand Data Upload
Authors: Stefan Ene, Bogdan Nicolae, Alexandru Costan, Gabriel Antoniu
Published in: 2014 5th International Workshop on Data-Intensive Computing in the Clouds, 2014-11-16
DOI: 10.1109/DataCloud.2014.7 (https://doi.org/10.1109/DataCloud.2014.7)
Abstract: Research on cloud-based Big Data analytics has so far focused on optimizing the performance and cost-effectiveness of the computations, while largely neglecting an important aspect: users need to upload massive datasets to clouds before their computations can run. This paper studies the problem of running MapReduce applications while jointly optimizing the performance and cost of the data upload and of its corresponding computation. We analyze the feasibility of incremental MapReduce approaches that advance the computation as much as possible during the data upload by computing intermediate results from already transferred data. Our key finding is that overlapping the transfer time with as many incremental computations as possible is not always efficient: a better solution is to wait until enough data has accumulated to fill the computational capacity of the MapReduce cluster. Results show significant performance and cost reductions compared with state-of-the-art solutions that leverage incremental computations in a naive fashion.
{"title":"Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications","authors":"Engin Arslan, Mrigank Shekhar, T. Kosar","doi":"10.1109/DataCloud.2014.10","DOIUrl":"https://doi.org/10.1109/DataCloud.2014.10","url":null,"abstract":"MapReduce is one of the leading programming frameworks to implement data-intensive applications by splitting the map and reduce tasks to distributed servers. Although there has been substantial amount of work on map task scheduling and optimization in the literature, the work on reduce task scheduling is very limited. Effective scheduling of the reduce tasks to the resources becomes especially important for the performance of data-intensive applications where large amounts of data are moved between the map and reduce tasks. In this paper, we propose a new algorithm (LoNARS) for reduce task scheduling, which takes both data locality and network traffic into consideration. Data locality awareness aims to schedule the reduce tasks closer to the map tasks to decrease the delay in data access as well as the amount of traffic pushed to the network. Network traffic awareness intends to distribute the traffic over the whole network and minimize the hotspots to reduce the effect of network congestion in data transfers. We have integrated LoNARS into Hadoop-1.2.1. Using our LoNARS algorithm, we achieved up to 15% gain in data shuffling time and up to 3-4% improvement in total job completion time compared to the other reduce task scheduling algorithms. Moreover, we reduced the amount of traffic on network switches by 15% which helps to save energy consumption considerably.","PeriodicalId":121831,"journal":{"name":"2014 5th International Workshop on Data-Intensive Computing in the Clouds","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115287693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Scalable Distributed Graph Database Engine for Hybrid Clouds","authors":"Miyuru Dayarathna, T. Suzumura","doi":"10.1109/DataCloud.2014.9","DOIUrl":"https://doi.org/10.1109/DataCloud.2014.9","url":null,"abstract":"Large graph data management and mining in clouds has become an important issue in recent times. We propose Acacia which is a distributed graph database engine for scalable handling of such large graph data. Acacia operates between the boundaries of private and public clouds. Acacia partitions and stores the graph data in the private cloud during its initial deployment. Acacia bursts into the public cloud when the resources of the private cloud are insufficient to maintain its service-level agreements. We implement Acacia using X10 programming language. We describe how Top-K PageRank has been implemented in Acacia. We report preliminary experiment results conducted with Acacia on a small compute cluster. Acacia is able to upload 69 million edges LiveJournal social network data set in about 10 minutes. Furthermore, Acacia calculates the average out degree of vertices of LiveJournal graph in 2 minutes. These results indicate Acacias potential for handling large graphs.","PeriodicalId":121831,"journal":{"name":"2014 5th International Workshop on Data-Intensive Computing in the Clouds","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117258048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating Pig with Harp to Support Iterative Applications with Fast Cache and Customized Communication","authors":"T. Wu, A. Koppula, J. Qiu","doi":"10.1109/DataCloud.2014.8","DOIUrl":"https://doi.org/10.1109/DataCloud.2014.8","url":null,"abstract":"Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data transformation operators with the standard Hadoop MapReduce runtime is very convenient. However, the current strategy of using high-level languages to support iterative applications with Hadoop MapReduce relies on an external wrapper script in other languages such as Python and Groovy, which causes significant performance loss when restarting mappers and reducers between jobs. In this paper, we reduce the extra job startup overheads by integrating Apache Pig with the high-performance Hadoop plug-in Harp developed at Indiana University. This provides fast data caching and customized communication patterns among iterations for data analysis. The results show performance improvements of factors from 2 to 5.","PeriodicalId":121831,"journal":{"name":"2014 5th International Workshop on Data-Intensive Computing in the Clouds","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131118961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}