Ge Zhang, Haozhan Wang, Zhongzhi Luan, Weiguo Wu, D. Qian
Improving Performance for Geo-Distributed Data Process in Wide-Area
2017 IEEE International Conference on Computer and Information Technology (CIT), published 2017-08-01
DOI: 10.1109/CIT.2017.48 (https://doi.org/10.1109/CIT.2017.48)
Citations: 2
Abstract
Organizations and end sensors around the globe produce massive amounts of data. The traditional way to analyze this data as a whole is to copy all of it to a central datacenter. This is neither practical nor efficient, because the volume of data to transfer is huge and network bandwidth is limited. Data privacy may also be a concern. Instead of transferring the data, we believe that moving the computation to where the data resides can be a better way to solve this problem. In this paper, we design an algorithm for geo-distributed big data processing that is both data-aware and network-aware. Considering the characteristics of the computation, we use data dependencies to determine data locality, and we use integer linear programming (ILP) to make the algorithm network-aware. Our algorithm is implemented on top of Spark. In our experiments, it improves the performance of geo-distributed data processing by 22%.
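The abstract does not give the ILP formulation, so the following is only an illustrative sketch of what a data-aware and network-aware placement decision could look like, not the authors' algorithm. It assigns each task to one site so that the total time spent pulling remote input over the wide-area links is minimized, subject to a per-site slot limit. All names and numbers (`INPUT_MB`, `BANDWIDTH_MBPS`, `SLOTS`, the three sites) are hypothetical, and the tiny integer program is solved by exhaustive search rather than by an ILP solver.

```python
from itertools import product

# Hypothetical inputs; none of these values come from the paper.
SITES = ["A", "B", "C"]

# INPUT_MB[task][site]: megabytes of the task's input stored at each site.
INPUT_MB = {
    "t1": {"A": 800, "B": 100, "C": 0},
    "t2": {"A": 0, "B": 600, "C": 200},
    "t3": {"A": 50, "B": 50, "C": 900},
}

# BANDWIDTH_MBPS[(src, dst)]: assumed wide-area bandwidth in MB/s.
BANDWIDTH_MBPS = {
    ("A", "B"): 50, ("B", "A"): 50,
    ("A", "C"): 10, ("C", "A"): 10,
    ("B", "C"): 25, ("C", "B"): 25,
}

# SLOTS[site]: how many tasks each site can run.
SLOTS = {"A": 1, "B": 1, "C": 1}


def transfer_seconds(task, site):
    """Time to pull the task's remote input to `site` (local data is free)."""
    return sum(
        mb / BANDWIDTH_MBPS[(src, site)]
        for src, mb in INPUT_MB[task].items()
        if src != site and mb > 0
    )


def best_placement():
    """Enumerate every task-to-site assignment and keep the cheapest feasible one."""
    tasks = list(INPUT_MB)
    best_cost, best = float("inf"), None
    for choice in product(SITES, repeat=len(tasks)):
        # Capacity constraint: no site may exceed its slot count.
        if any(choice.count(site) > SLOTS[site] for site in SITES):
            continue
        cost = sum(transfer_seconds(t, s) for t, s in zip(tasks, choice))
        if cost < best_cost:
            best_cost, best = cost, dict(zip(tasks, choice))
    return best, best_cost


if __name__ == "__main__":
    placement, seconds = best_placement()
    print(placement, seconds)
```

Exhaustive search is only workable for a handful of tasks and sites. For realistic sizes, the same objective and capacity constraint would be handed to an ILP solver, with one binary variable per (task, site) pair.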