Improving Spark performance with zero-copy buffer management and RDMA
Authors: Hu Li, Tian-Li Chen, W. Xu
Published in: 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS)
Publication date: 2016-04-10
DOI: https://doi.org/10.1109/INFCOMW.2016.7562041
Citations: 6
Abstract
With the ever-increasing demand for interactive data analytics, latency in big data frameworks is becoming more important. We present our preliminary experience designing and implementing NetSpark, an improved Spark [1] framework that is highly optimized for network latency. By combining optimizations to data serialization and network buffer management with hardware-supported Remote Direct Memory Access (RDMA) technology, we show that we can eliminate most of the end-to-end data copies, significantly reducing Spark task running time. Our preliminary experiments on a 10 Gbps data center network show that, compared with the legacy network stack, NetSpark improves the GroupBy operation in Spark by about 40% and the PageRank algorithm in GraphX by about 20%.
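The abstract does not describe NetSpark's buffer management in detail, so the following is only a generic illustration of what "eliminating a data copy" means, not the authors' implementation. It is a minimal Python sketch that contrasts copying a payload out of a buffer with taking a zero-copy view of it. The buffer contents and the names `copy_slice`, `zero_copy_slice`, and `network_buffer` are hypothetical, and RDMA itself is not modeled here.

```python
def copy_slice(buf: bytearray, start: int, end: int) -> bytes:
    """Return a new bytes object; the payload is copied out of buf."""
    return bytes(buf[start:end])


def zero_copy_slice(buf: bytearray, start: int, end: int) -> memoryview:
    """Return a view onto buf; no payload bytes are copied."""
    return memoryview(buf)[start:end]


# Hypothetical 16-byte "network buffer": a 4-byte header followed by a payload.
network_buffer = bytearray(b"HDR0" + b"payload-data")

copied = copy_slice(network_buffer, 4, 16)
view = zero_copy_slice(network_buffer, 4, 16)

# Overwrite part of the payload in place (same length, so no resize occurs).
# The change is visible through the view but not through the earlier copy,
# which shows that the view shares memory with the original buffer.
network_buffer[4:11] = b"PAYLOAD"

print(copied)       # the independent copy still holds the old payload
print(bytes(view))  # the view reflects the updated buffer contents
```

In this sketch the view avoids one copy per slice. The abstract reports that NetSpark applies the same general idea across serialization and network buffers, with RDMA handling the transfer between machines.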