Accelerating Spark Shuffle with RDMA
Bing Liu, Fang Liu, Nong Xiao, Zhiguang Chen
2018 IEEE International Conference on Networking, Architecture and Storage (NAS), published 2018-10-01
DOI: 10.1109/NAS.2018.8515724 (https://doi.org/10.1109/NAS.2018.8515724)

Apache Spark is a fast, unified analytics engine for large-scale data processing. When Spark executes an application, it runs many jobs in parallel, and each job is divided into stages at shuffle boundaries. Shuffling data between stages across a cluster is time-consuming, however, because it requires many remote file and network I/O operations, which places a significant burden on the operating system at both the source and the destination. Moreover, the latest version of Spark relies on Netty, which is built on Java sockets and produces a large number of data copies during the shuffle phase. This has become the major bottleneck for Apache Spark and motivates us to use RDMA technology to accelerate the data shuffle. With zero-copy transfers, low latency, and low CPU overhead, RDMA can reduce the stress on the operating system during the shuffle phase and improve the throughput of the whole system. In this paper, we present a high-performance RDMA-based design for accelerating data shuffle in the Apache Spark framework, which provides a tiered memory pool and uses different mechanisms to transfer messages of different sizes. Experimental results show that, compared with default Spark running over IP over InfiniBand (IPoIB), our design achieves up to 89.8% performance improvement on Spark RDD operation benchmarks (e.g., GroupBy and SortBy) and up to 49% performance improvement on iterative algorithms (e.g., TriangleCount and SVM in SparkBench). The evaluation also shows that our RDMA-based design slightly outperforms Crail-Spark-IO, a recent open-source Spark shuffle plugin from IBM.
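
The abstract does not spell out how the tiered memory pool or the size-dependent transfer mechanisms work, so the following is only a rough sketch of the general idea and not the authors' implementation. It assumes power-of-two size classes, a hypothetical 16 KiB threshold, and two hypothetical transfer labels ("send" for small messages, "read" for large ones); the names TieredBufferPool and choose_transfer are invented for illustration. Plain bytearray objects stand in for buffers that a real RDMA system would register with the network adapter, which is the costly step that pooling is meant to avoid repeating.

```python
from collections import defaultdict


class TieredBufferPool:
    """Toy pool that keeps free buffers grouped by power-of-two size class.

    Assumption: the tiers are power-of-two sizes between min_size and max_size.
    The source only says "tiering memory pool" and gives no layout.
    """

    def __init__(self, min_size=4096, max_size=1 << 20):
        self.min_size = min_size
        self.max_size = max_size
        self.free = defaultdict(list)  # size class -> reusable buffers
        self.allocations = 0           # number of fresh allocations so far

    def size_class(self, nbytes):
        """Return the smallest tier size that can hold nbytes."""
        size = self.min_size
        while size < nbytes:
            size *= 2
        return size

    def acquire(self, nbytes):
        """Reuse a free buffer from the matching tier, or allocate a new one."""
        if nbytes > self.max_size:
            raise ValueError("request exceeds the largest tier")
        tier = self.size_class(nbytes)
        if self.free[tier]:
            return self.free[tier].pop()
        self.allocations += 1
        return bytearray(tier)  # stand-in for a registered RDMA buffer

    def release(self, buf):
        """Return a buffer to the tier matching its capacity."""
        self.free[len(buf)].append(buf)


def choose_transfer(nbytes, threshold=16 * 1024):
    """Hypothetical policy: small messages use one path, large ones another."""
    return "send" if nbytes <= threshold else "read"


if __name__ == "__main__":
    pool = TieredBufferPool()
    first = pool.acquire(5000)    # falls into the 8192-byte tier
    pool.release(first)
    second = pool.acquire(6000)   # same tier, so the buffer is reused
    print(len(first), second is first, pool.allocations)
    print(choose_transfer(1024), choose_transfer(1 << 20))
```

In this sketch a released buffer is handed back to any later request that maps to the same tier, so repeated shuffle transfers of similar sizes would not trigger new allocations. The threshold and the two labels are placeholders; the actual cut-off and mechanisms would have to come from the paper itself.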