Efficient Data-Parallel Primitives on Heterogeneous Systems
Zhuohang Lai, Qiong Luo, Xiaolong Xie
Proceedings of the 48th International Conference on Parallel Processing, 2019-08-05
DOI: 10.1145/3337821.3337920 (https://doi.org/10.1145/3337821.3337920)
Citation count: 2
Abstract
Data-parallel primitives, such as gather, scatter, scan, and split, are widely used in data-intensive applications. However, it is challenging to optimize them on a system consisting of heterogeneous processors. In this paper, we study and compare the existing implementations and optimization strategies for a set of data-parallel primitives on three processors: the GPU, the CPU, and the Xeon Phi co-processor. Our goal is to identify the key performance factors in the implementations of data-parallel primitives on different architectures and to develop general strategies for implementing these primitives efficiently on various platforms. We introduce a portable and efficient sequential memory access pattern, which eliminates the cost of adjusting the memory access pattern for each individual device. With proper tuning, our optimized primitive implementations achieve performance comparable to that of the native versions. Moreover, our profiling results show that the CPU and the Phi co-processor share most optimization strategies, whereas the GPU differs from them significantly. This is due to hardware differences among these devices, such as the efficiency of vectorization, data and TLB caching, and data prefetching. We summarize these factors and deliver common primitive optimization strategies for heterogeneous systems.
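For readers unfamiliar with the four primitives named above, the following is a minimal sequential sketch of their conventional semantics in Python. It is not the implementation studied in the paper, and it says nothing about the memory access patterns or device-specific tuning the abstract refers to. The function names, the `bucket_of` parameter, and the choice of an exclusive prefix sum for scan are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def gather(src: Sequence[int], idx: Sequence[int]) -> List[int]:
    # out[i] = src[idx[i]]: indexed reads, sequential writes.
    return [src[i] for i in idx]


def scatter(src: Sequence[int], idx: Sequence[int], size: int) -> List[int]:
    # out[idx[i]] = src[i]: sequential reads, indexed writes.
    out = [0] * size
    for value, i in zip(src, idx):
        out[i] = value
    return out


def exclusive_scan(values: Sequence[int]) -> List[int]:
    # out[i] = sum of values[0..i-1] (exclusive prefix sum).
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out


def split(src: Sequence[int],
          bucket_of: Callable[[int], int],
          num_buckets: int) -> List[int]:
    # Stable partition into buckets, composed from the other primitives:
    # histogram -> exclusive scan (bucket start offsets) -> scatter.
    hist = [0] * num_buckets
    for x in src:
        hist[bucket_of(x)] += 1
    cursor = exclusive_scan(hist)  # next free slot per bucket
    dest = []
    for x in src:
        b = bucket_of(x)
        dest.append(cursor[b])
        cursor[b] += 1
    return scatter(src, dest, len(src))
```

In this sketch, split is expressed through scan and scatter, which is one common way the primitives build on each other; a parallel version would distribute the loops over threads or vector lanes.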