Benchmark DPC++ code and performance portability on heterogeneous architectures
Nenad Mijić, D. Davidovic
2023 46th MIPRO ICT and Electronics Convention (MIPRO), 2023-05-22
DOI: 10.23919/MIPRO57284.2023.10159832
Source code portability is becoming increasingly important in HPC software development as hardware diversifies and systems grow more heterogeneous. With Intel’s oneAPI suite of programming tools and the Data Parallel C++ (DPC++) compiler, a single source code base containing both host and device code can target hardware architectures from different vendors. Thanks to the compiler’s interoperability, such a program can be linked against existing libraries such as MPI and run on a distributed-memory system. In this paper we benchmark and analyze the performance achievable with the Intel DPC++ compiler, using the distributed Cholesky QR2 algorithm as an example and comparing it with the native CUDA and C++ implementation. The analysis shows that the performance degradation from using SYCL is negligible when a smaller number of nodes is used, at the cost of some additional hand-written optimizations in the SYCL code.
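The abstract names the Cholesky QR2 algorithm without spelling it out. As commonly described, it factors a tall matrix A by forming the Gram matrix A^T A, taking its Cholesky factor R1, computing Q1 = A R1^{-1}, and then repeating the same step on Q1 to restore orthogonality, so that R = R2 R1. The following is a minimal serial sketch in plain C++ under those assumptions. It is not the authors' SYCL or MPI code, and the `Matrix` type and all function names are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dense row-major matrix (hypothetical helper type for this sketch).
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Gram matrix G = A^T A. In a distributed run, each rank would form this from
// its local block of rows and the partial results would be summed across ranks.
Matrix gram(const Matrix& A) {
    Matrix G(A.cols, A.cols);
    for (std::size_t k = 0; k < A.rows; ++k)
        for (std::size_t i = 0; i < A.cols; ++i)
            for (std::size_t j = 0; j < A.cols; ++j)
                G(i, j) += A(k, i) * A(k, j);
    return G;
}

// Upper-triangular Cholesky factor R with G = R^T R.
Matrix cholesky_upper(const Matrix& G) {
    const std::size_t n = G.rows;
    Matrix R(n, n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {
            double s = G(i, j);
            for (std::size_t k = 0; k < i; ++k) s -= R(k, i) * R(k, j);
            if (i == j) {
                assert(s > 0.0);  // G must be numerically positive definite
                R(i, i) = std::sqrt(s);
            } else {
                R(i, j) = s / R(i, i);
            }
        }
    }
    return R;
}

// Solve Q R = A for Q (that is, Q = A R^{-1}) with R upper triangular.
Matrix solve_right_upper(const Matrix& A, const Matrix& R) {
    Matrix Q(A.rows, A.cols);
    for (std::size_t r = 0; r < A.rows; ++r)
        for (std::size_t j = 0; j < A.cols; ++j) {
            double s = A(r, j);
            for (std::size_t k = 0; k < j; ++k) s -= Q(r, k) * R(k, j);
            Q(r, j) = s / R(j, j);
        }
    return Q;
}

// Plain matrix product X * Y.
Matrix multiply(const Matrix& X, const Matrix& Y) {
    Matrix Z(X.rows, Y.cols);
    for (std::size_t i = 0; i < X.rows; ++i)
        for (std::size_t k = 0; k < X.cols; ++k)
            for (std::size_t j = 0; j < Y.cols; ++j)
                Z(i, j) += X(i, k) * Y(k, j);
    return Z;
}

struct QR { Matrix Q; Matrix R; };

// Two Cholesky QR passes: A = Q1 R1, then Q1 = Q R2, hence A = Q (R2 R1).
QR cholesky_qr2(const Matrix& A) {
    Matrix R1 = cholesky_upper(gram(A));
    Matrix Q1 = solve_right_upper(A, R1);
    Matrix R2 = cholesky_upper(gram(Q1));
    Matrix Q = solve_right_upper(Q1, R2);
    return {Q, multiply(R2, R1)};
}
```

In this sketch only `gram` would need communication in a row-distributed setting, since every other step acts either on the small n x n factor or on the rows a rank already owns.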