{"title":"单gpu和多gpu图形处理中的位置感知线程块设计","authors":"Quan Fan, Zizhong Chen","doi":"10.1109/nas51552.2021.9605484","DOIUrl":null,"url":null,"abstract":"Graphics Processing Unit (GPU) has been adopted to process graphs effectively. Recently, multi-GPU systems are also exploited for greater performance boost. To process graphs on multiple GPUs in parallel, input graphs should be partitioned into parts using partitioning schemes. The partitioning schemes can impact the communication overhead, locality of memory accesses, and further improve the overall performance. We found that both intra-GPU data sharing and inter-GPU communication can be summarized as inter-TB communication. Based on this key idea, we propose a new graph partitioning scheme by redefining the input graph as a TB Graph with calculated vertex and edge weights, and then partition it to reduce intra & inter-GPU communication overhead and improve the locality at the granularity of Thread Blocks (TB). We also propose to develop a partitioning and mapping scheme for heterogeneous architectures including physical links with different bandwidths. The experimental results on graph partitioning show that our scheme is effective to improve the overall performance of the Breadth First Search (BFS) by up to 33%.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Locality-aware Thread Block Design in Single and Multi-GPU Graph Processing\",\"authors\":\"Quan Fan, Zizhong Chen\",\"doi\":\"10.1109/nas51552.2021.9605484\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graphics Processing Unit (GPU) has been adopted to process graphs effectively. Recently, multi-GPU systems are also exploited for greater performance boost. 
To process graphs on multiple GPUs in parallel, input graphs should be partitioned into parts using partitioning schemes. The partitioning schemes can impact the communication overhead, locality of memory accesses, and further improve the overall performance. We found that both intra-GPU data sharing and inter-GPU communication can be summarized as inter-TB communication. Based on this key idea, we propose a new graph partitioning scheme by redefining the input graph as a TB Graph with calculated vertex and edge weights, and then partition it to reduce intra & inter-GPU communication overhead and improve the locality at the granularity of Thread Blocks (TB). We also propose to develop a partitioning and mapping scheme for heterogeneous architectures including physical links with different bandwidths. The experimental results on graph partitioning show that our scheme is effective to improve the overall performance of the Breadth First Search (BFS) by up to 33%.\",\"PeriodicalId\":135930,\"journal\":{\"name\":\"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)\",\"volume\":\"56 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/nas51552.2021.9605484\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Networking, Architecture and Storage 
(NAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/nas51552.2021.9605484","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Locality-aware Thread Block Design in Single and Multi-GPU Graph Processing
Graphics Processing Units (GPUs) have been widely adopted to process graphs efficiently, and multi-GPU systems have recently been exploited for further performance gains. To process a graph on multiple GPUs in parallel, the input graph must be partitioned, and the choice of partitioning scheme affects the communication overhead and the locality of memory accesses, and thus the overall performance. We observe that both intra-GPU data sharing and inter-GPU communication can be unified as inter-thread-block (inter-TB) communication. Based on this key idea, we propose a new graph partitioning scheme: we redefine the input graph as a TB Graph with computed vertex and edge weights, and then partition it to reduce both intra- and inter-GPU communication overhead and to improve locality at the granularity of thread blocks (TBs). We also propose a partitioning and mapping scheme for heterogeneous architectures whose physical links have different bandwidths. Experimental results show that our partitioning scheme improves the overall performance of Breadth-First Search (BFS) by up to 33%.
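The core idea of coarsening a graph at thread-block granularity can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the paper's implementation: it groups consecutive vertices into thread-block-sized chunks (TB nodes), counts cross-chunk edges as TB-Graph edge weights, and then applies a naive greedy 2-way partition with a balance cap. The names (`build_tb_graph`, `greedy_bipartition`, `TB_SIZE`) and the greedy heuristic are illustrative choices; a real system would use a quality partitioner such as METIS on the weighted TB Graph.

```python
# Hypothetical sketch of TB-Graph construction and partitioning.
# All names and the partitioning heuristic are assumptions for illustration.

TB_SIZE = 4  # vertices handled per thread block (illustrative value)


def build_tb_graph(edges, tb_size=TB_SIZE):
    """Coarsen a vertex-level edge list into a TB Graph.

    Each vertex v maps to TB node v // tb_size; the weight of a TB-Graph
    edge counts vertex edges crossing between two TB nodes (edges inside
    one TB incur no inter-TB communication and are dropped here)."""
    weights = {}
    for u, v in edges:
        a, b = u // tb_size, v // tb_size
        if a != b:
            key = (min(a, b), max(a, b))
            weights[key] = weights.get(key, 0) + 1
    return weights


def greedy_bipartition(num_tbs, tb_edges):
    """Toy 2-way partition of TB nodes: place each TB node on the side
    where it has more already-placed neighbors (weighted), subject to a
    simple balance cap, so heavy TB-Graph edges tend to stay local."""
    cap = (num_tbs + 1) // 2          # max TB nodes per side
    side, sizes = {}, [0, 0]
    for tb in range(num_tbs):
        gain = [0, 0]                 # weighted affinity to each side
        for (a, b), w in tb_edges.items():
            other = b if a == tb else (a if b == tb else None)
            if other is not None and other in side:
                gain[side[other]] += w
        pref = 0 if gain[0] >= gain[1] else 1
        if sizes[pref] >= cap:        # respect the balance constraint
            pref = 1 - pref
        side[tb] = pref
        sizes[pref] += 1
    return side
```

For a chain of 8 vertices with `TB_SIZE = 4`, only the edge (3, 4) crosses the two chunks, so the TB Graph collapses to a single edge of weight 1 between TB nodes 0 and 1, and the bipartition places one TB node per GPU.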