Autonomous Data-Race-Free GPU Testing
T. Ta, Xianwei Zhang, Anthony Gutierrez, Bradford M. Beckmann
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019
DOI: 10.1109/IISWC47752.2019.9042019
Abstract: As the deep learning and high-performance computing markets continue to grow, hardware designers are increasingly optimizing future GPUs to run compute (a.k.a. GPGPU) workloads. A key area of optimization for these compute-oriented designs, which was not emphasized when GPUs exclusively executed graphics workloads, is inter-thread data sharing and synchronization. GPU cache coherence protocols now support these operations and are governed by a specified memory consistency model. In general, current GPU memory models are based on sequential consistency for data-race-free programs (SC for DRF), which mandates that data written to memory become globally visible only at certain synchronization points. GPU coherence protocols based on such relaxed memory models are particularly difficult to design and test due to the large number of memory accesses that may be reordered, leaving GPU hardware designers struggling to validate the correctness of cache coherence optimizations. To address this issue, this paper introduces a novel, completely autonomous random-testing methodology for complex GPU cache coherence protocols. Our framework continuously generates sequences of memory requests with minimal user intervention, using a mix of load, store, and atomic operations. The tester dynamically and autonomously checks each response against an expected global view of memory and immediately detects any inconsistency in the target coherence protocol, providing designers with detailed feedback on the issue. We then demonstrate the methodology on the popular cycle-level gem5 simulator by replacing its GPU core model with our testing framework. The results show that the GPU tester covers 94% and 100% of all reachable state transitions in the L1 and L2 caches, respectively, of a representative GPU coherence protocol. This coverage is 6.25% and 25% higher than that achieved by a suite of 26 applications. In addition, the tester runs more than 50 times faster than those applications, enabling efficient and fast protocol debugging.
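For readers unfamiliar with SC for DRF, the guarantee is easiest to see in code: plain data accesses may be cached or reordered freely, but a release/acquire synchronization pair forces the written data to become globally visible at the synchronization point. The sketch below is not from the paper; it is a minimal CPU-side analogy using C++11 atomics, and all names (`data`, `flag`, `producer`, `consumer`) are illustrative.

```cpp
// Minimal SC-for-DRF analogy: the plain access to `data` is race-free
// because it is ordered by the release/acquire pair on `flag`.
#include <atomic>
#include <cstdio>
#include <thread>

int data = 0;                   // plain (non-atomic) shared data
std::atomic<bool> flag{false};  // the synchronization point

void producer() {
    data = 42;                                    // may sit in a local cache...
    flag.store(true, std::memory_order_release);  // ...until this release
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) {}  // acquire pairs with release
    std::printf("%d\n", data);  // guaranteed to print 42: no data race
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

Because this program has no data races, SC for DRF says every execution behaves as if it were sequentially consistent; it is exactly this visibility-at-synchronization-points contract that the coherence protocol must implement and the paper's tester exercises.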
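To make the checking loop concrete, here is a heavily simplified sketch of the idea the abstract describes: issue randomized load/store/atomic requests and validate every response against an oracle copy of memory. This is an assumption-laden illustration, not the authors' gem5 harness; the `MemorySystem` interface, the zero-initialized memory, and the single-stream oracle (the real tester must reason about sets of legal values across concurrent wavefronts) are all simplifications.

```cpp
// Sketch of an autonomous random tester: generate memory requests,
// mirror their effects in an "expected" view, and flag any response
// that disagrees with that view. All interfaces here are hypothetical.
#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_map>

// Hypothetical device-under-test interface; a real harness would drive
// the simulated cache hierarchy / coherence protocol instead.
struct MemorySystem {
    virtual uint32_t load(uint64_t addr) = 0;
    virtual void store(uint64_t addr, uint32_t val) = 0;
    virtual uint32_t atomicAdd(uint64_t addr, uint32_t val) = 0;  // returns old value
    virtual ~MemorySystem() = default;
};

class RandomTester {
public:
    RandomTester(MemorySystem& mem, uint64_t numAddrs)
        : mem_(mem), addrDist_(0, numAddrs - 1), valDist_(1, 1u << 20) {}

    // Issue one random request and check the response immediately,
    // so a coherence bug surfaces at the first inconsistent reply.
    bool step() {
        const uint64_t addr = addrDist_(rng_);
        switch (rng_() % 3) {
        case 0: {  // load: result must match the expected global view
            const uint32_t got = mem_.load(addr);
            const uint32_t want = expected_[addr];  // memory assumed zero-initialized
            if (got != want) return fail("load", addr, got, want);
            break;
        }
        case 1: {  // store: update the expected view alongside the DUT
            const uint32_t v = valDist_(rng_);
            mem_.store(addr, v);
            expected_[addr] = v;
            break;
        }
        default: {  // atomic add: the returned value is the old contents
            const uint32_t v = valDist_(rng_);
            const uint32_t old = mem_.atomicAdd(addr, v);
            if (old != expected_[addr]) return fail("atomic", addr, old, expected_[addr]);
            expected_[addr] += v;
            break;
        }
        }
        return true;
    }

private:
    static bool fail(const char* op, uint64_t a, uint32_t got, uint32_t want) {
        std::fprintf(stderr, "%s mismatch @%#llx: got %u, expected %u\n",
                     op, static_cast<unsigned long long>(a), got, want);
        return false;
    }

    MemorySystem& mem_;
    std::unordered_map<uint64_t, uint32_t> expected_;  // the oracle view
    std::mt19937_64 rng_{0xC0FFEEu};
    std::uniform_int_distribution<uint64_t> addrDist_;
    std::uniform_int_distribution<uint32_t> valDist_;
};
```

In the paper's framework, an analogous check runs continuously inside the gem5 simulator in place of the GPU core model, which is what lets it detect an inconsistency in the target protocol immediately and report the offending access to the designer.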