Autonomous Data-Race-Free GPU Testing
T. Ta, Xianwei Zhang, Anthony Gutierrez, Bradford M. Beckmann
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019
DOI: 10.1109/IISWC47752.2019.9042019
Abstract: As the deep learning and high-performance computing markets continue to grow, hardware designers are increasingly optimizing future GPUs to run compute (a.k.a. GPGPU) workloads. A key area of optimization for these compute-oriented designs, which was not emphasized when GPUs exclusively executed graphics workloads, is inter-thread data sharing and synchronization. GPU cache coherence protocols now support these operations and are governed by a specified memory consistency model. In general, current GPU memory models are based on sequential consistency for data-race-free programs (SC for DRF), which mandates that data written to memory become globally visible only at certain synchronization points. GPU coherence protocols based on such relaxed memory models are particularly difficult to design and test due to the large number of memory accesses that may be reordered, leaving GPU hardware designers struggling to validate the correctness of cache coherence optimizations. To address this issue, this paper introduces a novel, completely autonomous random-testing methodology for complex GPU cache coherence protocols. Our framework continuously generates sequences of memory requests with minimal user intervention, using a mix of load, store, and atomic operations. The tester dynamically and autonomously checks each response against an expected global view of memory and immediately detects any inconsistency in the target coherence protocol, providing designers with detailed feedback on the issue. We then demonstrate the methodology on the popular cycle-level gem5 simulator by replacing its GPU core model with our testing framework. The results show that the GPU tester covers 94% and 100% of all reachable state transitions in the L1 and L2 caches, respectively, of a representative GPU coherence protocol. This coverage is 6.25% and 25% higher than that achieved by a suite of 26 applications. In addition, the tester runs more than 50 times faster than those applications, enabling efficient and fast protocol debugging.
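For readers unfamiliar with SC for DRF, the guarantee is easiest to see in code: plain data accesses may be cached or reordered freely, but a release/acquire synchronization pair forces the written data to become globally visible at the synchronization point. The sketch below is not from the paper; it is a minimal CPU-side analogy using C++11 atomics, and all names (`data`, `flag`, `producer`, `consumer`) are illustrative.

```cpp
// Minimal SC-for-DRF analogy: the plain access to `data` is race-free
// because it is ordered by the release/acquire pair on `flag`.
#include <atomic>
#include <cstdio>
#include <thread>

int data = 0;                   // plain (non-atomic) shared data
std::atomic<bool> flag{false};  // the synchronization point

void producer() {
    data = 42;                                    // may sit in a local cache...
    flag.store(true, std::memory_order_release);  // ...until this release
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) {}  // acquire pairs with release
    std::printf("%d\n", data);  // guaranteed to print 42: no data race
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

Because this program has no data races, SC for DRF says every execution behaves as if it were sequentially consistent; it is exactly this visibility-at-synchronization-points contract that the coherence protocol must implement and the paper's tester exercises.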
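To make the checking loop concrete, here is a heavily simplified sketch of the idea the abstract describes: issue randomized load/store/atomic requests and validate every response against an oracle copy of memory. This is an assumption-laden illustration, not the authors' gem5 harness; the `MemorySystem` interface, the zero-initialized memory, and the single-stream oracle (the real tester must reason about sets of legal values across concurrent wavefronts) are all simplifications.

```cpp
// Sketch of an autonomous random tester: generate memory requests,
// mirror their effects in an "expected" view, and flag any response
// that disagrees with that view. All interfaces here are hypothetical.
#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_map>

// Hypothetical device-under-test interface; a real harness would drive
// the simulated cache hierarchy / coherence protocol instead.
struct MemorySystem {
    virtual uint32_t load(uint64_t addr) = 0;
    virtual void store(uint64_t addr, uint32_t val) = 0;
    virtual uint32_t atomicAdd(uint64_t addr, uint32_t val) = 0;  // returns old value
    virtual ~MemorySystem() = default;
};

class RandomTester {
public:
    RandomTester(MemorySystem& mem, uint64_t numAddrs)
        : mem_(mem), addrDist_(0, numAddrs - 1), valDist_(1, 1u << 20) {}

    // Issue one random request and check the response immediately,
    // so a coherence bug surfaces at the first inconsistent reply.
    bool step() {
        const uint64_t addr = addrDist_(rng_);
        switch (rng_() % 3) {
        case 0: {  // load: result must match the expected global view
            const uint32_t got = mem_.load(addr);
            const uint32_t want = expected_[addr];  // memory assumed zero-initialized
            if (got != want) return fail("load", addr, got, want);
            break;
        }
        case 1: {  // store: update the expected view alongside the DUT
            const uint32_t v = valDist_(rng_);
            mem_.store(addr, v);
            expected_[addr] = v;
            break;
        }
        default: {  // atomic add: the returned value is the old contents
            const uint32_t v = valDist_(rng_);
            const uint32_t old = mem_.atomicAdd(addr, v);
            if (old != expected_[addr]) return fail("atomic", addr, old, expected_[addr]);
            expected_[addr] += v;
            break;
        }
        }
        return true;
    }

private:
    static bool fail(const char* op, uint64_t a, uint32_t got, uint32_t want) {
        std::fprintf(stderr, "%s mismatch @%#llx: got %u, expected %u\n",
                     op, static_cast<unsigned long long>(a), got, want);
        return false;
    }

    MemorySystem& mem_;
    std::unordered_map<uint64_t, uint32_t> expected_;  // the oracle view
    std::mt19937_64 rng_{0xC0FFEEu};
    std::uniform_int_distribution<uint64_t> addrDist_;
    std::uniform_int_distribution<uint32_t> valDist_;
};
```

In the paper's framework, an analogous check runs continuously inside the gem5 simulator in place of the GPU core model, which is what lets it detect an inconsistency in the target protocol immediately and report the offending access to the designer.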