BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI:10.4230/OASIcs.PARMA-DITAM.2021.2

Samuel Irving, Lu Peng, C. Busch, J. Peir


Citations: 1

Abstract


We present BifurKTM, the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput, and KoDTM, a Distributed Transactional Memory model that combines the Data-flow and Control-flow models to greatly reduce communication overheads. Despite the allure of huge speedups, GPUs see limited use because they are difficult to program and extremely sensitive to workload characteristics. These concerns become daunting in a distributed GPU cluster, where a programmer must design algorithms that hide communication latency by exploiting data regularity, high compute intensity, and similar properties. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (i.e., reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions "appear" to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory. Our experiments show that Approximate Consistency can improve BifurKTM (BkTM) performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.

2012 ACM Subject Classification: Computer systems organization → Heterogeneous (hybrid) systems
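To make the idea of Read-Only transactions that "appear" to execute at a convenient time more concrete, here is a minimal single-threaded sketch in Python. It is not BifurKTM's implementation or API: the names `VersionedStore`, `run_update`, and `run_read_only` are hypothetical, and the multi-version snapshot technique is an assumption chosen only to illustrate the general principle. Update transactions validate what they read and may abort, while a read-only transaction is served from an older committed snapshot, so it never conflicts and never writes anything incorrect to shared state.

```python
class VersionedStore:
    """Toy shared memory: every key keeps its committed (version, value) history."""

    def __init__(self, initial):
        self.clock = 0  # global commit counter
        self.history = {key: [(0, value)] for key, value in initial.items()}

    def read_at(self, key, snapshot):
        """Return the newest value of `key` committed at or before `snapshot`."""
        for version, value in reversed(self.history[key]):
            if version <= snapshot:
                return value
        raise KeyError(key)

    def latest_version(self, key):
        return self.history[key][-1][0]

    def commit(self, writes):
        """Publish all writes atomically under a new version number."""
        self.clock += 1
        for key, value in writes.items():
            self.history[key].append((self.clock, value))
        return self.clock


def run_update(store, start_clock, read_keys, compute_writes):
    """Strictly consistent update: abort if any key it read changed since start."""
    values = {key: store.read_at(key, start_clock) for key in read_keys}
    writes = compute_writes(values)
    if any(store.latest_version(key) > start_clock for key in read_keys):
        return False  # conflict detected; a real system would retry
    store.commit(writes)
    return True


def run_read_only(store, snapshot, read_keys):
    """Approximately consistent read: uses an older snapshot and never aborts."""
    return {key: store.read_at(key, snapshot) for key in read_keys}


store = VersionedStore({"a": 10, "b": 20})
snapshot = store.clock  # both a reader and a slow writer start here

# A fast writer commits first and advances the clock.
first_ok = run_update(store, store.clock, ["a"], lambda v: {"a": v["a"] + 1})

# The slow writer read "a" at the old snapshot, so validation fails and it aborts.
second_ok = run_update(store, snapshot, ["a"], lambda v: {"a": v["a"] + 100})

# The reader still succeeds: it appears to have executed before the first commit.
view = run_read_only(store, snapshot, ["a", "b"])
```

In this toy model the reader observes a stale but internally consistent state, which is the trade-off the abstract describes: staleness is tolerated only for transactions the programmer has marked as suitable, and shared memory itself is never corrupted.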
