扩展非缓存一致域中的共享内存多处理应用程序

Proceedings of the 13th ACM International Systems and Storage Conference Pub Date : 2020-05-30 DOI:10.1145/3383669.3398278

Ho-Ren Chuang, Robert Lyerly, Stefan Lankes, B. Ravindran

{"title":"扩展非缓存一致域中的共享内存多处理应用程序","authors":"Ho-Ren Chuang, Robert Lyerly, Stefan Lankes, B. Ravindran","doi":"10.1145/3383669.3398278","DOIUrl":null,"url":null,"abstract":"Due to the slowdown of Moore's Law, systems designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult - developers were forced to use programming models that exposed multiple memory regions, requiring developers to manually maintain memory consistency. Previous works proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and utilized complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are bottlenecks in the system. Additionally, many popular shared-memory programming models utilize memory consistency semantics similar to those proposed for DSM, leading to widespread adoption in mainstream programming. In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.","PeriodicalId":225327,"journal":{"name":"Proceedings of the 13th ACM International Systems and Storage Conference","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Scaling Shared Memory Multiprocessing Applications in Non-cache-coherent Domains\",\"authors\":\"Ho-Ren Chuang, Robert Lyerly, Stefan Lankes, B. Ravindran\",\"doi\":\"10.1145/3383669.3398278\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to the slowdown of Moore's Law, systems designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult - developers were forced to use programming models that exposed multiple memory regions, requiring developers to manually maintain memory consistency. Previous works proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and utilized complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are bottlenecks in the system. Additionally, many popular shared-memory programming models utilize memory consistency semantics similar to those proposed for DSM, leading to widespread adoption in mainstream programming. In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.\",\"PeriodicalId\":225327,\"journal\":{\"name\":\"Proceedings of the 13th ACM International Systems and Storage Conference\",\"volume\":\"13 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 13th ACM International Systems and Storage Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3383669.3398278\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th ACM International Systems and Storage Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3383669.3398278","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

由于摩尔定律的放缓，系统设计师已经开始集成非缓存一致的异构计算元素，以继续扩展性能。编程这样的系统传统上是困难的——开发人员被迫使用暴露多个内存区域的编程模型，要求开发人员手动维护内存一致性。以前的工作提出分布式共享内存(DSM)作为在此类系统中实现高可编程性的一种方法。然而，过去的DSM系统受到低带宽网络的困扰，并且使用复杂的内存一致性协议，这限制了它们的采用。最近，新的网络技术已经开始改变关于哪些组件是系统瓶颈的假设。此外，许多流行的共享内存编程模型利用与DSM相似的内存一致性语义，从而在主流编程中得到广泛采用。在这项工作中，我们认为是时候恢复DSM作为在非缓存一致系统上实现良好可编程性和性能的手段了。我们探索通过放宽内存一致性语义和暴露新的跨节点屏障原语来优化现有的DSM协议。我们将新机制集成到现有的OpenMP运行时中，允许开发人员利用跨节点执行而无需更改一行代码。在通过InfiniBand连接到ARMv8服务器的x86服务器上进行评估时，与基线DSM实现相比，DSM优化平均实现了11%(最高33%)的改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Scaling Shared Memory Multiprocessing Applications in Non-cache-coherent Domains

Due to the slowdown of Moore's Law, systems designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult - developers were forced to use programming models that exposed multiple memory regions, requiring developers to manually maintain memory consistency. Previous works proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and utilized complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are bottlenecks in the system. Additionally, many popular shared-memory programming models utilize memory consistency semantics similar to those proposed for DSM, leading to widespread adoption in mainstream programming. In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 13th ACM International Systems and Storage Conference

自引率

0.00%

发文量