{"title":"GLocks:在多核cmp中对高竞争锁的有效支持","authors":"José L. Abellán, Juan Fernández, M. Acacio","doi":"10.1109/IPDPS.2011.87","DOIUrl":null,"url":null,"abstract":"Synchronization is of paramount importance to exploit thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures thus avoiding data dependency conflicts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity which interferes with applications. On the other hand, lock contention causes serialization which results in performance degradation. This paper proposes and evaluates \\textit{GLocks}, a hardware-supported implementation for highly-contended locks in the context of many-core CMPs. \\textit{GLocks} use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. This approach skips the memory hierarchy to provide a non-intrusive, extremely efficient and fair lock implementation with negligible impact on energy consumption or die area. A comprehensive comparison against the most efficient shared-memory-based lock implementation for a set of micro benchmarks and real applications quantifies the goodness of \\textit{GLocks}. Performance results show an average reduction of 42% and 14% in execution time, an average reduction of 76% and 23% in network traffic, and also an average reduction of 78% and 28% in energy-delay$^2$ product (ED$^2$P) metric for the full CMP for the micro benchmarks and the real applications, respectively. In light of our performance results, we can conclude that \\textit{GLocks} satisfy our initial working hypothesis. \\textit{GLocks} minimize cache-coherence network traffic due to lock synchronization which translates into reduced power consumption and execution time.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"2002 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":"{\"title\":\"GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs\",\"authors\":\"José L. Abellán, Juan Fernández, M. Acacio\",\"doi\":\"10.1109/IPDPS.2011.87\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Synchronization is of paramount importance to exploit thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures thus avoiding data dependency conflicts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity which interferes with applications. On the other hand, lock contention causes serialization which results in performance degradation. This paper proposes and evaluates \\\\textit{GLocks}, a hardware-supported implementation for highly-contended locks in the context of many-core CMPs. \\\\textit{GLocks} use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. 
This approach skips the memory hierarchy to provide a non-intrusive, extremely efficient and fair lock implementation with negligible impact on energy consumption or die area. A comprehensive comparison against the most efficient shared-memory-based lock implementation for a set of micro benchmarks and real applications quantifies the goodness of \\\\textit{GLocks}. Performance results show an average reduction of 42% and 14% in execution time, an average reduction of 76% and 23% in network traffic, and also an average reduction of 78% and 28% in energy-delay$^2$ product (ED$^2$P) metric for the full CMP for the micro benchmarks and the real applications, respectively. In light of our performance results, we can conclude that \\\\textit{GLocks} satisfy our initial working hypothesis. \\\\textit{GLocks} minimize cache-coherence network traffic due to lock synchronization which translates into reduced power consumption and execution time.\",\"PeriodicalId\":355100,\"journal\":{\"name\":\"2011 IEEE International Parallel & Distributed Processing Symposium\",\"volume\":\"2002 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"32\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE International Parallel & Distributed Processing Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS.2011.87\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE International Parallel & Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2011.87","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 32
Abstract
GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs
Synchronization is of paramount importance to exploit thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures, thus avoiding data dependency conflicts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity which interferes with applications. On the other hand, lock contention causes serialization, which results in performance degradation. This paper proposes and evaluates GLocks, a hardware-supported implementation of highly-contended locks in the context of many-core CMPs. GLocks use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. This approach skips the memory hierarchy to provide a non-intrusive, extremely efficient and fair lock implementation with negligible impact on energy consumption or die area. A comprehensive comparison against the most efficient shared-memory-based lock implementation, over a set of microbenchmarks and real applications, quantifies the benefits of GLocks. Performance results show an average reduction of 42% and 14% in execution time, an average reduction of 76% and 23% in network traffic, and an average reduction of 78% and 28% in the energy-delay² product (ED²P) metric for the full CMP, for the microbenchmarks and the real applications, respectively. In light of our performance results, we can conclude that GLocks satisfy our initial working hypothesis. GLocks minimize the cache-coherence network traffic due to lock synchronization, which translates into reduced power consumption and execution time.
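To make the baseline concrete, the sketch below shows the kind of shared-memory busy-wait lock that GLocks are contrasted with: every waiting core spins on a shared flag, so each release invalidates the waiters' cached copies and triggers the extra coherence traffic the abstract refers to. This is an illustrative test-and-test-and-set lock in C11, not code from the paper; the type and function names are hypothetical.

    /* Illustrative test-and-test-and-set spin lock (C11 atomics).
     * All waiters spin reading the shared 'locked' flag, so every
     * release invalidates their cached copies and causes a burst of
     * coherence traffic -- the overhead a dedicated-network,
     * token-passing scheme such as GLocks is designed to avoid. */
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } spinlock_t;

    static inline void spin_init(spinlock_t *l) {
        atomic_store_explicit(&l->locked, false, memory_order_relaxed);
    }

    static inline void spin_lock(spinlock_t *l) {
        for (;;) {
            /* Test: busy-wait on the shared variable. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;  /* each re-read may involve the coherence protocol */
            /* Test-and-set: attempt to acquire the lock. */
            if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
                return;
        }
    }

    static inline void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }

For reference, the ED²P figure quoted above is the energy-delay-squared product, i.e. energy × (execution time)², a standard metric that weights performance more heavily than energy consumption.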