Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) Pub Date : 2016-06-01 DOI:10.1145/3007787.3001167

Y. Yao, Zhonghai Lu

Pages: 279-290

Citations: 11

Abstract

As the degree of parallelism increases, the performance of multi-threaded shared-variable applications is limited not only by serialized critical section execution, but also by the serialized competition overhead that threads incur to gain access to the critical section. As the number of concurrent threads grows, this competition overhead may exceed the time spent in the critical section itself and become the dominant factor limiting the performance of parallel applications. In modern operating systems, the queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, is often used to lock critical sections. In this paper, we show that this advanced locking solution may create very high competition overhead for multi-threaded applications executing on NoC-based CMPs. We then propose a software-hardware cooperative mechanism that opportunistically maximizes the chance that a thread wins critical section access during the low-overhead spinning phase, thereby reducing the competition overhead. At the OS primitives level, we monitor the remaining times of retry (RTR) in a thread's spinning phase, which reflects how soon the thread must enter the high-overhead sleep mode. At the hardware level, we integrate the RTR information into the packets of locking requests and let the NoC prioritize locking request packets according to that information. The principle is that the smaller the RTR a locking request packet carries, the higher its priority and thus the quicker its delivery. We evaluate our opportunistic competition overhead reduction technique with cycle-accurate full-system simulations in GEM5 using the PARSEC (11 programs) and SPEC OMP2012 (14 programs) benchmarks. Compared to the original queue spinlock implementation, experimental results show that our method effectively increases the opportunity for threads to enter the critical section in the low-overhead spinning phase, reducing the competition overhead by 39.9% on average (61.8% at most) and accelerating the execution of the Region-of-Interest by 14.4% on average (24.5% at most) across all 25 benchmark programs.
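
One way to picture the prioritization principle in the abstract is an arbiter that always forwards the pending locking request with the smallest RTR. The sketch below is a minimal illustration in Python, not the authors' implementation. The names `LockRequest` and `RtrArbiter`, the heap-based queue, and the first-come-first-served tie-break among equal RTR values are all assumptions made for this example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class LockRequest:
    """Hypothetical locking request packet; ordering uses (rtr, seq) only."""
    rtr: int                                # remaining times of retry carried by the packet
    seq: int                                # arrival order, used only to break ties
    thread_id: int = field(compare=False)   # requester, ignored when comparing


class RtrArbiter:
    """Hypothetical arbiter: the request with the smallest RTR is granted first."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def enqueue(self, thread_id, rtr):
        # A smaller RTR means the thread is closer to the high-overhead sleep phase.
        heapq.heappush(self._heap, LockRequest(rtr, next(self._seq), thread_id))

    def grant(self):
        # Return the thread id of the most urgent request, or None if idle.
        if not self._heap:
            return None
        return heapq.heappop(self._heap).thread_id


arbiter = RtrArbiter()
arbiter.enqueue(thread_id=0, rtr=5)
arbiter.enqueue(thread_id=1, rtr=1)
arbiter.enqueue(thread_id=2, rtr=3)
arbiter.enqueue(thread_id=3, rtr=1)
order = [arbiter.grant() for _ in range(4)]
print(order)
```

In this toy model, threads 1 and 3 (RTR of 1) are served before thread 2 (RTR of 3) and thread 0 (RTR of 5), so the threads closest to falling asleep get the earliest chance at the lock. A real NoC would apply such a rule per router port and per arbitration stage, which this sketch does not attempt to model.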
