Boosting performance of directory-based cache coherence protocols with coherence bypass at subpage granularity and a novel on-chip page table

Proceedings of the ACM International Conference on Computing Frontiers. Pub Date: 2016-05-16. DOI: 10.1145/2903150.2903175

M. Soltaniyeh, I. Kadayif, Özcan Özturk


Citations: 2

Abstract

Chip multiprocessors (CMPs) require effective cache coherence protocols as well as fast virtual-to-physical address translation mechanisms for high performance. Directory-based cache coherence protocols are the state-of-the-art approach in many-core CMPs for keeping data blocks coherent at the last-level private caches. However, the area overhead and high associativity requirement of the directory structures may not scale well as the number of cores grows. As shown in some prior studies, a significant percentage of data blocks are accessed by only one core, so it is not necessary to track these blocks in the directory structure. In this study, we make two major contributions. First, we show that, compared to classifying cache blocks at page granularity as done in some previous studies, classifying data blocks at subpage level detects considerably more private data blocks. Consequently, it significantly reduces the percentage of blocks that must be tracked in the directory compared to similar page-level classification approaches. This, in turn, enables smaller directory caches with lower associativity to be used in CMPs without hurting performance, thereby helping the directory structure scale gracefully with the increasing number of cores. Memory block classification at subpage level, however, may increase how often the operating system (OS) must update the per-subpage maintenance bits stored in page table entries, nullifying some of the performance benefit of subpage-level data classification. To overcome this, we propose a distributed on-chip page table as our second contribution.
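To make the first contribution concrete, the following is a minimal sketch, not the authors' mechanism, of why a finer classification granularity can leave fewer blocks to track. It treats a region (a page or a subpage) as shared if more than one core touches it, and counts the blocks in shared regions as the ones a directory would need to track. The block size, page size, four-subpages-per-page split, the `tracked_blocks` helper, and the access trace are all assumptions chosen for illustration. A real scheme would classify dynamically as accesses arrive, rather than after the fact.

```python
from collections import defaultdict

BLOCK_SIZE = 64      # bytes per cache block (assumed for illustration)
PAGE_SIZE = 4096     # bytes per page (assumed for illustration)


def tracked_blocks(trace, region_size):
    """Count distinct blocks lying in regions touched by more than one core.

    trace: iterable of (core_id, address) pairs.
    region_size: classification granularity in bytes (a page or a subpage).
    """
    cores_per_region = defaultdict(set)
    blocks_per_region = defaultdict(set)
    for core, addr in trace:
        region = addr // region_size
        cores_per_region[region].add(core)
        blocks_per_region[region].add(addr // BLOCK_SIZE)
    # Only regions seen by two or more cores are treated as shared.
    return sum(
        len(blocks_per_region[region])
        for region, cores in cores_per_region.items()
        if len(cores) > 1
    )


# Hypothetical trace within one page: core 0 uses the first quarter,
# core 1 uses the second quarter, and both touch one block in the last quarter.
trace = (
    [(0, addr) for addr in range(0, 1024, BLOCK_SIZE)]
    + [(1, addr) for addr in range(1024, 2048, BLOCK_SIZE)]
    + [(0, 3072), (1, 3072)]
)

page_level = tracked_blocks(trace, PAGE_SIZE)          # whole page is shared
subpage_level = tracked_blocks(trace, PAGE_SIZE // 4)  # only one subpage is shared
print(page_level, subpage_level)
```

In this toy trace, page-level classification marks every touched block in the page as shared, while subpage-level classification marks only the single block in the quarter that both cores actually access.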

