Coherency Traffic Reduction in Manycore Systems

Erdem Derebaşoğlu, I. Kadayif, O. Ozturk
{"title":"Coherency Traffic Reduction in Manycore Systems","authors":"Erdem Derebaşoğlu, I. Kadayif, O. Ozturk","doi":"10.1109/DSD57027.2022.00043","DOIUrl":null,"url":null,"abstract":"With the increasing number of cores in manycore accelerators and chip multiprocessors (CMPs), it gets more challenging to provide cache coherency efficiently. Although the snooping-based protocols are appropriate solutions to small-scale systems, they are inefficient for large systems because of the limited bandwidth. Therefore, large-scale manycores require directory-based solutions where a hardware structure called directory holds the information. This directory keeps track of all memory blocks and which cache stores a copy of these blocks. The directory sends messages only to caches that store relevant blocks and also coordinate simultaneous accesses to a cache block. As directory-based protocols scale to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become major problems. In this paper, we present software mechanisms to improve the effectiveness of directory-based cache coherency in manycore and multicore systems with shared memory. In multithreaded applications, some of the data accesses do not disrupt cache coherency, but they still produce coherency messages among cores such as read-only (private) data. However, if data is accessed by at least two cores and at least one of them is a write operation, it is called shared data and requires cache coherency. In our proposed system, private data and shared data are determined at compile time, and cache coherency protocol only applies to shared data. We implement our approach in two stages. First, we use Andersen's static pointer analysis to analyze the program and mark its private instructions, i.e., instructions that load or store private data. Then, we use these analyses to decide if cache coherency protocol will be applied or not at runtime. Our simulation results on parallel benchmarks show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic up to 13%.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 25th Euromicro Conference on Digital System Design (DSD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSD57027.2022.00043","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

With the increasing number of cores in manycore accelerators and chip multiprocessors (CMPs), it gets more challenging to provide cache coherency efficiently. Although the snooping-based protocols are appropriate solutions to small-scale systems, they are inefficient for large systems because of the limited bandwidth. Therefore, large-scale manycores require directory-based solutions where a hardware structure called directory holds the information. This directory keeps track of all memory blocks and which cache stores a copy of these blocks. The directory sends messages only to caches that store relevant blocks and also coordinate simultaneous accesses to a cache block. As directory-based protocols scale to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become major problems. In this paper, we present software mechanisms to improve the effectiveness of directory-based cache coherency in manycore and multicore systems with shared memory. In multithreaded applications, some of the data accesses do not disrupt cache coherency, but they still produce coherency messages among cores such as read-only (private) data. However, if data is accessed by at least two cores and at least one of them is a write operation, it is called shared data and requires cache coherency. In our proposed system, private data and shared data are determined at compile time, and cache coherency protocol only applies to shared data. We implement our approach in two stages. First, we use Andersen's static pointer analysis to analyze the program and mark its private instructions, i.e., instructions that load or store private data. Then, we use these analyses to decide if cache coherency protocol will be applied or not at runtime. Our simulation results on parallel benchmarks show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic up to 13%.
多核系统的一致性流量减少
随着多核加速器和芯片多处理器(cmp)中内核数量的不断增加,高效地提供缓存一致性变得越来越具有挑战性。虽然基于窥探的协议是小型系统的合适解决方案,但由于带宽有限,对于大型系统来说效率不高。因此,大规模多核需要基于目录的解决方案,其中称为目录的硬件结构保存信息。该目录跟踪所有内存块以及哪个缓存存储这些块的副本。该目录仅向存储相关块并协调对缓存块的同时访问的缓存发送消息。随着基于目录的协议扩展到多个核心,性能、片上网络(NoC)流量和带宽成为主要问题。在本文中,我们提出了一种软件机制来提高多核和多核共享内存系统中基于目录的缓存一致性的有效性。在多线程应用程序中,一些数据访问不会破坏缓存一致性,但它们仍然会在内核(如只读(私有)数据)之间产生一致性消息。但是,如果数据被至少两个核访问,并且其中至少一个是写操作,则称为共享数据,并且需要缓存一致性。在我们提出的系统中,私有数据和共享数据是在编译时确定的,缓存一致性协议只适用于共享数据。我们分两个阶段实施我们的方法。首先,我们使用Andersen的静态指针分析来分析程序并标记其私有指令,即加载或存储私有数据的指令。然后,我们使用这些分析来决定是否在运行时应用缓存一致性协议。我们在并行基准测试上的模拟结果表明,我们的方法将周期计数、动态随机存取存储器(DRAM)访问和一致性流量减少了13%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信