Coherency Traffic Reduction in Manycore Systems

2022 25th Euromicro Conference on Digital System Design (DSD). Pub Date: 2022-08-01. DOI: 10.1109/DSD57027.2022.00043

Erdem Derebaşoğlu, I. Kadayif, O. Ozturk


Citations: 1

Abstract

With the increasing number of cores in manycore accelerators and chip multiprocessors (CMPs), providing cache coherency efficiently becomes more challenging. Although snooping-based protocols are appropriate for small-scale systems, they are inefficient for large systems because of limited bandwidth. Large-scale manycores therefore require directory-based solutions, in which a hardware structure called the directory holds the coherency information. The directory keeps track of all memory blocks and of which caches store a copy of each block. It sends messages only to the caches that store the relevant blocks, and it also coordinates simultaneous accesses to a cache block. As directory-based protocols scale to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become major problems. In this paper, we present software mechanisms to improve the effectiveness of directory-based cache coherency in manycore and multicore systems with shared memory. In multithreaded applications, some data accesses, such as accesses to read-only (private) data, do not disrupt cache coherency but still produce coherency messages among cores. However, if data is accessed by at least two cores and at least one of the accesses is a write, it is called shared data and requires cache coherency. In our proposed system, private and shared data are identified at compile time, and the cache coherency protocol is applied only to shared data. We implement our approach in two stages. First, we use Andersen's static pointer analysis to analyze the program and mark its private instructions, i.e., instructions that load or store private data. Then, we use the results of this analysis to decide at runtime whether the cache coherency protocol is applied. Our simulation results on parallel benchmarks show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic by up to 13%.
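The two stages described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy inclusion-based (Andersen-style) points-to solver followed by a classification pass. It assumes a hypothetical constraint format (`addr`, `copy`, `load`, `store`) and a hypothetical list of per-thread accesses, and it uses threads as a stand-in for cores. The function names `andersen_points_to`, `classify_objects`, and `private_instructions` are invented for illustration.

```python
from collections import defaultdict


def _add(pts, var, objs):
    """Union objs into pts[var]; return True if the set grew."""
    before = len(pts[var])
    pts[var] |= objs
    return len(pts[var]) != before


def andersen_points_to(constraints):
    """Toy Andersen-style solver, iterated to a fixed point.

    Each constraint is (kind, lhs, rhs) in a hypothetical format:
      ("addr",  p, x)  models  p = &x
      ("copy",  p, q)  models  p = q
      ("load",  p, q)  models  p = *q
      ("store", p, q)  models  *p = q
    """
    pts = defaultdict(set)
    changed = True
    while changed:
        changed = False
        for kind, lhs, rhs in constraints:
            if kind == "addr":
                changed |= _add(pts, lhs, {rhs})
            elif kind == "copy":
                changed |= _add(pts, lhs, pts[rhs])
            elif kind == "load":
                for obj in list(pts[rhs]):
                    changed |= _add(pts, lhs, pts[obj])
            elif kind == "store":
                for obj in list(pts[lhs]):
                    changed |= _add(pts, obj, pts[rhs])
    return pts


def classify_objects(accesses, pts):
    """Label each object using the abstract's definition: shared means
    accessed by at least two threads with at least one write.

    accesses: list of (thread_id, op, pointer), op in {"load", "store"}.
    """
    threads = defaultdict(set)
    written = set()
    for thread, op, ptr in accesses:
        for obj in pts[ptr]:
            threads[obj].add(thread)
            if op == "store":
                written.add(obj)
    return {
        obj: "shared" if len(t) >= 2 and obj in written else "private"
        for obj, t in threads.items()
    }


def private_instructions(accesses, pts, labels):
    """Mark an access private only if every object it may touch is private.
    An access with an empty points-to set is conservatively not private."""
    return [
        bool(pts[ptr]) and all(labels[o] == "private" for o in pts[ptr])
        for _, _, ptr in accesses
    ]


# Hypothetical program: 'buf' is thread-local, 'counter' is written by
# thread 0 and read by thread 1, 'table' is read by both threads.
constraints = [
    ("addr", "p", "buf"),
    ("addr", "q", "counter"),
    ("copy", "r", "q"),
    ("addr", "t", "table"),
    ("copy", "u", "t"),
]
accesses = [
    (0, "store", "p"),
    (0, "store", "q"),
    (1, "load", "r"),
    (0, "load", "t"),
    (1, "load", "u"),
]

pts = andersen_points_to(constraints)
labels = classify_objects(accesses, pts)
flags = private_instructions(accesses, pts, labels)
```

In this sketch only the two accesses to `counter` would still need the coherency protocol. The accesses to `buf` and to the read-only `table` are marked private, which matches the abstract's treatment of read-only data as private. A real compiler pass would work on an intermediate representation and would have to stay conservative wherever the pointer analysis is imprecise.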

