Analysis of shared memory misses and reference patterns

Proceedings 2000 International Conference on Computer Design Pub Date : 2000-09-17 DOI:10.1109/ICCD.2000.878285

J. Rothman, A. Smith


Citation count: 3

Abstract

Shared bus computer systems permit the relatively simple and efficient implementation of cache consistency algorithms, but the shared bus is a bottleneck that limits performance. False sharing can be an important source of unnecessary traffic for invalidation-based protocols, and eliminating it can provide significant performance improvements. For many multiprocessor workloads, however, most misses are true sharing plus cold start misses. Regardless of the cause of cache misses, the largest fraction of bus traffic consists of words transferred between caches without being accessed, which we refer to as dead sharing. We establish here new methods for characterizing cache block reference patterns, and we measure how these patterns change with variation in workload and block size. Our results show that 42 percent of 64-byte cache blocks are invalidated before more than one word has been read from the block, and that 58 percent of blocks that have been modified have only a single word modified before an invalidation to the block occurs. Approximately 50 percent of blocks written and subsequently read by other caches show no use of the newly written information before the block is again invalidated. In addition to our general analysis of reference patterns, we also present a detailed analysis of dead sharing for each shared memory multiprocessor program studied. We find that the worst 10 blocks (those with the most total misses) from each of our traces contribute, on average, almost 50 percent of the false sharing misses and almost 20 percent of the true sharing misses. A relatively simple restructuring of four of our workloads based on analysis of these 10 worst blocks leads to a 21 percent reduction in overall misses and a 15 percent reduction in execution time. Permitting the block size to vary (as could be accomplished with a sector cache) shows that bus traffic can be reduced by 88 percent (for 64-byte blocks) while also decreasing the miss ratio by 35 percent.
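To make the notion of dead sharing concrete, the following is a minimal illustrative sketch, not the authors' tracing methodology. It assumes infinite per-processor caches, a simple write-invalidate protocol, and a hypothetical trace format of `(cpu, op, byte_address)` tuples; the function name and parameters are invented for illustration. For simplicity it counts every block fill, including cold start fills, as transferred words. It reports the fraction of transferred words that were never accessed before the block was invalidated or the trace ended.

```python
from collections import defaultdict


def dead_sharing_fraction(trace, block_bytes=64, word_bytes=4):
    """Fraction of words moved into caches that are never accessed.

    Assumptions (illustrative only): infinite caches, write-invalidate
    coherence, and a trace of (cpu, op, byte_address) with op "R" or "W".
    """
    words_per_block = block_bytes // word_bytes
    # cpu -> {block number: set of word indices touched while resident}
    resident = defaultdict(dict)
    transferred = 0    # words brought into some cache on a miss
    touched_total = 0  # words actually accessed during a residency

    def retire(cpu, block):
        # A residency ends (invalidation or end of trace): tally its use.
        nonlocal touched_total
        touched_total += len(resident[cpu].pop(block))

    for cpu, op, addr in trace:
        block = addr // block_bytes
        word = (addr % block_bytes) // word_bytes
        if block not in resident[cpu]:
            # Miss: the whole block is transferred into this cache.
            resident[cpu][block] = set()
            transferred += words_per_block
        resident[cpu][block].add(word)
        if op == "W":
            # A write invalidates every other cached copy of the block.
            for other in list(resident):
                if other != cpu and block in resident[other]:
                    retire(other, block)

    # Close out residencies still live at the end of the trace.
    for cpu in list(resident):
        for block in list(resident[cpu]):
            retire(cpu, block)

    if transferred == 0:
        return 0.0
    return 1.0 - touched_total / transferred


# CPU 0 repeatedly writes word 0 while CPU 1 reads word 1 of the same
# 64-byte block: a false sharing pattern in which most words are dead.
trace = [(0, "W", 0), (1, "R", 4), (0, "W", 0), (1, "R", 4)]
print(dead_sharing_fraction(trace))                  # 0.9375
print(dead_sharing_fraction(trace, block_bytes=4))   # 0.0
```

In this toy trace, shrinking the block to a single word removes both the invalidation misses and the dead words, which mirrors in miniature the abstract's observation that a variable block size can reduce bus traffic.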

