Modeling and Single-Pass Simulation of CMP Cache Capacity and Accessibility

2007 IEEE International Symposium on Performance Analysis of Systems & Software. Pub Date: 2007-04-25. DOI: 10.1109/ISPASS.2007.363743

Xudong Shi, Feiqi Su, J. Peir, Ye Xia, Zhen Yang


Citations: 6

Abstract

Future chip multiprocessors (CMPs) with a large number of cores face difficult issues in efficiently utilizing on-chip storage space. Tradeoffs between data accessibility and effective on-chip capacity have been studied extensively, but understanding a wide spectrum of design spaces requires costly simulations. In this paper, we first develop an abstract model for understanding the performance impact of the degree of data replication. To overcome the lack of real-time interactions among multiple cores in the abstract model, we propose an efficient single-pass stack simulation method to study the performance of a variety of cache organizations on CMPs. The proposed global stack logically incorporates a shared stack and per-core private stacks to collect shared and private reuse (stack) distances for every memory reference in a single simulation pass. With the collected reuse distances, performance in terms of hits/misses and average memory access times can be calculated for multiple cache organizations. The basic stack simulation results can further be used to derive results for other CMP cache organizations with various degrees of data replication. Using a set of commercial multithreaded workloads, we verify both the modeling and the stack results against individual execution-driven simulations that consider realistic cache parameters and delays. We also measure the simulation time saved by stack simulation. The results show that stack simulation can accurately model the performance of the studied cache organizations with 2-9% error margins using only about 8% of the simulation time. The results also show that the effectiveness of various techniques for optimizing CMP on-chip storage is closely related to the working sets of the workloads as well as the total cache sizes.
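The single-pass idea in the abstract can be illustrated with a minimal sketch of classic LRU stack-distance simulation. This is not the authors' global-stack implementation: it assumes fully associative LRU caches, keeps the shared stack and the per-core private stacks as separate lists, and ignores coherence invalidations. The names `stack_distance`, `single_pass`, `hits`, and `amat`, the trace format, and the latency values are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def stack_distance(stack, block):
    """Return the 1-based LRU depth of `block` (None on first touch)
    and move the block to the top of `stack`."""
    try:
        idx = stack.index(block)
        stack.pop(idx)
        depth = idx + 1
    except ValueError:
        depth = None  # cold (compulsory) miss
    stack.insert(0, block)
    return depth


def single_pass(trace, num_cores):
    """Walk the trace once and collect shared and private reuse-distance
    histograms. `trace` is a list of (core_id, block_address) pairs."""
    shared = []                                 # one stack seen by all cores
    private = [[] for _ in range(num_cores)]    # one stack per core
    shared_hist, private_hist = Counter(), Counter()
    for core, block in trace:
        shared_hist[stack_distance(shared, block)] += 1
        private_hist[stack_distance(private[core], block)] += 1
    return shared_hist, private_hist


def hits(hist, capacity_blocks):
    """A reference hits in an LRU cache of `capacity_blocks` blocks
    exactly when its stack distance is at most the capacity."""
    return sum(n for d, n in hist.items()
               if d is not None and d <= capacity_blocks)


def amat(hist, capacity_blocks, hit_latency, miss_latency):
    """Average access time for one cache size; latencies are assumed inputs."""
    total = sum(hist.values())
    h = hits(hist, capacity_blocks)
    return (h * hit_latency + (total - h) * miss_latency) / total


if __name__ == "__main__":
    # Tiny hypothetical trace: two cores touching blocks "A" and "B".
    demo = [(0, "A"), (1, "B"), (0, "A"), (1, "A"), (0, "B"), (1, "B")]
    shared_hist, private_hist = single_pass(demo, num_cores=2)
    for size in (1, 2):
        print(size, hits(shared_hist, size), hits(private_hist, size))
```

Because the histograms are collected once, hit counts and access times for any number of cache sizes come from re-reading the histograms rather than re-running the trace, which is where the simulation time saving comes from.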


