NSL-BLRL: efficient cache warmup for sampled processor simulation

39th Annual Simulation Symposium (ANSS'06). Pub Date: 2006-04-02. DOI: 10.1109/ANSS.2006.33

Luk Van Ertvelde, Filip Hellebaut, L. Eeckhout, K. D. Bosschere


Citations: 17

Abstract

Architectural simulation is extremely time-consuming given the huge number of instructions that must be simulated for contemporary benchmarks. Sampled simulation, which selects a number of samples from the complete benchmark execution, yields substantial speedups. However, one major issue must be dealt with in order to minimize non-sampling bias, namely the hardware state at the beginning of each sample. This is well known in the literature as the cold-start problem. The hardware structures that suffer most from the cold-start problem are cache hierarchies. In this paper, we propose NSL-BLRL, which combines two previously proposed cache hierarchy warmup approaches, namely no-state-loss (NSL) and boundary line reuse latency (BLRL). The idea of NSL-BLRL is to warm up the cache hierarchy using a hardware state checkpoint that stores a truncated NSL stream. The NSL stream is a least-recently-used stream of (unique) memory references in the pre-sample. This NSL stream is then truncated to form the NSL-BLRL warmup checkpoint; this is done by inspecting the sample to determine how far back in the pre-sample one needs to go to accurately warm up the hardware state for the given sample. Using SPEC CPU2000 benchmarks, we show that NSL-BLRL (i) is nearly as accurate as BLRL and NSL for sampled processor simulation, (ii) yields simulation time speedups of several orders of magnitude compared to BLRL, and (iii) is more space-efficient than NSL. As such, we conclude that NSL-BLRL is a highly efficient and accurate cache warmup strategy for sampled processor simulation.
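
The two steps named in the abstract, building an LRU-ordered stream of unique pre-sample references and then truncating it based on what the sample reuses, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `nsl_stream` and `truncate_stream`, the `coverage` parameter, and the simplified distance measure (how far back from the pre-sample/sample boundary a line was last touched) are all assumptions made for illustration; the paper's exact BLRL truncation rule is not given in the abstract.

```python
import math

def nsl_stream(pre_sample):
    """Return (address, last_position) pairs for the unique addresses in the
    pre-sample, ordered from least recently used to most recently used."""
    last_pos = {}
    for pos, addr in enumerate(pre_sample):
        last_pos.pop(addr, None)  # drop the older occurrence, if any
        last_pos[addr] = pos      # re-insert at the most-recently-used end
    return list(last_pos.items())

def truncate_stream(stream, pre_len, sample, coverage=0.9):
    """Keep only the tail of the stream needed to cover a fraction `coverage`
    of the pre-sample lines that the sample reuses (assumed truncation rule)."""
    last_pos = dict(stream)
    # Distance back from the boundary for each pre-sample line the sample reuses.
    distances = sorted(pre_len - last_pos[a] for a in set(sample) if a in last_pos)
    if not distances:
        return []  # the sample reuses nothing from the pre-sample
    k = math.ceil(coverage * len(distances))  # number of reuses to cover
    cutoff = distances[k - 1]
    return [(a, p) for a, p in stream if pre_len - p <= cutoff]

# Hypothetical addresses, chosen only for the example.
A, B, C, D, E = 0x10, 0x20, 0x30, 0x40, 0x50
pre = [A, B, A, C, D, B]
sample = [B, D, E]

stream = nsl_stream(pre)
checkpoint = truncate_stream(stream, len(pre), sample, coverage=1.0)
```

Replaying `checkpoint` in order into a cache model before the sample starts would leave the most recently used lines in the most-recently-used positions, which is the property the LRU ordering is meant to preserve. Lowering `coverage` shortens the checkpoint at the cost of leaving some reused lines cold.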
