ZSim: fast and accurate microarchitectural simulation of thousand-core systems

Proceedings of the 40th Annual International Symposium on Computer Architecture. Pub Date: 2013-06-23. DOI: 10.1145/2485922.2485963

Daniel Sánchez, C. Kozyrakis


Citations: 514

Abstract

Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by allowing event reordering and using simplistic contention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial. We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instruction-driven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores. We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system.
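To put the quoted speeds in perspective, the short calculation below splits the aggregate rates into per-core figures. It is a back-of-the-envelope sketch, not part of the paper: it assumes the MIPS numbers are aggregate rates summed over all modeled cores, and the workload size (`INSTRS_PER_MODELED_CORE`) is a hypothetical value chosen only for illustration. Because the abstract states the speeds as "up to", the resulting run times are best-case estimates.

```python
# Peak figures quoted in the abstract ("up to", so best-case values).
HOST_CORES = 16
MODELED_CORES = 1024
PEAK_MIPS = {"simple": 1500.0, "detailed OOO": 300.0}

# Hypothetical workload size, chosen only for illustration (not from the paper).
INSTRS_PER_MODELED_CORE = 1_000_000_000


def breakdown(aggregate_mips):
    """Split an aggregate simulation rate into per-core rates and a run time.

    Assumes aggregate_mips counts simulated instructions across all modeled
    cores per second of host wall-clock time.
    """
    total_instrs = INSTRS_PER_MODELED_CORE * MODELED_CORES
    return {
        "mips_per_modeled_core": aggregate_mips / MODELED_CORES,
        "mips_per_host_core": aggregate_mips / HOST_CORES,
        "seconds_for_workload": total_instrs / (aggregate_mips * 1e6),
    }


for name, mips in PEAK_MIPS.items():
    b = breakdown(mips)
    print(
        f"{name}: {b['mips_per_modeled_core']:.2f} MIPS per modeled core, "
        f"{b['mips_per_host_core']:.2f} MIPS per host core, "
        f"{b['seconds_for_workload'] / 60:.1f} min for the workload"
    )
```

Under these assumptions, each modeled core advances at roughly 1.5 MIPS with simple cores and roughly 0.3 MIPS with detailed OOO cores, which is why the aggregate rate matters more than per-core speed when judging a thousand-core simulator.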
