{"title":"提高现代共享内存处理器内存系统仿真的精度","authors":"Yuetsu Kodama, Tetsuya Odajima, A. Asato, M. Sato","doi":"10.1145/3368474.3368483","DOIUrl":null,"url":null,"abstract":"For the purpose of developing applications for supercomputer Fugaku at an early stage, RIKEN has developed a processor simulator. This simulator is based on the general-purpose processor simulator gem5. It does not simulate the actual hardware of a Fugaku processor. However, we believe that sufficient simulation accuracy can be obtained since it simulates the instruction pipeline of out-of-order execution with cycle-level accuracy along with performing detailed parameter tuning of out-of-order resources. In order to estimate the accurate execution time of a program, it is necessary to simulate with accuracy not only the instruction execution time, but also the access time of the cache memory hierarchy. Therefore, in the RIKEN simulator, we expanded gem5 to match the performance of the cache memory hierarchy to that of a Fugaku processor. In this simulator, we aim to estimate the execution cycles of one node application on a Fugaku processor with accuracy that enables relative evaluation and application tuning. In this paper, we show the details of the implementation of this simulator and verify its accuracy compared with that of a Fugaku processor test chip. In the evaluation of the total 46 kernel benchmarks, it was confirmed that the difference is 13% or less for 85% of the kernels. In the multithreaded execution of Stream Triad benchmark, scalable performance according to the number of threads was confirmed, and achieved over 80% of memory throughput with enough accuracy.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Accuracy Improvement of Memory System Simulation for Modern Shared Memory Processor\",\"authors\":\"Yuetsu Kodama, Tetsuya Odajima, A. Asato, M. Sato\",\"doi\":\"10.1145/3368474.3368483\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For the purpose of developing applications for supercomputer Fugaku at an early stage, RIKEN has developed a processor simulator. This simulator is based on the general-purpose processor simulator gem5. It does not simulate the actual hardware of a Fugaku processor. However, we believe that sufficient simulation accuracy can be obtained since it simulates the instruction pipeline of out-of-order execution with cycle-level accuracy along with performing detailed parameter tuning of out-of-order resources. In order to estimate the accurate execution time of a program, it is necessary to simulate with accuracy not only the instruction execution time, but also the access time of the cache memory hierarchy. Therefore, in the RIKEN simulator, we expanded gem5 to match the performance of the cache memory hierarchy to that of a Fugaku processor. In this simulator, we aim to estimate the execution cycles of one node application on a Fugaku processor with accuracy that enables relative evaluation and application tuning. In this paper, we show the details of the implementation of this simulator and verify its accuracy compared with that of a Fugaku processor test chip. In the evaluation of the total 46 kernel benchmarks, it was confirmed that the difference is 13% or less for 85% of the kernels. In the multithreaded execution of Stream Triad benchmark, scalable performance according to the number of threads was confirmed, and achieved over 80% of memory throughput with enough accuracy.\",\"PeriodicalId\":314778,\"journal\":{\"name\":\"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-01-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3368474.3368483\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3368474.3368483","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Accuracy Improvement of Memory System Simulation for Modern Shared Memory Processor
For the purpose of developing applications for supercomputer Fugaku at an early stage, RIKEN has developed a processor simulator. This simulator is based on the general-purpose processor simulator gem5. It does not simulate the actual hardware of a Fugaku processor. However, we believe that sufficient simulation accuracy can be obtained since it simulates the instruction pipeline of out-of-order execution with cycle-level accuracy along with performing detailed parameter tuning of out-of-order resources. In order to estimate the accurate execution time of a program, it is necessary to simulate with accuracy not only the instruction execution time, but also the access time of the cache memory hierarchy. Therefore, in the RIKEN simulator, we expanded gem5 to match the performance of the cache memory hierarchy to that of a Fugaku processor. In this simulator, we aim to estimate the execution cycles of one node application on a Fugaku processor with accuracy that enables relative evaluation and application tuning. In this paper, we show the details of the implementation of this simulator and verify its accuracy compared with that of a Fugaku processor test chip. In the evaluation of the total 46 kernel benchmarks, it was confirmed that the difference is 13% or less for 85% of the kernels. In the multithreaded execution of Stream Triad benchmark, scalable performance according to the number of threads was confirmed, and achieved over 80% of memory throughput with enough accuracy.