应用程序内存带宽和内存访问延迟的正确测量

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI:10.1145/3368474.3368476

Christian Helm, K. Taura

{"title":"应用程序内存带宽和内存访问延迟的正确测量","authors":"Christian Helm, K. Taura","doi":"10.1145/3368474.3368476","DOIUrl":null,"url":null,"abstract":"Diagnosing if an application suffers from DRAM contention can be a challenging task. One method is to compare the hardware memory bandwidth limit with the measured memory bandwidth of an application. Another method is based on memory access latency. The latency of a DRAM access in an uncontended state is a hardware characteristic. If an application shows higher DRAM access latency, the increase comes from queuing delays and the application is limited by DRAM bandwidth. Hardware-based measurement of the application's latency and bandwidth can be done with low-overhead and is agnostic of the application's implementation. But the practical implementation of such a diagnosis system on CPUs is difficult. In modern CPUs, there is an abundance of performance counters and only superficial documentation. Different types of counters for bandwidth or latency, that seemingly measure the same thing, produce different results. There is no in-depth understanding of those performance counters and naive usage may lead to incorrect measurements. Because there is no hardware feature to measure DRAM access latency directly, the implementation of the above-mentioned latency based method may seem impossible. In this paper, we compare various hardware latency and bandwidth measurement methods on CPUs by using micro-benchmarks. We show results of Intel Haswell, Broadwell and Skylake systems. With our experiments, we show how and why performance counters for bandwidth and latency differ. Only the counters inside of the memory controller correctly measure bandwidth. Latency measured by instruction sampling is suitable to find DRAM contention, even though it is not a pure DRAM access latency.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency\",\"authors\":\"Christian Helm, K. Taura\",\"doi\":\"10.1145/3368474.3368476\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Diagnosing if an application suffers from DRAM contention can be a challenging task. One method is to compare the hardware memory bandwidth limit with the measured memory bandwidth of an application. Another method is based on memory access latency. The latency of a DRAM access in an uncontended state is a hardware characteristic. If an application shows higher DRAM access latency, the increase comes from queuing delays and the application is limited by DRAM bandwidth. Hardware-based measurement of the application's latency and bandwidth can be done with low-overhead and is agnostic of the application's implementation. But the practical implementation of such a diagnosis system on CPUs is difficult. In modern CPUs, there is an abundance of performance counters and only superficial documentation. Different types of counters for bandwidth or latency, that seemingly measure the same thing, produce different results. There is no in-depth understanding of those performance counters and naive usage may lead to incorrect measurements. Because there is no hardware feature to measure DRAM access latency directly, the implementation of the above-mentioned latency based method may seem impossible. In this paper, we compare various hardware latency and bandwidth measurement methods on CPUs by using micro-benchmarks. We show results of Intel Haswell, Broadwell and Skylake systems. With our experiments, we show how and why performance counters for bandwidth and latency differ. Only the counters inside of the memory controller correctly measure bandwidth. Latency measured by instruction sampling is suitable to find DRAM contention, even though it is not a pure DRAM access latency.\",\"PeriodicalId\":314778,\"journal\":{\"name\":\"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-01-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3368474.3368476\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3368474.3368476","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

诊断应用程序是否存在DRAM争用可能是一项具有挑战性的任务。一种方法是将硬件内存带宽限制与应用程序的测量内存带宽进行比较。另一种方法是基于内存访问延迟。在无争用状态下访问DRAM的延迟是一种硬件特性。如果应用程序显示较高的DRAM访问延迟，则增加来自排队延迟，并且应用程序受到DRAM带宽的限制。基于硬件的应用程序延迟和带宽测量可以以低开销完成，并且与应用程序的实现无关。但是这种诊断系统在cpu上的实际实现比较困难。在现代cpu中，有大量的性能计数器，但只有肤浅的文档。不同类型的带宽或延迟计数器，虽然看起来测量的是同一件事，但却产生了不同的结果。对这些性能计数器没有深入的了解，天真的使用可能导致不正确的测量。由于没有直接测量DRAM访问延迟的硬件功能，因此上述基于延迟的方法的实现似乎是不可能的。在本文中，我们通过微基准测试比较了cpu上各种硬件延迟和带宽测量方法。我们展示了英特尔Haswell, Broadwell和Skylake系统的结果。通过我们的实验，我们展示了带宽和延迟的性能计数器如何以及为什么不同。只有内存控制器内部的计数器才能正确测量带宽。通过指令采样测量的延迟适合于发现DRAM争用，即使它不是纯粹的DRAM访问延迟。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency

Diagnosing if an application suffers from DRAM contention can be a challenging task. One method is to compare the hardware memory bandwidth limit with the measured memory bandwidth of an application. Another method is based on memory access latency. The latency of a DRAM access in an uncontended state is a hardware characteristic. If an application shows higher DRAM access latency, the increase comes from queuing delays and the application is limited by DRAM bandwidth. Hardware-based measurement of the application's latency and bandwidth can be done with low-overhead and is agnostic of the application's implementation. But the practical implementation of such a diagnosis system on CPUs is difficult. In modern CPUs, there is an abundance of performance counters and only superficial documentation. Different types of counters for bandwidth or latency, that seemingly measure the same thing, produce different results. There is no in-depth understanding of those performance counters and naive usage may lead to incorrect measurements. Because there is no hardware feature to measure DRAM access latency directly, the implementation of the above-mentioned latency based method may seem impossible. In this paper, we compare various hardware latency and bandwidth measurement methods on CPUs by using micro-benchmarks. We show results of Intel Haswell, Broadwell and Skylake systems. With our experiments, we show how and why performance counters for bandwidth and latency differ. Only the counters inside of the memory controller correctly measure bandwidth. Latency measured by instruction sampling is suitable to find DRAM contention, even though it is not a pure DRAM access latency.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

自引率

0.00%

发文量