Extended performance accounting using Valgrind tool

PROBLEMS IN PROGRAMMING. Pub Date: 2021-06-01. DOI: 10.15407/pp2021.02.054

D. Rahozin, A. Doroshenko

Abstract

Modern workloads, parallel or sequential, are usually limited by memory and computing performance. A common way to improve workload performance is to use complex functional units or coprocessors, which not only accelerate computation but also fetch data from memory independently, generating complex address patterns with or without support for control flow operations. Optimizing compilers usually do not target such coprocessors, so they must be used by hand through special application interfaces. Memory bottlenecks, in turn, can be avoided with proper use of processor prefetch capabilities, which load the necessary data ahead of the time it is actually used; prefetching is likewise applied automatically only in simple cases, so programmers usually have to insert it by hand. As workloads migrate rapidly to embedded applications, the problem arises of how to use all hardware capabilities to speed up a workload with moderate effort. This requires precise analysis of memory access patterns at program run time and identification of the hot spots where most memory accesses are issued. A precise memory access model can be analyzed with simulators such as Valgrind, which can run really large workloads, for example neural network inference, in reasonable time. However, simulators and hardware performance analyzers fail to attribute the full number of memory references and cache misses to particular modules, because this requires analysis of the program call graph. We extend the cache simulator of the Valgrind tool so that it accounts memory accesses per software module and renders a realistic distribution of hot spots in a program. In addition, analysis of address sequences in the simulator makes it possible to recover array access patterns and to propose effective prefetching schemes. Motivating examples are provided to illustrate the use of the Valgrind tool.
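The two analyses named in the abstract, per-module accounting of memory accesses and recovery of array access patterns from address sequences, can be illustrated with a minimal Python sketch. This is not the authors' Valgrind extension: the trace format (a list of `(module, address)` pairs), the function names, and the sample data are all assumptions made for illustration, and a real cache simulator would also model cache lines and misses.

```python
from collections import Counter, defaultdict


def account_by_module(trace):
    """Count memory accesses per module from (module, address) records."""
    counts = Counter()
    for module, _addr in trace:
        counts[module] += 1
    return counts


def dominant_stride(addresses):
    """Return (stride, share) for the most frequent difference between
    consecutive addresses, or None if fewer than two addresses are given."""
    if len(addresses) < 2:
        return None
    deltas = Counter(b - a for a, b in zip(addresses, addresses[1:]))
    stride, hits = deltas.most_common(1)[0]
    return stride, hits / (len(addresses) - 1)


def strides_by_module(trace):
    """Group addresses by module and report the dominant stride of each."""
    per_module = defaultdict(list)
    for module, addr in trace:
        per_module[module].append(addr)
    return {m: dominant_stride(a) for m, a in per_module.items()}


# Hypothetical trace: "conv" walks an array of 8-byte elements,
# "misc" touches a few scattered addresses.
trace = [("conv", 0x1000 + 8 * i) for i in range(16)]
trace += [("misc", a) for a in (0x9000, 0x4010, 0x7728, 0x4010)]

if __name__ == "__main__":
    print(account_by_module(trace))
    print(strides_by_module(trace))
```

A module whose accesses show a single dominant stride with a share close to 1 behaves like a regular array walk and is a natural candidate for prefetching a few elements ahead; a module with no dominant stride would need a different scheme.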
