A tool to analyze the performance of multithreaded programs on NUMA architectures

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming
Pub Date: 2014-02-06
DOI: 10.1145/2555243.2555271

Xu Liu, J. Mellor-Crummey


Citations: 69

Abstract

Almost all of today's microprocessors contain memory controllers and directly attach to memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is faster for a microprocessor to access memory that is directly attached than it is to access memory attached to another processor. Without careful distribution of computation and data, a multithreaded program running on such a system may have high average memory access latency. To use multiprocessor systems efficiently, programmers need performance tools to guide the design of NUMA-aware codes. To address this need, we enhanced the HPCToolkit performance tools to support measurement and analysis of performance problems on multiprocessor systems with multiple NUMA domains. With these extensions, HPCToolkit helps pinpoint, quantify, and analyze NUMA bottlenecks in executions of multithreaded programs. It computes derived metrics to assess the severity of bottlenecks, analyzes memory accesses, and provides a wealth of information to guide NUMA optimization, including information about how to distribute data to reduce access latency and minimize contention. This paper describes the design and implementation of our extensions to HPCToolkit. We demonstrate their utility by describing case studies in which we use these capabilities to diagnose NUMA bottlenecks in four multithreaded applications.
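The abstract mentions derived metrics that assess the severity of NUMA bottlenecks but does not define them. As a rough illustration only, the sketch below computes a few plausible quantities from sampled memory accesses: the fraction of accesses that are remote, the average latency per access, the share of total latency attributable to remote accesses, and the number of requests served by each NUMA domain (a crude indicator of contention). The `AccessSample` record, the `summarize` function, and the metric names are assumptions made for this example; they are not HPCToolkit's data model or the metrics defined in the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessSample:
    """One sampled memory access (hypothetical record layout)."""
    thread_domain: int   # NUMA domain of the core that issued the access
    data_domain: int     # NUMA domain whose memory holds the accessed address
    latency_cycles: int  # measured latency of this access, in cycles


def summarize(samples):
    """Return illustrative NUMA metrics for a list of AccessSample records."""
    if not samples:
        raise ValueError("at least one sample is required")

    # An access is remote when the data lives in a different domain
    # than the thread that touched it.
    remote = [s for s in samples if s.thread_domain != s.data_domain]
    total_latency = sum(s.latency_cycles for s in samples)
    remote_latency = sum(s.latency_cycles for s in remote)

    return {
        "remote_fraction": len(remote) / len(samples),
        "avg_latency_cycles": total_latency / len(samples),
        # Guard against division by zero if every latency is zero.
        "remote_latency_share": (
            remote_latency / total_latency if total_latency else 0.0
        ),
        # Requests served per domain; a heavily skewed count suggests
        # that one memory controller may be a contention hot spot.
        "requests_per_domain": dict(Counter(s.data_domain for s in samples)),
    }


if __name__ == "__main__":
    demo = [
        AccessSample(thread_domain=0, data_domain=0, latency_cycles=100),
        AccessSample(thread_domain=0, data_domain=1, latency_cycles=300),
        AccessSample(thread_domain=1, data_domain=1, latency_cycles=100),
        AccessSample(thread_domain=1, data_domain=1, latency_cycles=100),
    ]
    for name, value in summarize(demo).items():
        print(f"{name}: {value}")
```

In a summary like this, a high remote fraction together with a large remote latency share would suggest that data should be placed closer to the threads that use it, while a skewed per-domain request count would suggest spreading data across domains to relieve a single memory controller.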

