Profiling Directed NUMA Optimization on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code

2011 IEEE International Parallel & Distributed Processing Symposium. Pub Date: 2011-05-16. DOI: 10.1109/IPDPS.2011.100

Rui Yang, J. Antony, Alistair P. Rendell, D. Robson, P. Strazdins

Citations: 17

Abstract

Profiling Directed NUMA Optimization on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code

The parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly influenced by the relative placement of memory pages to the threads that access them. As a consequence there are Linux application programmer interfaces (APIs) to control this. For large parallel codes it can, however, be difficult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind profiling tool which can be used to simplify this process. It extends the Valgrind binary translation framework to include a model which incorporates cache coherency, memory locality domains and interconnect traffic for arbitrary NUMA topologies. Using NUMAgrind, cache misses can be mapped to memory locality domains, page access modes determined, and pages that are referenced by multiple threads quickly determined. We show how the NUMAgrind tool can be used to guide the use of Linux memory and thread placement APIs in the Gaussian computational chemistry code. The performance of the code before and after use of these APIs is also presented for three different commodity NUMA platforms.
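
The kind of per-page analysis the abstract describes (mapping accesses to memory locality domains and finding pages referenced by multiple threads) can be illustrated with a toy sketch. This is not NUMAgrind's implementation, which the abstract does not detail: the function names, the trace format, the 4096-byte page size, and the two-node topology below are all assumptions made for illustration, and the sketch ignores caches and interconnect modelling entirely.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes


def classify_pages(trace, thread_node, page_node):
    """Summarize a memory-access trace per page (hypothetical helper).

    trace:       iterable of (thread_id, address) pairs
    thread_node: dict mapping thread_id -> NUMA node the thread runs on
    page_node:   dict mapping page number -> NUMA node holding the page
    """
    stats = defaultdict(lambda: {"threads": set(), "local": 0, "remote": 0})
    for thread_id, address in trace:
        page = address // PAGE_SIZE
        entry = stats[page]
        entry["threads"].add(thread_id)
        # An access is local when the thread's node also holds the page.
        if thread_node[thread_id] == page_node[page]:
            entry["local"] += 1
        else:
            entry["remote"] += 1
    return dict(stats)


def shared_pages(stats):
    """Return the sorted page numbers touched by more than one thread."""
    return sorted(p for p, s in stats.items() if len(s["threads"]) > 1)


# Hypothetical two-node machine: threads 0 and 1 on node 0, thread 2 on node 1.
thread_node = {0: 0, 1: 0, 2: 1}
# Assume all three pages were placed on node 0.
page_node = {0: 0, 1: 0, 2: 0}
trace = [(0, 0), (0, 100), (1, 4096), (2, 4100), (2, 8192), (2, 8200)]

stats = classify_pages(trace, thread_node, page_node)
print(shared_pages(stats))                    # pages touched by several threads
print(stats[2]["local"], stats[2]["remote"])  # local vs remote counts for page 2
```

In this made-up trace, page 2 is accessed only by a thread on the other node, which is the sort of finding that could suggest changing where the page or the thread is placed; page 1 is shared across nodes, so no single placement makes every access local.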

Source publication

2011 IEEE International Parallel & Distributed Processing Symposium
