Mapping applications for high performance on multithreaded, NUMA systems

ACM International Conference on Computing Frontiers. Pub Date: 2013-05-14. DOI: 10.1145/2482767.2482777

Guojing Cong, H. Wen


Citations: 3

Abstract

The communication latency and available resources for a group of logical processors are determined by their relative position in the hierarchy of chips, cores, and threads on modern shared-memory systems. Multithreaded applications exhibit different performance behavior depending on the mapping of software threads to logical processors. We observe that the execution time under one mapping can be 5.4 times that under another. Applications with irregular access patterns show the worst performance under the default OS mapping. Mapping alone does not reduce remote accesses on NUMA machines when the logical processors span multiple chips. We present new data replication and distribution optimizations for two irregular applications. We further show that locality optimization simultaneously reduces remote accesses and improves cache performance, and achieves better performance than prior NUMA-specific techniques.
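To make the idea of a thread-to-processor mapping concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the paper: the topology parameters, the function names, and the two placement policies ("compact" and "scatter", terms borrowed from common thread-affinity usage) are all assumptions. It enumerates logical processors as (chip, core, thread) positions in the hierarchy and shows that the same number of software threads can end up on one chip or spread over several, depending on the mapping.

```python
from itertools import product


def logical_processors(chips, cores_per_chip, threads_per_core):
    """Enumerate logical processors as (chip, core, thread) tuples.

    The topology is hypothetical; a real system would be queried
    through the OS rather than described by three integers.
    """
    return list(product(range(chips),
                        range(cores_per_chip),
                        range(threads_per_core)))


def compact_mapping(n_threads, chips, cores_per_chip, threads_per_core):
    """Assumed policy: fill the hardware threads of a core, then the
    next core on the same chip, and only then move to the next chip."""
    procs = logical_processors(chips, cores_per_chip, threads_per_core)
    if n_threads > len(procs):
        raise ValueError("more software threads than logical processors")
    return procs[:n_threads]


def scatter_mapping(n_threads, chips, cores_per_chip, threads_per_core):
    """Assumed policy: alternate between chips first, then advance
    cores, and use extra hardware threads per core last."""
    procs = logical_processors(chips, cores_per_chip, threads_per_core)
    if n_threads > len(procs):
        raise ValueError("more software threads than logical processors")
    # Order by (thread, core, chip) so the chip index varies fastest.
    procs.sort(key=lambda p: (p[2], p[1], p[0]))
    return procs[:n_threads]


def chips_spanned(mapping):
    """Number of distinct chips used by a mapping."""
    return len({chip for chip, _core, _thread in mapping})


if __name__ == "__main__":
    # Hypothetical machine: 2 chips, 4 cores per chip, 2 threads per core.
    for policy in (compact_mapping, scatter_mapping):
        mapping = policy(4, 2, 4, 2)
        print(policy.__name__, mapping, "chips:", chips_spanned(mapping))
```

On the assumed 2-chip machine, the compact placement keeps four software threads on a single chip, where they share cores and caches, while the scatter placement spans both chips. The second case is the one the abstract describes, where the logical processors span multiple chips and mapping alone does not reduce remote accesses.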
