kMAF: Automatic kernel-level management of thread and data affinity

2014 23rd International Conference on Parallel Architecture and Compilation (PACT). Pub Date: 2014-08-24. DOI: 10.1145/2628071.2628085

M. Diener, E. Cruz, P. Navaux, Anselm Busse, Hans-Ulrich Heiß


Citations: 55

Abstract

One of the main challenges for parallel architectures is the increasing complexity of the memory hierarchy, which consists of several levels of private and shared caches, as well as interconnections between separate memories in NUMA machines. To make full use of this hierarchy, it is necessary to improve the locality of memory accesses by reducing accesses to remote caches and memories, and using local ones instead. Two techniques can be used to increase the memory access locality: executing threads and processes that access shared data close to each other in the memory hierarchy (thread affinity), and placing the memory pages they access on the NUMA node they are executing on (data affinity). Most related work in this area focuses on either thread or data affinity, but not both, which limits the improvements. Other mechanisms require expensive operations, such as memory access traces or binary analysis, require changes to hardware or work only on specific parallel APIs. In this paper, we introduce kMAF, a mechanism that automatically manages thread and data affinity on the kernel level. The memory access behavior of the running application is determined during its execution by analyzing its page faults. This information is used by kMAF to migrate threads and memory pages, such that the overall memory access locality is optimized. Extensive evaluation with 27 benchmarks from 4 benchmark suites shows substantial performance improvements, with results close to an oracle mechanism. Execution time was reduced by up to 35.7% (13.8% on average), while energy efficiency was improved by up to 34.6% (9.3% on average).
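The abstract does not spell out kMAF's internal data structures, so the following is only a minimal user-space sketch of the general idea it describes: use page-fault events to estimate which threads share pages (for thread affinity) and which NUMA node touches each page most (for data affinity). The function names, the `(thread_id, page)` event format, and the majority-vote placement rule are illustrative assumptions, not the authors' kernel implementation.

```python
from collections import defaultdict


def analyze_faults(faults, num_threads):
    """Build a thread-sharing matrix and per-page fault counts.

    faults: iterable of (thread_id, page) tuples in time order
            (hypothetical event format, assumed for this sketch).
    Returns (comm, counts), where comm[i][j] estimates how often
    threads i and j touched the same page, and counts[page][tid]
    is the number of faults thread tid took on that page.
    """
    sharers = defaultdict(set)                      # page -> threads seen so far
    counts = defaultdict(lambda: defaultdict(int))  # page -> thread -> faults
    comm = [[0] * num_threads for _ in range(num_threads)]
    for tid, page in faults:
        # A fault on a page already touched by another thread is
        # counted as one sharing event between the two threads.
        for other in sharers[page]:
            if other != tid:
                comm[tid][other] += 1
                comm[other][tid] += 1
        sharers[page].add(tid)
        counts[page][tid] += 1
    return comm, counts


def place_pages(counts, thread_node):
    """Pick a NUMA node per page by majority of observed faults.

    thread_node: mapping thread_id -> NUMA node the thread runs on.
    Ties are broken in favour of the lowest node id (an arbitrary
    choice made for this sketch).
    """
    placement = {}
    for page, per_thread in counts.items():
        node_hits = defaultdict(int)
        for tid, n in per_thread.items():
            node_hits[thread_node[tid]] += n
        placement[page] = min(node_hits, key=lambda nd: (-node_hits[nd], nd))
    return placement
```

In this sketch, a high `comm[i][j]` value suggests scheduling threads `i` and `j` close to each other in the memory hierarchy, while `place_pages` suggests a target node for each page. A real kernel mechanism would additionally have to bound the cost of tracking faults and of the migrations themselves.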
