Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) | Pub Date: 2016-09-11 | DOI: 10.1145/2967938.2967946

Andi Drebes, Antoniu Pop, K. Heydemann, Albert Cohen, Nathalie Drach-Temam

Citations: 39

Abstract

Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5× higher performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.
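
The two placement ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the function names, the `(node_id, size_in_bytes)` representation of input dependences, and the tie-breaking rule are all assumptions made for illustration. It only shows the general shape of a best-effort heuristic that runs a task on the NUMA node holding most of its input data, and of allocating output data on the node of the executing core.

```python
from collections import defaultdict


def pick_node_for_task(input_buffers, default_node=0):
    """Return the NUMA node that holds the most input bytes for a task.

    input_buffers: iterable of (node_id, size_in_bytes) pairs, one per
    input dependence. This representation is hypothetical.
    """
    bytes_per_node = defaultdict(int)
    for node_id, size in input_buffers:
        bytes_per_node[node_id] += size
    if not bytes_per_node:
        # No inputs: nothing to be local to, fall back to a default node.
        return default_node
    # Prefer the node with the most bytes; break ties by lowest node id.
    return min(bytes_per_node, key=lambda n: (-bytes_per_node[n], n))


def place_output(executing_node):
    """Allocate output data on the node of the core that runs the task."""
    return executing_node


# Example: 250 bytes of input on node 0 and 300 bytes on node 1.
node = pick_node_for_task([(0, 100), (1, 300), (0, 150)])
output_node = place_output(node)
```

In this sketch input locality is only best effort (250 of the 550 input bytes remain remote), while output data is always local to the executing node, mirroring the distinction the abstract draws between the two schemes.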
