Integration framework for online thread throttling with thread and page mapping on NUMA systems

IF 3.4 · CAS Zone 3 (Computer Science) · JCR Q1: COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing Pub Date : 2025-07-04 DOI:10.1016/j.jpdc.2025.105145

Janaina Schwarzrock , Hiago Mayk G. de A. Rocha , Arthur F. Lorenzon , Samuel Xavier de Souza , Antonio Carlos S. Beck

Journal of Parallel and Distributed Computing, Volume 205, Article 105145 (Journal Article). URL: https://www.sciencedirect.com/science/article/pii/S0743731525001121

Citations: 0

Abstract

Non-Uniform Memory Access (NUMA) systems are prevalent in HPC, where optimal thread-to-core allocation and page placement are crucial for enhancing performance and minimizing energy usage. Moreover, because NUMA systems support a large number of hardware threads while many parallel applications have limited scalability, artificially decreasing the number of threads with Dynamic Concurrency Throttling (DCT) may bring further improvements. However, the optimal configuration (thread mapping, page mapping, number of threads) for energy and performance, quantified by the Energy-Delay Product (EDP), varies with the system hardware, application, and input set, and can even change during execution. Because of this dynamic nature, adaptability is essential, which makes offline strategies much less effective. Online strategies, although effective, add execution overhead: the cost of learning at run time and the cost of transitioning between configurations, which involves cache warm-ups and thread and data reallocation. Thus, balancing learning time against solution quality becomes increasingly significant. In this scenario, this work proposes a framework that integrates the search for such optimal configurations into a single, online, and efficient approach. Our experimental evaluation shows that the framework improves EDP and performance over state-of-the-art online techniques for thread/page mapping (by up to 69.3% and 43.4%, respectively) and for DCT (by up to 93.2% and 74.9%, respectively), while being fully adaptive and requiring minimal user intervention.
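To make the EDP criterion concrete, the following minimal sketch computes EDP as energy multiplied by execution time and picks the configuration with the lowest value. It is not the authors' framework: the `Config` fields, the mapping labels, and every measurement below are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One point in the search space named in the abstract (hypothetical fields)."""
    threads: int
    thread_mapping: str  # illustrative label, not taken from the paper
    page_mapping: str    # illustrative label, not taken from the paper


def edp(energy_joules: float, time_seconds: float) -> float:
    """Energy-Delay Product: energy multiplied by execution time (lower is better)."""
    return energy_joules * time_seconds


def best_config(measurements: dict) -> Config:
    """Return the configuration whose (energy, time) pair yields the lowest EDP."""
    return min(measurements, key=lambda cfg: edp(*measurements[cfg]))


# Made-up (energy in J, time in s) samples; NOT results from the paper.
measurements = {
    Config(64, "scatter", "interleave"): (1200.0, 10.0),
    Config(32, "compact", "first-touch"): (700.0, 11.0),
    Config(16, "compact", "first-touch"): (500.0, 18.0),
}

best = best_config(measurements)
print(best, edp(*measurements[best]))
```

In this made-up data the 32-thread configuration runs slightly slower than the 64-thread one but uses far less energy, so its EDP is lower. That is the kind of trade-off thread throttling is meant to exploit. An online approach would additionally have to account for the cost of measuring and switching between configurations, which this sketch ignores.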

Source journal

Journal of Parallel and Distributed Computing (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore

10.30

Self-citation rate

2.60%

Articles published

172

Review time

12 months

About the journal: This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing. The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics, again covering the full range from the design to the use of the targeted systems.