Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors

Juan Carlos Saez, Fernando Castro, Manuel Prieto-Matias
arXiv:2402.07664 — arXiv - CS - Operating Systems, published 2024-02-12. DOI: https://doi.org/arxiv-2402.07664

Abstract

Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that asymmetric designs may deliver higher energy efficiency than symmetric multicores for diverse workloads. Despite their benefits, AMPs pose significant challenges to runtime systems of parallel programming models. While previous work has mainly explored how to efficiently execute task-based parallel applications on AMPs, via enhancements in the runtime system, improving the performance of unmodified data-parallel applications on these architectures is still a big challenge. In this work we analyze the particular case of loop-based OpenMP applications, which are widely used today in scientific and engineering domains, and constitute the dominant application type in many parallel benchmark suites used for performance evaluation on multicore systems. We observed that conventional loop-scheduling OpenMP approaches are unable to efficiently cope with the load imbalance that naturally stems from the different performance delivered by big and small cores. To address this shortcoming, we propose \textit{Asymmetric Iteration Distribution} (AID), a set of novel loop-scheduling methods for AMPs that distribute iterations unevenly across worker threads to efficiently deal with performance asymmetry. We implemented AID in \textit{libgomp} --the GNU OpenMP runtime system--, and evaluated it on two different asymmetric multicore platforms. Our analysis reveals that the AID methods constitute effective replacements of the \texttt{static} and \texttt{dynamic} methods on AMPs, and are capable of improving performance over these conventional strategies by up to 56\% and 16.8\%, respectively.