Inter-Cluster Thread-to-Core Mapping and DVFS on Heterogeneous Multi-Cores

IEEE Transactions on Multi-Scale Computing Systems. Pub Date: 2017-09-26. DOI: 10.1109/TMSCS.2017.2755619

Basireddy Karunakar Reddy;Amit Kumar Singh;Dwaipayan Biswas;Geoff V. Merrett;Bashir M. Al-Hashimi

Volume 4, Issue 3, pp. 369-382. Journal Article. https://ieeexplore.ieee.org/document/8051086/

Citations: 54

Abstract

Heterogeneous multi-core platforms that contain different types of cores, organized as clusters, are emerging, e.g., ARM's big.LITTLE architecture. These platforms often need to handle multiple concurrently executing applications with different performance requirements. Because of resource sharing, this generates varying and mixed workloads (e.g., compute- and memory-intensive). Run-time management is required to adapt to such performance requirements and workload variability and to achieve energy efficiency. Moreover, management becomes challenging when the applications are multi-threaded and the heterogeneity needs to be exploited. Existing run-time management approaches do not efficiently exploit cores situated in different clusters simultaneously (referred to as inter-cluster exploitation) or the DVFS potential of cores; doing so is the aim of this paper. Such exploitation might help to satisfy the performance requirement while achieving energy savings at the same time. Therefore, we propose a run-time management approach that first selects a thread-to-core mapping based on the performance requirements and resource availability. Then, it applies online adaptation by adjusting the voltage-frequency (V-f) levels to optimize energy without trading off application performance. For thread-to-core mapping, offline profiled results are used, which contain the performance and energy characteristics of applications executed on the heterogeneous platform using different types of cores in various possible combinations. For an application, the thread-to-core mapping process defines the number and type of cores used, which are situated in different clusters. The online adaptation process classifies the inherent workload characteristics of concurrently executing applications, incurring a lower overhead than existing learning-based approaches, as demonstrated in this paper. Workload classification is performed using the metric Memory Reads Per Instruction (MRPI). The adaptation process proactively selects an appropriate V-f pair for a predicted workload. Subsequently, it monitors the workload prediction error and performance loss, quantified by instructions per second (IPS), and adjusts the chosen V-f to compensate. We validate the proposed run-time management approach on a hardware platform, the Odroid-XU3, with various combinations of multi-threaded applications from the PARSEC and SPLASH benchmarks. Results show an average improvement in energy efficiency of up to 33 percent compared to existing approaches while meeting the performance requirements.
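
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Profile` record, the MRPI thresholds, the frequency levels, and the one-step compensation rule are all hypothetical values chosen for illustration. It also assumes that a more memory-intensive workload (higher MRPI) tolerates a lower frequency, which the abstract does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One offline-profiled mapping option (all fields are illustrative)."""
    big: int       # number of big cores used
    little: int    # number of LITTLE cores used
    perf: float    # profiled performance, higher is better
    energy: float  # profiled energy, lower is better


def select_mapping(profiles, required_perf, free_big, free_little):
    """Return the lowest-energy profiled mapping that meets the performance
    requirement and fits the free cores, or None if nothing qualifies."""
    feasible = [
        p for p in profiles
        if p.perf >= required_perf and p.big <= free_big and p.little <= free_little
    ]
    return min(feasible, key=lambda p: p.energy) if feasible else None


def mrpi(memory_reads, instructions):
    """Memory Reads Per Instruction over one sampling interval."""
    return memory_reads / instructions if instructions else 0.0


# Hypothetical frequency levels (MHz) and MRPI class boundaries.
FREQ_LEVELS = [800, 1200, 1600, 2000]
MRPI_CLASSES = [(0.01, 2000), (0.03, 1600), (0.06, 1200)]


def choose_freq(predicted_mrpi):
    """Map a predicted MRPI to a frequency: lower MRPI (compute-bound)
    gets a higher frequency under this sketch's assumption."""
    for upper_bound, freq in MRPI_CLASSES:
        if predicted_mrpi < upper_bound:
            return freq
    return FREQ_LEVELS[0]


def compensate(freq, measured_ips, target_ips):
    """Step up one frequency level if measured IPS misses the target."""
    if measured_ips < target_ips:
        idx = FREQ_LEVELS.index(freq)
        return FREQ_LEVELS[min(idx + 1, len(FREQ_LEVELS) - 1)]
    return freq


profiles = [
    Profile(big=4, little=0, perf=10.0, energy=50.0),
    Profile(big=2, little=2, perf=9.0, energy=35.0),
    Profile(big=0, little=4, perf=5.0, energy=20.0),
]
chosen = select_mapping(profiles, required_perf=8.0, free_big=4, free_little=4)
freq = choose_freq(mrpi(memory_reads=50, instructions=1000))
freq_after = compensate(freq, measured_ips=1.0e9, target_ips=2.0e9)
```

Here the inter-cluster option (2 big + 2 LITTLE) is selected because it is the cheapest profiled mapping that still meets the requirement. The frequency is then raised one level when the measured IPS falls short of the target.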

