Optimizing inter-processor data locality on embedded chip multiprocessors

Guilin Chen, M. Kandemir
{"title":"Optimizing inter-processor data locality on embedded chip multiprocessors","authors":"Guilin Chen, M. Kandemir","doi":"10.1145/1086228.1086271","DOIUrl":null,"url":null,"abstract":"Recent research in embedded computing indicates that packing multiple processor cores on the same die is an effective way of utilizing the ever-increasing number of transistors. The advantage of placing multiple cores into a single die is that it reduces on-chip communication costs (in terms of both execution cycles and power consumption) between the processor cores that are traditionally very high in conventional high-performance parallel architectures (such as SMPs). However, on the negative side, this tighter integration exerts an even higher pressure on off-chip accesses to the memory system. This makes minimizing the number of off-chip accesses a critical optimization goal.This paper discusses a compiler-based solution to this problem for the embedded applications that perform stencil computations. An important characteristic of this solution is that it distinguishes between the intra-processor data reuse and inter-processor data reuse. The first of these captures the data reuse that occurs across loop iterations assigned to the same processor, whereas the second one represents the data reuse that takes place across the loop iterations assigned to different processors. The proposed approach then optimizes inter-processor reuse by re-organizing the loop iterations of each processor carefully, considering how data elements are shared across processors. The goal is to ensure that the different processors access the shared data within a short period of time, so that the data can be captured in the on-chip memory space at the time of the reuse. This paper also presents an evaluation of the proposed optimization and compares it to an alternate scheme that optimizes data locality for each processor in isolation. The results obtained by applying our implementation to eight loop-intensive benchmark codes from the embedded computing domain show that our approach improves over the mentioned alternate scheme by 15.6% on average.","PeriodicalId":284648,"journal":{"name":"Proceedings of the 5th ACM international conference on Embedded software","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th ACM international conference on Embedded software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1086228.1086271","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Recent research in embedded computing indicates that packing multiple processor cores on the same die is an effective way of utilizing the ever-increasing number of transistors. The advantage of placing multiple cores into a single die is that it reduces on-chip communication costs (in terms of both execution cycles and power consumption) between the processor cores, costs that are traditionally very high in conventional high-performance parallel architectures (such as SMPs). However, on the negative side, this tighter integration exerts even higher pressure on off-chip accesses to the memory system. This makes minimizing the number of off-chip accesses a critical optimization goal. This paper discusses a compiler-based solution to this problem for embedded applications that perform stencil computations. An important characteristic of this solution is that it distinguishes between intra-processor data reuse and inter-processor data reuse. The first of these captures the data reuse that occurs across loop iterations assigned to the same processor, whereas the second represents the data reuse that takes place across loop iterations assigned to different processors. The proposed approach then optimizes inter-processor reuse by carefully re-organizing the loop iterations of each processor, considering how data elements are shared across processors. The goal is to ensure that the different processors access the shared data within a short period of time, so that the data can be captured in the on-chip memory space at the time of the reuse. This paper also presents an evaluation of the proposed optimization and compares it to an alternate scheme that optimizes data locality for each processor in isolation. The results obtained by applying our implementation to eight loop-intensive benchmark codes from the embedded computing domain show that our approach improves over the mentioned alternate scheme by 15.6% on average.
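
To make the reuse distinction concrete, below is a minimal C sketch of a 5-point Jacobi-style stencil with a block-row partition across P on-chip cores. The grid size N, core count P, and all function names are illustrative assumptions, not code from the paper. Intra-processor reuse arises because each core re-reads its own rows across neighboring iterations; inter-processor reuse arises at the partition boundaries, where a row is read by two adjacent cores.

    /* Illustrative sketch (not the paper's benchmark code):
     * 5-point stencil, block-row distribution over P cores. */
    #define N 1024              /* grid dimension (assumed) */
    #define P 4                 /* number of on-chip cores (assumed) */

    static double a[N][N], b[N][N];

    /* Core p owns rows [p*rows, (p+1)*rows).
     * Intra-processor reuse: a[i][j] is re-read by the iterations
     * (i-1,j), (i+1,j), (i,j-1), (i,j+1) owned by the same core.
     * Inter-processor reuse: the rows just outside [lo, hi) are
     * also read by the neighboring cores p-1 and p+1. */
    void stencil_block(int p)
    {
        int rows = N / P;       /* assume P divides N evenly */
        int lo = p * rows;
        int hi = lo + rows;
        if (lo == 0) lo = 1;    /* skip the grid boundary */
        if (hi == N) hi = N - 1;

        for (int i = lo; i < hi; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                                  a[i][j-1] + a[i][j+1]);
    }

    int main(void)
    {
        /* Sequential stand-in for the P cores running in parallel. */
        for (int p = 0; p < P; p++)
            stencil_block(p);
        return 0;
    }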
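
The abstract does not give the exact re-organization algorithm, so the following is only one plausible schedule in the spirit of the approach, not the paper's method: adjacent cores sweep their row blocks in mirrored directions, so each pair of neighbors reaches its shared boundary rows within a short time window, allowing those rows to be served from on-chip memory rather than re-fetched off-chip.

    /* Hypothetical mirrored-sweep schedule (an illustration, not the
     * paper's algorithm): even cores walk from their highest-numbered
     * row down, odd cores walk from their lowest-numbered row up, so
     * neighboring cores touch their shared rows close in time. */
    #define N 1024
    #define P 4

    static double a[N][N], b[N][N];

    void stencil_mirrored(int p)
    {
        int rows = N / P;
        int lo = p * rows;
        int hi = lo + rows;
        if (lo == 0) lo = 1;
        if (hi == N) hi = N - 1;

        if (p % 2 == 0) {
            /* Even core: starts at the boundary it shares with
             * core p+1, which starts there at the same time. */
            for (int i = hi - 1; i >= lo; i--)
                for (int j = 1; j < N - 1; j++)
                    b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                                      a[i][j-1] + a[i][j+1]);
        } else {
            /* Odd core: mirrors its even neighbors, so the other
             * shared boundary is reached near the end by both. */
            for (int i = lo; i < hi; i++)
                for (int j = 1; j < N - 1; j++)
                    b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                                      a[i][j-1] + a[i][j+1]);
        }
    }

Under this schedule, the rows shared by cores 0 and 1 are accessed by both cores at the start of the sweep, and the rows shared by cores 1 and 2 at the end. That is the property the paper targets: the processors sharing a data element access it within a short period of time, while it can still be held on-chip.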