Compiler Directed Data Locality Optimization for Multicore Architectures

2011 International Conference on Parallel Architectures and Compilation Techniques. Pub Date: 2011-10-10. DOI: 10.1109/PACT.2011.24

W. Ding, Jithendra Srinivas, M. Kandemir, Mustafa Karaköy



Compiler Directed Data Locality Optimization for Multicore Architectures

This paper presents and evaluates a cache hierarchy-aware code parallelization/mapping and scheduling strategy for multicore architectures. Our proposed parallelization/mapping strategy determines a loop iteration-to-core mapping by taking into account the data access pattern of an application and the on-chip cache hierarchy of the target architecture. The goal of this step is to maximize data locality at each level of the cache hierarchy while minimizing data dependences across cores. Our scheduling strategy, in turn, determines a schedule for the iterations assigned to each core, with the goal of satisfying all data dependences in the code (both intra-core and inter-core) and reducing data reuse distances across the cores that share data. We formulate both the parallelization/mapping problem and the scheduling problem in a linear algebraic framework and solve them using Farkas' lemma and integer Fourier-Motzkin elimination. To measure the effectiveness of our schemes, we implemented them in a compiler and tested them using eight multithreaded application programs on a multicore machine. Our results show that the proposed mapping scheme reduces cache miss rates at all levels of the cache hierarchy and significantly improves application execution time compared to alternative approaches, and that, when supported by scheduling, the improvements in cache miss rates and execution time become much larger.
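The abstract names Fourier-Motzkin elimination as one of the solving tools. As a rough illustration of the basic idea only, the sketch below eliminates one variable from a system of linear inequalities over the rationals. It is not the integer variant the paper refers to, and it is not the authors' implementation; the function name fm_eliminate, the row encoding, and the triangular loop nest used as input are all assumptions made for this example.

```python
from fractions import Fraction


def fm_eliminate(rows, k):
    """Eliminate variable k from a system of linear inequalities.

    Each row is (coeffs, bound), meaning sum(coeffs[i] * x[i]) <= bound.
    The result describes the projection of the solution set onto the
    remaining variables (rational arithmetic, no integer tightening).
    """
    pos, neg, rest = [], [], []
    for coeffs, bound in rows:
        row = ([Fraction(a) for a in coeffs], Fraction(bound))
        c = row[0][k]
        if c > 0:
            pos.append(row)      # gives an upper bound on x[k]
        elif c < 0:
            neg.append(row)      # gives a lower bound on x[k]
        else:
            rest.append(row)     # does not mention x[k]
    out = list(rest)
    # Pair every upper bound with every lower bound so that x[k] cancels.
    for pc, pb in pos:
        for nc, nb in neg:
            sp = 1 / pc[k]       # scale so the coefficient of x[k] is +1
            sn = 1 / -nc[k]      # scale so the coefficient of x[k] is -1
            coeffs = [sp * a + sn * b for a, b in zip(pc, nc)]
            out.append((coeffs, sp * pb + sn * nb))
    return out


# Hypothetical triangular iteration space over (i, j):
#   0 <= i <= 7 and 0 <= j <= i
system = [
    ([-1, 0], 0),   # -i     <= 0
    ([1, 0], 7),    #  i     <= 7
    ([0, -1], 0),   # -j     <= 0
    ([-1, 1], 0),   #  j - i <= 0
]

# Project out j (index 1) to obtain constraints on i alone.
projected = fm_eliminate(system, 1)
```

Here projected contains only constraints on i, which is how a projection of this kind can yield bounds for an outer loop once an inner loop variable has been removed. An integer version needs extra care, because a rational projection can admit values of i for which no integer j exists.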
