Author: J. Skeppstedt
Venue: Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques
Publication date: 1997-11-11
DOI: 10.1109/PACT.1997.644023
Overcoming limitations of prefetching in multiprocessors by compiler-initiated coherence actions
In this paper we first identify limitations of compiler-controlled prefetching in a CC-NUMA multiprocessor with a write-invalidate cache coherence protocol. Compiler-controlled prefetch techniques for CC-NUMAs are often focused only on stride accesses, and this introduces a major limitation. We consider combining prefetch with two other compiler-controlled techniques to partly remedy the situation: (1) load-exclusive to reduce write latency and (2) store-update to reduce read latency. The purpose of each of these techniques in a machine with prefetch is to reduce latency for accesses which the prefetch technique could not handle. We evaluate two different scenarios: first with a hybrid compiler/hardware prefetch technique, and second with an optimal stride prefetcher. We find that the combined gains under the hybrid prefetch technique are significant for the six applications we have studied: on average, 71% of the original write-stall time remains after using the hybrid prefetcher, and of these ownership requests, 60% would be eliminated using load-exclusive; on average, 68% of the read-stall time remains after using the hybrid prefetcher, and of these read misses, 34% were serviced by remote caches and would be converted by store-update into misses serviced by a clean copy in memory, which reduces the read latency. With an optimal stride prefetcher, our results show that it is beneficial to complement prefetch with the two techniques here as well.