Author: J. Skeppstedt
Venue: Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques
Publication date: 1997-11-11
DOI: 10.1109/PACT.1997.644023
Overcoming limitations of prefetching in multiprocessors by compiler-initiated coherence actions
In this paper we first identify limitations of compiler-controlled prefetching in a CC-NUMA multiprocessor with a write-invalidate cache coherence protocol. Compiler-controlled prefetch techniques for CC-NUMAs are often focused only on stride accesses, and this introduces a major limitation. We consider combining prefetch with two other compiler-controlled techniques to partly remedy the situation: (1) load-exclusive to reduce write latency and (2) store-update to reduce read latency. The purpose of each of these techniques in a machine with prefetch is to reduce latency for accesses which the prefetch technique could not handle. We evaluate two different scenarios: first with a hybrid compiler/hardware prefetch technique, and second with an optimal stride prefetcher. We find that the combined gains under the hybrid prefetch technique are significant for the six applications we have studied: on average, 71% of the original write-stall time remains after using the hybrid prefetcher, and of these ownership requests, 60% would be eliminated using load-exclusive; on average, 68% of the read-stall time remains after using the hybrid prefetcher, and of these read misses, 34% were serviced by remote caches and would be converted by store-update into misses serviced by a clean copy in memory, which reduces the read latency. With an optimal stride prefetcher, our results show that it is beneficial to complement prefetch with the two techniques here as well.