{"title":"Combining loop fusion with prefetching on shared-memory multiprocessors","authors":"N. Manjikian","doi":"10.1109/ICPP.1997.622560","DOIUrl":null,"url":null,"abstract":"The performance of programs consisting of parallel loops on shared-memory multiprocessors is limited by long memory latencies as processor speeds increase more rapidly than memory speeds. Two complementary techniques for addressing memory latency and improving performance are: (a) cache locality enhancement for latency reduction and (b) data prefetching for latency tolerance. This paper studies the benefit of combining loop fusion for locality enhancement with prefetching. Experimental results are reported for multiprocessors with support for prefetching. For a complete application on an SGI Power Challenge R10000, combining loop fusion with prefetching improves parallel speedup by 46%.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.1997.622560","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15
Abstract
The performance of programs consisting of parallel loops on shared-memory multiprocessors is limited by long memory latencies as processor speeds increase more rapidly than memory speeds. Two complementary techniques for addressing memory latency and improving performance are: (a) cache locality enhancement for latency reduction and (b) data prefetching for latency tolerance. This paper studies the benefit of combining loop fusion for locality enhancement with prefetching. Experimental results are reported for multiprocessors with support for prefetching. For a complete application on an SGI Power Challenge R10000, combining loop fusion with prefetching improves parallel speedup by 46%.
在共享内存多处理器上由并行循环组成的程序的性能受到长内存延迟的限制,因为处理器速度比内存速度增长得更快。解决内存延迟和提高性能的两种互补技术是:(a)缓存局域性增强以减少延迟;(b)数据预取以容忍延迟。研究了环融合局部增强与预取相结合的优点。报道了支持预取的多处理器的实验结果。对于SGI Power Challenge R10000上的完整应用程序,将环路融合与预取相结合可将并行加速提高46%。