{"title":"传递闭包的缓存友好实现","authors":"M. Penner, V. K. Prasanna","doi":"10.1109/PACT.2001.953299","DOIUrl":null,"url":null,"abstract":"We show cache friendly implementations of the Floyd-Warshall algorithm for the all-pairs shortest-path problem. We first compare the best commercial compiler optimizations available with standard cache-friendly optimizations and a simple improvement involving a block layout, which reduces TLB misses. We show approximately 15% improvements using these optimizations. We also develop a general representation, the unidirectional space time representation, which can be used to generate cache friendly implementations for a large class of algorithms. We show analytically and experimentally that this representation can be used to minimize level-1 and level-2 cache misses and TLB misses and therefore exhibits the best overall performance. Using this representation we show a 2/spl times/ improvement in performance with respect to the compiler optimized implementation. Experiments were conducted on Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We used the Simplescalar simulator to demonstrate improved cache performance.","PeriodicalId":276650,"journal":{"name":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2001-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Cache-friendly implementations of transitive closure\",\"authors\":\"M. Penner, V. K. Prasanna\",\"doi\":\"10.1109/PACT.2001.953299\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We show cache friendly implementations of the Floyd-Warshall algorithm for the all-pairs shortest-path problem. We first compare the best commercial compiler optimizations available with standard cache-friendly optimizations and a simple improvement involving a block layout, which reduces TLB misses. We show approximately 15% improvements using these optimizations. We also develop a general representation, the unidirectional space time representation, which can be used to generate cache friendly implementations for a large class of algorithms. We show analytically and experimentally that this representation can be used to minimize level-1 and level-2 cache misses and TLB misses and therefore exhibits the best overall performance. Using this representation we show a 2/spl times/ improvement in performance with respect to the compiler optimized implementation. Experiments were conducted on Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We used the Simplescalar simulator to demonstrate improved cache performance.\",\"PeriodicalId\":276650,\"journal\":{\"name\":\"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2001-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PACT.2001.953299\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2001.953299","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
摘要
我们展示了对全对最短路径问题的Floyd-Warshall算法的缓存友好实现。我们首先将最佳商业编译器优化与标准缓存友好型优化和涉及块布局的简单改进进行比较,后者减少了TLB缺失。使用这些优化,我们显示了大约15%的改进。我们还开发了一种通用的表示,即单向时空表示,它可用于为大类算法生成缓存友好的实现。我们通过分析和实验表明,这种表示可以用来最小化一级和二级缓存丢失和TLB丢失,因此表现出最佳的整体性能。使用这种表示法,我们可以看到相对于编译器优化的实现,性能提高了2/ 1倍。实验在Pentium III, Alpha和MIPS R12000机器上进行,使用1024到2048个顶点之间的问题大小。我们使用Simplescalar模拟器来演示改进的缓存性能。
Cache-friendly implementations of transitive closure
We show cache friendly implementations of the Floyd-Warshall algorithm for the all-pairs shortest-path problem. We first compare the best commercial compiler optimizations available with standard cache-friendly optimizations and a simple improvement involving a block layout, which reduces TLB misses. We show approximately 15% improvements using these optimizations. We also develop a general representation, the unidirectional space time representation, which can be used to generate cache friendly implementations for a large class of algorithms. We show analytically and experimentally that this representation can be used to minimize level-1 and level-2 cache misses and TLB misses and therefore exhibits the best overall performance. Using this representation we show a 2/spl times/ improvement in performance with respect to the compiler optimized implementation. Experiments were conducted on Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We used the Simplescalar simulator to demonstrate improved cache performance.