{"title":"Integrating Cache Oblivious Approach with Modern Processor Architecture: The Case of Floyd-Warshall Algorithm","authors":"Toshio Endo","doi":"10.1145/3368474.3368477","DOIUrl":null,"url":null,"abstract":"In order to implement algorithms on processors with deep cache hierarchy, the cache oblivious approach, which is based on recursive divide and conquer, is considered to be promising. This paper focuses on single-node implementation of Floyd-Warshall (FW) algorithm, which is an important graph computation kernel. For higher performance, another facility of modern processors, SIMD instructions need to be integrated to recursive approach efficiently. This paper describes a methodology to construct recursive implementations that takes architecture with SIMD and multi-core into account while harnessing cache. The experiment shows our FW implementation exhibits around 1.1 TFlops on a dual-socket SkyLake machine and 700 GFlops on a Xeon Phi machine, both of which have AVX512 SIMD ISA.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3368474.3368477","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
In order to implement algorithms on processors with deep cache hierarchy, the cache oblivious approach, which is based on recursive divide and conquer, is considered to be promising. This paper focuses on single-node implementation of Floyd-Warshall (FW) algorithm, which is an important graph computation kernel. For higher performance, another facility of modern processors, SIMD instructions need to be integrated to recursive approach efficiently. This paper describes a methodology to construct recursive implementations that takes architecture with SIMD and multi-core into account while harnessing cache. The experiment shows our FW implementation exhibits around 1.1 TFlops on a dual-socket SkyLake machine and 700 GFlops on a Xeon Phi machine, both of which have AVX512 SIMD ISA.