Optimization of scan algorithms on multi- and many-core processors
Authors: Qiao Sun, Chao Yang
Published in: 2014 21st International Conference on High Performance Computing (HiPC), 2014-12-01
DOI: https://doi.org/10.1109/HiPC.2014.7116883
Scan is a basic building block widely used in many applications. With the emergence of multi-core and many-core processors, the study of highly scalable parallel scan algorithms has become increasingly important. In this paper, we first propose a novel parallel scan algorithm, QUARK-scan, based on the fine-grained dynamic task scheduling in QUARK, and then derive a cache-friendly framework for any parallel scan kernel. QUARK-scan is superior to the fastest available counterpart, proposed by Zhang in 2012, and to many other parallel scans in several respects, including greatly improved load balance and a substantially reduced number of global barriers. The cache-friendly framework, in turn, improves cache line usage and can be applied to any parallel scan kernel. A variety of optimization techniques, such as SIMD vectorization, loop unrolling, adjacent synchronization and thread affinity, are exploited in QUARK-scan and in the cache-friendly versions of both QUARK-scan and Zhang's scan. Experiments on three typical multi- and many-core platforms indicate that the proposed QUARK-scan and the cache-friendly version of Zhang's scan each perform best in different scenarios.
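For readers unfamiliar with the operation, the following is a minimal sketch of an inclusive scan (prefix sum) computed with a generic three-phase blocked scheme. It is not QUARK-scan, Zhang's scan, or the cache-friendly framework described in the abstract; the function name `blocked_inclusive_scan`, the `num_blocks` parameter and the use of a Python thread pool are assumptions made purely for illustration. The boundaries between phases play the role of global synchronization points, which is the kind of cost the abstract says QUARK-scan reduces.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate
import operator


def blocked_inclusive_scan(values, num_blocks=4, op=operator.add):
    """Inclusive scan using a generic three-phase blocked scheme.

    Illustrative only: this is a textbook baseline, not the algorithm
    proposed in the paper. `op` must be associative.
    """
    n = len(values)
    if n == 0:
        return []
    # Never use more blocks than elements, so every block is non-empty.
    num_blocks = max(1, min(num_blocks, n))
    # Split the index range [0, n) into nearly equal contiguous blocks.
    bounds = [(i * n) // num_blocks for i in range(num_blocks + 1)]
    blocks = [values[bounds[i]:bounds[i + 1]] for i in range(num_blocks)]

    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        # Phase 1: scan each block independently of the others.
        local = list(pool.map(lambda b: list(accumulate(b, op)), blocks))

        # Phase 2 (serial): carries[i] combines the totals of all blocks
        # before block i; the first block has no carry.
        carries = [None]
        running = None
        for scanned in local[:-1]:
            total = scanned[-1]
            running = total if running is None else op(running, total)
            carries.append(running)

        # Phase 3: combine each block with its carry, carry on the left
        # so that non-commutative operators still give the right order.
        def apply_carry(pair):
            carry, scanned = pair
            if carry is None:
                return scanned
            return [op(carry, x) for x in scanned]

        fixed = list(pool.map(apply_carry, zip(carries, local)))

    # Concatenate the corrected blocks back into one list.
    return [x for block in fixed for x in block]
```

Calling `blocked_inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8], num_blocks=3)` returns `[1, 3, 6, 10, 15, 21, 28, 36]`. In CPython the threads here mainly show the structure of the computation rather than deliver a speedup; a tuned implementation on real multi- or many-core hardware would look quite different.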