A Case for Work-stealing on FPGAs with OpenCL Atomics
Nadesh Ramanathan, John Wickerson, F. Winterstein, G. Constantinides
Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016-02-21
DOI: 10.1145/2847263.2847343 (https://doi.org/10.1145/2847263.2847343)
Citations: 36
Abstract
We provide a case study of work-stealing, a popular method for run-time load balancing, on FPGAs. Following the Cederman-Tsigas implementation for GPUs, we synchronize work-items not with locks, mutexes or critical sections, but instead with the atomic operations provided by Altera's OpenCL SDK. We evaluate work-stealing for FPGAs by synthesizing a K-means clustering algorithm on an Altera P385 D5 board, both with work-stealing and with a statically-partitioned load. When block RAM utilization is maximised in both cases, we find that work-stealing leads to a 1.5x speedup. This demonstrates that the ability to do load balancing at run-time can outweigh the drawback of using 'expensive' atomics on FPGAs. We hope that our case study will stimulate further research into the high-level synthesis of fine-grained, lock-free, concurrent programs.
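To make the idea of lock-free work-stealing concrete, the following is a minimal sketch in host-side C11, not the paper's OpenCL kernel code. All names (`task_queue`, `try_claim`, `next_task`) are hypothetical. It also simplifies the scheme considerably: each queue is pre-filled with a contiguous range of task indices, and both the owner and any thief claim tasks from the same end with a compare-and-swap, whereas a full work-stealing deque also lets the owner push new work. In an OpenCL 1.x kernel, the compare-and-swap would plausibly map to the built-in `atomic_cmpxchg`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical illustration only; not the authors' implementation. */
typedef struct {
    atomic_int next; /* index of the next unclaimed task */
    int end;         /* one past the last task in this queue */
} task_queue;

/* Fill a queue with the task indices [begin, end). */
static void queue_init(task_queue *q, int begin, int end) {
    assert(begin <= end);
    atomic_init(&q->next, begin);
    q->end = end;
}

/* Try to claim one task from q without taking a lock.
 * Returns true and stores the task index in *task on success. */
static bool try_claim(task_queue *q, int *task) {
    int cur = atomic_load(&q->next);
    while (cur < q->end) {
        /* If another worker advanced `next` first, the exchange fails,
         * `cur` is refreshed with the current value, and we retry. */
        if (atomic_compare_exchange_weak(&q->next, &cur, cur + 1)) {
            *task = cur;
            return true;
        }
    }
    return false; /* queue is empty */
}

/* Take work from the worker's own queue first; if it is empty,
 * steal from the other queues in round-robin order. */
static bool next_task(task_queue *queues, int num_queues, int self, int *task) {
    for (int i = 0; i < num_queues; i++) {
        int victim = (self + i) % num_queues;
        if (try_claim(&queues[victim], task)) {
            return true;
        }
    }
    return false; /* no work left anywhere */
}
```

With a static partition, a worker whose queue runs dry simply idles. In this sketch it instead calls `next_task` again and picks up leftover tasks from a busier worker, which is the run-time load balancing the abstract weighs against the cost of the atomic operations.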