K. Velusamy, Thomas B. Rolinger, Janice O. McMahon
Title: Performance Strategies for Parallel Bitonic Sort on a Migratory Thread Architecture
DOI: 10.1109/HPEC43674.2020.9286172
Published in: 2020 IEEE High Performance Extreme Computing Conference (HPEC)
Publication date: 2020-09-22
Citations: 1
Abstract
Large-scale data analytics often represent vast amounts of sparse data as a graph. As a result, the underlying kernels in data analytics can be reduced to operations over graphs, such as searches and traversals. Graph algorithms are notoriously difficult to implement for high performance due to the irregular nature of their memory access patterns, which results in poor utilization of a traditional cache memory hierarchy. Consequently, new architectures have been proposed that specifically target irregular applications. One example is the cache-less Emu migratory thread architecture developed by Lucata Technology. While it is important to evaluate and understand irregular applications on a system such as Emu, it is equally important to explore applications that are not irregular themselves but are often used as key pre-processing steps in irregular applications. Sorting a list of values is one such pre-processing step, as well as one of the fundamental operations in data analytics. In this paper, we extend our prior preliminary evaluation of parallel bitonic sort on the Emu architecture. We explore different performance strategies for bitonic sort by leveraging the unique features of Emu. In doing so, we implement three significant capabilities in bitonic sort: a smart data layout that periodically remaps data to avoid remote accesses, efficient thread spawning strategies, and adaptive loop parallelization to achieve proper load balancing over time. We present a performance evaluation that demonstrates speed-ups of as much as 14.26x from leveraging these capabilities.
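To make the subject of the paper concrete for readers unfamiliar with it, the following is a minimal, sequential sketch of the bitonic sorting network the abstract refers to. It is not the authors' Emu implementation (which adds data remapping, thread spawning strategies, and adaptive loop parallelization); it only illustrates the fixed compare-exchange pattern that makes bitonic sort attractive for parallel hardware, since the comparison schedule is independent of the data.

```python
def bitonic_sort(a):
    """In-place iterative bitonic sort.

    Requires len(a) to be a power of two, as the classic network does.
    The inner `for i` loop is data-independent, so on a parallel machine
    all of its compare-exchanges can run concurrently.
    """
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # compare-exchange distance within this merge stage
        while j > 0:
            for i in range(n):
                partner = i ^ j  # element paired with i at this distance
                if partner > i:  # visit each pair once
                    ascending = (i & k) == 0  # direction alternates by block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

On Emu-style hardware, each compare-exchange of the inner loop touches two memory locations whose distance changes every stage, which is why the paper's periodic data remapping to keep partners local is central to its speed-ups.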