K. Velusamy, Thomas B. Rolinger, Janice O. McMahon
Title: Performance Strategies for Parallel Bitonic Sort on a Migratory Thread Architecture
DOI: 10.1109/HPEC43674.2020.9286172
Published in: 2020 IEEE High Performance Extreme Computing Conference (HPEC)
Publication date: 2020-09-22
Citations: 1
Abstract
Large-scale data analytics often represent vast amounts of sparse data as a graph. As a result, the underlying kernels in data analytics can be reduced to operations over graphs, such as searches and traversals. Graph algorithms are notoriously difficult to implement for high performance due to the irregular nature of their memory access patterns, which results in poor utilization of a traditional cache memory hierarchy. Consequently, new architectures have been proposed that specifically target irregular applications. One example is the cache-less Emu migratory thread architecture developed by Lucata Technology. While it is important to evaluate and understand irregular applications on a system such as Emu, it is equally important to explore applications that are not irregular themselves but are often used as key pre-processing steps in irregular applications. Sorting a list of values is one such pre-processing step, as well as one of the fundamental operations in data analytics. In this paper, we extend our prior preliminary evaluation of parallel bitonic sort on the Emu architecture. We explore different performance strategies for bitonic sort by leveraging the unique features of Emu. In doing so, we implement three significant capabilities in bitonic sort: a smart data layout that periodically remaps data to avoid remote accesses, efficient thread spawning strategies, and adaptive loop parallelization to achieve proper load balancing over time. We present a performance evaluation that demonstrates speed-ups of as much as 14.26x from leveraging these capabilities.
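To make the subject of the paper concrete for readers unfamiliar with it, the following is a minimal, sequential sketch of the bitonic sorting network the abstract refers to. It is not the authors' Emu implementation (which adds data remapping, thread spawning strategies, and adaptive loop parallelization); it only illustrates the fixed compare-exchange pattern that makes bitonic sort attractive for parallel hardware, since the comparison schedule is independent of the data.

```python
def bitonic_sort(a):
    """In-place iterative bitonic sort.

    Requires len(a) to be a power of two, as the classic network does.
    The inner `for i` loop is data-independent, so on a parallel machine
    all of its compare-exchanges can run concurrently.
    """
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # compare-exchange distance within this merge stage
        while j > 0:
            for i in range(n):
                partner = i ^ j  # element paired with i at this distance
                if partner > i:  # visit each pair once
                    ascending = (i & k) == 0  # direction alternates by block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

On Emu-style hardware, each compare-exchange of the inner loop touches two memory locations whose distance changes every stage, which is why the paper's periodic data remapping to keep partners local is central to its speed-ups.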