A Hybrid Vectorized Merge Sort on ARM NEON

arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-09-06 | DOI: arxiv-2409.03970

Jincheng Zhou, Jin Zhang, Xiang Zhang, Tiaojie Xiao, Di Ma, Chunye Gong



Abstract

Sorting algorithms are among the most extensively researched topics in computer science and serve numerous practical applications. Although various sorting algorithms have been proposed for efficiency, different architectures lend distinct flavors to the implementation of parallel sorting. In this paper, we propose a hybrid vectorized merge sort on ARM NEON, named NEON Merge Sort (NEON-MS for short). Specifically, according to the available register functions, we first identify the optimal number of registers to avoid the register-to-memory accesses caused by writing back intermediate results. More importantly, following the generic merge sort framework, which primarily uses a sorting network for the column sort and merging networks for three types of vectorized merge, we further improve their structures for high efficiency in a unified asymmetric way: 1) it makes optimal sorting networks with few comparators possible; 2) a hybrid implementation of both serial and vectorized merges yields a pipeline in which merge instructions are highly interleaved. Experiments on a single FT2000+ core show that NEON-MS is, on average, 3.8 and 2.1 times faster than std::sort and boost::block_sort, respectively. Additionally, compared with the parallel version of the latter, NEON-MS achieves an average speedup of 1.25.
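To make the "sorting network for column sort" idea concrete, here is a minimal sketch, not the authors' implementation. It assumes four registers of four 32-bit lanes each and uses the well-known 5-comparator network for 4 inputs. The names Vec4, compare_exchange, and column_sort4 are hypothetical. The registers are emulated with portable C++ arrays so the example runs anywhere; on NEON, each compare-exchange would plausibly map to one vminq_u32 and one vmaxq_u32 on uint32x4_t values.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical stand-in for a 128-bit NEON register with four 32-bit lanes.
using Vec4 = std::array<std::uint32_t, 4>;

// Lane-wise compare-exchange: afterwards lo[i] <= hi[i] for every lane i.
// Assumption: on NEON this would be one vminq_u32 plus one vmaxq_u32.
inline void compare_exchange(Vec4& lo, Vec4& hi) {
    for (int i = 0; i < 4; ++i) {
        const std::uint32_t a = std::min(lo[i], hi[i]);
        const std::uint32_t b = std::max(lo[i], hi[i]);
        lo[i] = a;
        hi[i] = b;
    }
}

// 5-comparator sorting network for 4 inputs, applied across four
// registers: each lane position (column) ends up sorted top to bottom.
inline void column_sort4(std::array<Vec4, 4>& r) {
    compare_exchange(r[0], r[1]);
    compare_exchange(r[2], r[3]);
    compare_exchange(r[0], r[2]);
    compare_exchange(r[1], r[3]);
    compare_exchange(r[1], r[2]);
}
```

After column_sort4, the four columns are sorted independently. A full vectorized merge sort would typically transpose the registers next, so that each register holds one sorted run, and then combine those runs with merging networks. That later stage is not shown here.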
