Versatile and scalable parallel histogram construction

2014 23rd International Conference on Parallel Architecture and Compilation (PACT), Pub Date: 2014-08-24, DOI: 10.1145/2628071.2628108

Wookeun Jung, Jongsoo Park, Jaejin Lee


Citations: 14

Abstract

Histograms are used in various fields to quickly profile the distribution of a large amount of data. However, it is challenging to efficiently utilize the abundant parallel resources in modern processors for histogram construction. To make matters worse, the most efficient implementation varies depending on input parameters (e.g., input distribution, number of bins, and data type) and architecture parameters (e.g., cache capacity and SIMD width). This paper presents versatile histogram methods that achieve competitive performance across a wide range of input types and target architectures. Our open source implementations are highly optimized for various cases and scale to more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel® Xeon Phi coprocessors for common input data sets because of their compute power from many cores and their instructions for efficient vectorization, such as gather-scatter. For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histograms (word count), the Xeon processor achieves 342.4 million words per second (MWPS). This is 4.12× and 3.46× faster than PHOENIX and TBB, respectively. The Xeon Phi coprocessor achieves 401.4 MWPS, which is 1.17× faster than the Xeon processor. Since histogram construction captures essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.
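To make the terms in the abstract concrete, the following is a minimal sketch of fixed-width binning, variable-width binning, and a parallel build that gives each worker a private histogram and merges the results in a final reduction. It is not the authors' implementation: the abstract does not describe their methods, and privatization followed by reduction is used here only as a commonly used baseline. All function names (`fixed_width_bin`, `variable_width_bin`, `partial_histogram`, `parallel_histogram`) and the `num_workers` parameter are hypothetical. Because CPython threads share the interpreter lock, the sketch illustrates the structure of the computation rather than its performance.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def fixed_width_bin(x, lo, hi, num_bins):
    """Map x in [lo, hi) to a bin index using equal-width bins."""
    idx = int((x - lo) * num_bins / (hi - lo))
    return min(max(idx, 0), num_bins - 1)  # clamp out-of-range values


def variable_width_bin(x, edges):
    """Map x to a bin given sorted edges, where len(edges) == num_bins + 1."""
    idx = bisect_right(edges, x) - 1  # binary search over the bin boundaries
    return min(max(idx, 0), len(edges) - 2)  # clamp out-of-range values


def partial_histogram(chunk, num_bins, bin_of):
    """Build a private histogram for one chunk; nothing is shared, so no locks."""
    hist = [0] * num_bins
    for x in chunk:
        hist[bin_of(x)] += 1
    return hist


def parallel_histogram(data, num_bins, bin_of, num_workers=4):
    """Split the input, build one private histogram per chunk, then reduce."""
    step = max(1, -(-len(data) // num_workers))  # ceiling division
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(
            pool.map(lambda c: partial_histogram(c, num_bins, bin_of), chunks)
        )
    total = [0] * num_bins
    for hist in partials:  # reduction step: sum the private histograms
        for b, count in enumerate(hist):
            total[b] += count
    return total
```

A caller would pass a binning function that closes over its parameters, for example `parallel_histogram(data, 256, lambda x: fixed_width_bin(x, 0.0, 1.0, 256))`. The fixed-width case needs only arithmetic per element, while the variable-width case needs a search over the bin edges, which is consistent with the lower throughput the abstract reports for variable-width bins.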
