{"title":"Versatile and scalable parallel histogram construction","authors":"Wookeun Jung, Jongsoo Park, Jaejin Lee","doi":"10.1145/2628071.2628108","DOIUrl":null,"url":null,"abstract":"Histograms are used in various fields to quickly profile the distribution of a large amount of data. However, it is challenging to efficiently utilize abundant parallel resources in modern processors for histogram construction. To make matters worse, the most efficient implementation varies depending on input parameters (e.g., input distribution, number of bins, and data type) or architecture parameters (e.g., cache capacity and SIMD width). This paper presents versatile histogram methods that achiev competitive performance across a wide range of input types and target architectures. Our open source implementations are highly optimized for various cases and are scalable for more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel® Xeon Phi coprocessors for common input data sets because of their compute power from many cores and instructions for efficient vectorization, such as gather-scatter. For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histogram, or word count, the Xeon processor achieves 342.4 million words per seconds (MWPS). This is 4.12×, 3.46× faster than PHOENIX and TBB. The Xeon phi processor achieves 401.4 MWPS, which is 1.17× faster than the Xeon processor. Since histogram construction captures essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2628071.2628108","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14
Abstract
Histograms are used in various fields to quickly profile the distribution of a large amount of data. However, it is challenging to efficiently utilize abundant parallel resources in modern processors for histogram construction. To make matters worse, the most efficient implementation varies depending on input parameters (e.g., input distribution, number of bins, and data type) or architecture parameters (e.g., cache capacity and SIMD width). This paper presents versatile histogram methods that achiev competitive performance across a wide range of input types and target architectures. Our open source implementations are highly optimized for various cases and are scalable for more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel® Xeon Phi coprocessors for common input data sets because of their compute power from many cores and instructions for efficient vectorization, such as gather-scatter. For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histogram, or word count, the Xeon processor achieves 342.4 million words per seconds (MWPS). This is 4.12×, 3.46× faster than PHOENIX and TBB. The Xeon phi processor achieves 401.4 MWPS, which is 1.17× faster than the Xeon processor. Since histogram construction captures essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.