2016 45th International Conference on Parallel Processing (ICPP): Latest Publications

Help-Optimal and Language-Portable Lock-Free Concurrent Data Structures
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.48
Bapi Chatterjee, Ivan Walulya, P. Tsigas
{"title":"Help-Optimal and Language-Portable Lock-Free Concurrent Data Structures","authors":"Bapi Chatterjee, Ivan Walulya, P. Tsigas","doi":"10.1109/ICPP.2016.48","DOIUrl":"https://doi.org/10.1109/ICPP.2016.48","url":null,"abstract":"Helping is a widely used technique to guarantee lock-freedom in many concurrent data structures. An optimized helping strategy improves the overall performance of a lock-free algorithm. In this paper, we propose help-optimality, which essentially implies that no operation step is accounted for exclusive helping in the lock-free synchronization of concurrent operations. To describe the concept, we revisit the designs of a lock-free linked-list and a lock-free binary search tree and present improved algorithms. Our algorithms employ atomic single-word compare-and-swap (CAS) primitives and are linearizable. We design the algorithms without using any language/platformspecific mechanism. Specifically, we use neither bit-stealing froma pointer nor runtime type introspection of objects. Thus, our algorithms are language-portable. Further, to optimize the amortized number of steps per operation, if a CAS execution tomodify a shared pointer fails, we obtain a fresh set of thread-local variables without restarting an operation from scratch. We use several micro-benchmarks in both C/C++ and Java to validate the efficiency of our algorithms against existing state-of-the-art. The experiments show that the algorithms are scalable. Our implementations perform on a par with highly optimizedones and in many cases yield 10%-50% higher throughput.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131275835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
RegTT: Accelerating Tree Traversals on GPUs by Exploiting Regularities
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.71
Feng Zhang, Peng Di, Hao Zhou, Xiangke Liao, Jingling Xue
{"title":"RegTT: Accelerating Tree Traversals on GPUs by Exploiting Regularities","authors":"Feng Zhang, Peng Di, Hao Zhou, Xiangke Liao, Jingling Xue","doi":"10.1109/ICPP.2016.71","DOIUrl":"https://doi.org/10.1109/ICPP.2016.71","url":null,"abstract":"Tree traversals are widely used irregular applications. Given a tree traversal algorithm, where a single tree is traversed by multiple queries (with truncation), its efficient parallelization on GPUs is hindered by branch divergence, load imbalance and memory-access irregularity, as the nodes and their visitation orders differ greatly under different queries. We leverage a key insight made on several truncation-induced tree traversal regularities to enable as many threads in the same warp as possible to visit the same node simultaneously, thereby enhancing both GPU resource utilization and memory coalescing at the same time. We introduce a new parallelization approach, RegTT, to orchestrate an efficient execution of a tree traversal algorithm on GPUs by starting with BFT (Breadth-First Traversal), then reordering the queries being processed (based on their truncation histories), and finally, switching to DFT (Depth-First Traversal). RegTT is general (without relying on domain-specific knowledge) and automatic (as a source-code transformation). For a set of five representative benchmarks used, RegTT outperforms the state-of-the-art by 1.66x on average.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116296325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Making In-Memory Frequent Pattern Mining Durable and Energy Efficient
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.13
Yi Lin, Po-Chun Huang, Duo Liu, Xiao Zhu, Liang Liang
{"title":"Making In-Memory Frequent Pattern Mining Durable and Energy Efficient","authors":"Yi Lin, Po-Chun Huang, Duo Liu, Xiao Zhu, Liang Liang","doi":"10.1109/ICPP.2016.13","DOIUrl":"https://doi.org/10.1109/ICPP.2016.13","url":null,"abstract":"It is a significant problem to efficiently identifythe frequently-occurring patterns in a given dataset, so as tounveil the trends hidden behind the dataset. This work ismotivated by the serious demands of a high-performance inmemoryfrequent-pattern mining strategy, with joint optimizationover the mining performance and system durability. While thewidely-used frequent-pattern tree (FP-tree) serves as an efficientapproach for frequent-pattern mining, its construction procedureoften makes it unfriendly for nonvolatile memories (NVMs). Inparticular, the incremental construction of FP-tree could generatemany unnecessary writes to the NVM and greatly degrade theenergy efficiency, because NVM writes typically take more timeand energy than reads. To overcome the drawbacks of FP-treeon NVMs, this paper proposes evergreen FP-tree (EvFP-tree), which includes a lazy counter and a minimum-bit-altered (MBA) encoding scheme to make FP-tree friendly for NVMs. The basicidea of the lazy counter is to greatly eliminate the redundantwrites generated in FP-tree construction. On the other hand, theMBA encoding scheme is to complement existing wear-levelingtechniques to evenly write each memory cell to extend the NVMlifetime. As verified by experiments, EvFP-tree greatly enhancesthe mining performance and system lifetime by 28.01% and82.10% on average, respectively.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121912849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A Comparison of Accelerator Architectures for Radio-Astronomical Signal-Processing Algorithms
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.62
J. Romein
{"title":"A Comparison of Accelerator Architectures for Radio-Astronomical Signal-Processing Algorithms","authors":"J. Romein","doi":"10.1109/ICPP.2016.62","DOIUrl":"https://doi.org/10.1109/ICPP.2016.62","url":null,"abstract":"In this paper, we compare a wide range of accelerator architectures (GPUs from AMD and NVIDIA, the Xeon Phi, and a DSP), by means of a signal-processing pipeline that processes radio-telescope data. We discuss the mapping of the algorithms from this pipeline to the accelerators, and analyze performance. We also analyze energy efficiency, using custom-built, microcontroller-based power sensors that measure the instantaneous power consumption of the accelerators, at millisecond time scale. We show that the GPUs are the fastest and most energy efficient accelerators, and that the differences in performance and energy efficiency are large.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128929096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Performance Analysis of GPU-Based Convolutional Neural Networks
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.15
Xiaqing Li, Guangyan Zhang, H. Howie Huang, Zhufan Wang, Weimin Zheng
{"title":"Performance Analysis of GPU-Based Convolutional Neural Networks","authors":"Xiaqing Li, Guangyan Zhang, H. Howie Huang, Zhufan Wang, Weimin Zheng","doi":"10.1109/ICPP.2016.15","DOIUrl":"https://doi.org/10.1109/ICPP.2016.15","url":null,"abstract":"As one of the most important deep learning models, convolutional neural networks (CNNs) have achieved great successes in a number of applications such as image classification, speech recognition and nature language understanding. Training CNNs on large data sets is computationally expensive, leading to a flurry of research and development of open-source parallel implementations on GPUs. However, few studies have been performed to evaluate the performance characteristics of those implementations. In this paper, we conduct a comprehensive comparison of these implementations over a wide range of parameter configurations, investigate potential performance bottlenecks and point out a number of opportunities for further optimization.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126569471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 109
High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.38
Jie Zhang, Xiaoyi Lu, D. Panda
{"title":"High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters","authors":"Jie Zhang, Xiaoyi Lu, D. Panda","doi":"10.1109/ICPP.2016.38","DOIUrl":"https://doi.org/10.1109/ICPP.2016.38","url":null,"abstract":"Virtualization technology has grown rapidly over the past few decades. As a lightweight solution, container-based virtualization provides a promising approach to efficiently build HPC clouds. However, our study shows clear performance bottleneck when running MPI jobs on multi-container environments. This motivates us to first analyze the performance bottleneck for MPI jobs running in different container deployment scenarios. To eliminate performance bottleneck, we propose a high performance locality-aware MPI library, which is able to dynamically detect co-resident containers at runtime. Through this design, the MPI processes in co-resident containers can communicate to each other by shared memory and Cross Memory Attach (CMA) channels instead of the network channel. A comprehensive performance study indicates that compared with the default case, our proposed design can significantly improve the communication performance by up to 9X and 86% in terms of MPI point-to-point and collective operations, respectively. The results for applications demonstrate that the locality-aware design can reduce up to 16% of execution time. The evaluation results also show that by the help of locality-aware design, we can achieve near-native performance in container-based HPC cloud with minor overhead. The proposed locality-aware MPI design reveals significant potential to be utilized to efficiently build large scale container-based HPC clouds.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133400766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Partial Flattening: A Compilation Technique for Irregular Nested Parallelism on GPGPUs
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.70
Ming-Hsiang Huang, Wuu Yang
{"title":"Partial Flattening: A Compilation Technique for Irregular Nested Parallelism on GPGPUs","authors":"Ming-Hsiang Huang, Wuu Yang","doi":"10.1109/ICPP.2016.70","DOIUrl":"https://doi.org/10.1109/ICPP.2016.70","url":null,"abstract":"Supporting irregular nested parallelism on modern GPUs requires much effort. One should distribute the parallel tasks evenly while preserving reasonable memory usage. Moreover, the task distribution should also fit the thread hierarchy of the underlying GPU to fully exploit its computing power. We propose partial flattening, an automatic code transformation which translates annotated C programs to CUDA kernels. Thread blocks are treated as flat SIMT processors. Iterations are dynamically organized into batches. Batches are executed in a sequential (depth-first) order. A kernel is treated as multiple independent SIMT processors with an additional task-stealing mechanism. Partial flattening allows easy expression of nested parallelism and synchronization by annotating nested parallel loops or parallel-recursive calls, while preserving reasonable memory usage by the depth-first execution order. Our 2-level task distribution scheme does not need special hardware support, and fits well with the CUDA thread hierarchy. Experiments show that partial flattening outperforms NESL significantly in most benchmarks, and obtains 2.15x and 67x speedup over CUDA dynamic parallelism in Quicksort and the Bron-Kerbosch algorithm, respectively.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115055839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Resilient Application Co-scheduling with Processor Redistribution
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.21
A. Benoit, L. Pottier, Y. Robert
{"title":"Resilient Application Co-scheduling with Processor Redistribution","authors":"A. Benoit, L. Pottier, Y. Robert","doi":"10.1109/ICPP.2016.21","DOIUrl":"https://doi.org/10.1109/ICPP.2016.21","url":null,"abstract":"Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted to frequent failures, and resilience techniques must be employed to ensure the completion of large applications. Indeed, failures may create severe imbalance between applications, and significantly degrade performance. In this paper, we propose to redistribute the resources assigned to each application upon the striking of failures, in order to minimize the expected completion time of a set of co-scheduled applications. First, we introduce a formal model and establish complexity results. When no redistribution is allowed, we can minimize the expected completion time in polynomial time, while the problem becomes NP-complete with redistributions, even in a fault-free context. Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122823454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
PCAF: Scalable, High Precision k-NN Search Using Principal Component Analysis Based Filtering
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.79
Huan Feng, D. Eyers, S. Mills, Yongwei Wu, Zhiyi Huang
{"title":"PCAF: Scalable, High Precision k-NN Search Using Principal Component Analysis Based Filtering","authors":"Huan Feng, D. Eyers, S. Mills, Yongwei Wu, Zhiyi Huang","doi":"10.1109/ICPP.2016.79","DOIUrl":"https://doi.org/10.1109/ICPP.2016.79","url":null,"abstract":"Approximate k Nearest Neighbours (AkNN) search is widely used in domains such as computer vision and machine learning. However, AkNN search in high dimensional datasets does not work well on multicore platforms. It scales poorly due to its large memory footprint. Current parallel AkNN search using space subdivision for filtering helps reduce the memory footprint, but leads to loss of precision. We propose a new data filtering method -- PCAF -- for parallel AkNN search based on principal components analysis. PCAF improves on previous methods by demonstrating sustained, high scalability for a wide range of high dimensional datasets on both Intel and AMD multicore platforms. Moreover, PCAF maintains high precision in terms of the AkNN search results.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130394725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Efficient Virtual Network Embedding for Variable Size Virtual Machines in Fat-Tree Data Centers
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.8
Jun Duan, Yuanyuan Yang
{"title":"Efficient Virtual Network Embedding for Variable Size Virtual Machines in Fat-Tree Data Centers","authors":"Jun Duan, Yuanyuan Yang","doi":"10.1109/ICPP.2016.8","DOIUrl":"https://doi.org/10.1109/ICPP.2016.8","url":null,"abstract":"Network virtualization is the enabling technology for sharing resources on cloud. The efficiency of virtual network embedding determines the expense and revenue ratio of a data center. In this paper, we consider the virtual network embedding problem in fat-tree data centers. We design various schemes to embed Nonblocking Multicast Virtual Networks (NMVNs) which are dedicated to deliver premium experience to cloud users. In each NMVN, there is a free combination of virtual machines selected from variable sizes. The bottleneck of communications between these virtual machines is removed so that they can always send data at full bandwidth of their network interface, even if data is simultaneously sent to multiple destinations. In addition, the high performance of NMVNs is guaranteed at the wellcontrolled low network hardware cost. We design two embedding schemes for NMVNs, named Static NMVN Embedding (SNE) and Dynamic NMVN Embedding (DNE). Both schemes support the nonblocking properties for multicast. Besides, each of the two schemes has its unique features. The SNE scheme provides an interference-free solution, in the sense that a virtual network is not aware of the existence of other virtual networks during its lifetime. The DNE scheme has lower hardware cost than SNE and provides higher flexibility to cloud users by possible reconfigurations when necessary. Additionally, we show through theoretical analysis and simulations to validate that the overhead of DNE is minimal thus acceptable to most cloud applications.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130230699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3