2016 45th International Conference on Parallel Processing (ICPP): Latest Publications

Help-Optimal and Language-Portable Lock-Free Concurrent Data Structures
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.48
Bapi Chatterjee, Ivan Walulya, P. Tsigas
{"title":"Help-Optimal and Language-Portable Lock-Free Concurrent Data Structures","authors":"Bapi Chatterjee, Ivan Walulya, P. Tsigas","doi":"10.1109/ICPP.2016.48","DOIUrl":"https://doi.org/10.1109/ICPP.2016.48","url":null,"abstract":"Helping is a widely used technique to guarantee lock-freedom in many concurrent data structures. An optimized helping strategy improves the overall performance of a lock-free algorithm. In this paper, we propose help-optimality, which essentially implies that no operation step is accounted for exclusive helping in the lock-free synchronization of concurrent operations. To describe the concept, we revisit the designs of a lock-free linked-list and a lock-free binary search tree and present improved algorithms. Our algorithms employ atomic single-word compare-and-swap (CAS) primitives and are linearizable. We design the algorithms without using any language/platformspecific mechanism. Specifically, we use neither bit-stealing froma pointer nor runtime type introspection of objects. Thus, our algorithms are language-portable. Further, to optimize the amortized number of steps per operation, if a CAS execution tomodify a shared pointer fails, we obtain a fresh set of thread-local variables without restarting an operation from scratch. We use several micro-benchmarks in both C/C++ and Java to validate the efficiency of our algorithms against existing state-of-the-art. The experiments show that the algorithms are scalable. Our implementations perform on a par with highly optimizedones and in many cases yield 10%-50% higher throughput.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131275835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
RegTT: Accelerating Tree Traversals on GPUs by Exploiting Regularities
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.71
Feng Zhang, Peng Di, Hao Zhou, Xiangke Liao, Jingling Xue
{"title":"RegTT: Accelerating Tree Traversals on GPUs by Exploiting Regularities","authors":"Feng Zhang, Peng Di, Hao Zhou, Xiangke Liao, Jingling Xue","doi":"10.1109/ICPP.2016.71","DOIUrl":"https://doi.org/10.1109/ICPP.2016.71","url":null,"abstract":"Tree traversals are widely used irregular applications. Given a tree traversal algorithm, where a single tree is traversed by multiple queries (with truncation), its efficient parallelization on GPUs is hindered by branch divergence, load imbalance and memory-access irregularity, as the nodes and their visitation orders differ greatly under different queries. We leverage a key insight made on several truncation-induced tree traversal regularities to enable as many threads in the same warp as possible to visit the same node simultaneously, thereby enhancing both GPU resource utilization and memory coalescing at the same time. We introduce a new parallelization approach, RegTT, to orchestrate an efficient execution of a tree traversal algorithm on GPUs by starting with BFT (Breadth-First Traversal), then reordering the queries being processed (based on their truncation histories), and finally, switching to DFT (Depth-First Traversal). RegTT is general (without relying on domain-specific knowledge) and automatic (as a source-code transformation). For a set of five representative benchmarks used, RegTT outperforms the state-of-the-art by 1.66x on average.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116296325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Making In-Memory Frequent Pattern Mining Durable and Energy Efficient
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.13
Yi Lin, Po-Chun Huang, Duo Liu, Xiao Zhu, Liang Liang
{"title":"Making In-Memory Frequent Pattern Mining Durable and Energy Efficient","authors":"Yi Lin, Po-Chun Huang, Duo Liu, Xiao Zhu, Liang Liang","doi":"10.1109/ICPP.2016.13","DOIUrl":"https://doi.org/10.1109/ICPP.2016.13","url":null,"abstract":"It is a significant problem to efficiently identifythe frequently-occurring patterns in a given dataset, so as tounveil the trends hidden behind the dataset. This work ismotivated by the serious demands of a high-performance inmemoryfrequent-pattern mining strategy, with joint optimizationover the mining performance and system durability. While thewidely-used frequent-pattern tree (FP-tree) serves as an efficientapproach for frequent-pattern mining, its construction procedureoften makes it unfriendly for nonvolatile memories (NVMs). Inparticular, the incremental construction of FP-tree could generatemany unnecessary writes to the NVM and greatly degrade theenergy efficiency, because NVM writes typically take more timeand energy than reads. To overcome the drawbacks of FP-treeon NVMs, this paper proposes evergreen FP-tree (EvFP-tree), which includes a lazy counter and a minimum-bit-altered (MBA) encoding scheme to make FP-tree friendly for NVMs. The basicidea of the lazy counter is to greatly eliminate the redundantwrites generated in FP-tree construction. On the other hand, theMBA encoding scheme is to complement existing wear-levelingtechniques to evenly write each memory cell to extend the NVMlifetime. As verified by experiments, EvFP-tree greatly enhancesthe mining performance and system lifetime by 28.01% and82.10% on average, respectively.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121912849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A Comparison of Accelerator Architectures for Radio-Astronomical Signal-Processing Algorithms
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.62
J. Romein
{"title":"A Comparison of Accelerator Architectures for Radio-Astronomical Signal-Processing Algorithms","authors":"J. Romein","doi":"10.1109/ICPP.2016.62","DOIUrl":"https://doi.org/10.1109/ICPP.2016.62","url":null,"abstract":"In this paper, we compare a wide range of accelerator architectures (GPUs from AMD and NVIDIA, the Xeon Phi, and a DSP), by means of a signal-processing pipeline that processes radio-telescope data. We discuss the mapping of the algorithms from this pipeline to the accelerators, and analyze performance. We also analyze energy efficiency, using custom-built, microcontroller-based power sensors that measure the instantaneous power consumption of the accelerators, at millisecond time scale. We show that the GPUs are the fastest and most energy efficient accelerators, and that the differences in performance and energy efficiency are large.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128929096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Performance Analysis of GPU-Based Convolutional Neural Networks
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.15
Xiaqing Li, Guangyan Zhang, H. Howie Huang, Zhufan Wang, Weimin Zheng
{"title":"Performance Analysis of GPU-Based Convolutional Neural Networks","authors":"Xiaqing Li, Guangyan Zhang, H. Howie Huang, Zhufan Wang, Weimin Zheng","doi":"10.1109/ICPP.2016.15","DOIUrl":"https://doi.org/10.1109/ICPP.2016.15","url":null,"abstract":"As one of the most important deep learning models, convolutional neural networks (CNNs) have achieved great successes in a number of applications such as image classification, speech recognition and nature language understanding. Training CNNs on large data sets is computationally expensive, leading to a flurry of research and development of open-source parallel implementations on GPUs. However, few studies have been performed to evaluate the performance characteristics of those implementations. In this paper, we conduct a comprehensive comparison of these implementations over a wide range of parameter configurations, investigate potential performance bottlenecks and point out a number of opportunities for further optimization.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126569471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 109
High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.38
Jie Zhang, Xiaoyi Lu, D. Panda
{"title":"High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters","authors":"Jie Zhang, Xiaoyi Lu, D. Panda","doi":"10.1109/ICPP.2016.38","DOIUrl":"https://doi.org/10.1109/ICPP.2016.38","url":null,"abstract":"Virtualization technology has grown rapidly over the past few decades. As a lightweight solution, container-based virtualization provides a promising approach to efficiently build HPC clouds. However, our study shows clear performance bottleneck when running MPI jobs on multi-container environments. This motivates us to first analyze the performance bottleneck for MPI jobs running in different container deployment scenarios. To eliminate performance bottleneck, we propose a high performance locality-aware MPI library, which is able to dynamically detect co-resident containers at runtime. Through this design, the MPI processes in co-resident containers can communicate to each other by shared memory and Cross Memory Attach (CMA) channels instead of the network channel. A comprehensive performance study indicates that compared with the default case, our proposed design can significantly improve the communication performance by up to 9X and 86% in terms of MPI point-to-point and collective operations, respectively. The results for applications demonstrate that the locality-aware design can reduce up to 16% of execution time. The evaluation results also show that by the help of locality-aware design, we can achieve near-native performance in container-based HPC cloud with minor overhead. The proposed locality-aware MPI design reveals significant potential to be utilized to efficiently build large scale container-based HPC clouds.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133400766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Partial Flattening: A Compilation Technique for Irregular Nested Parallelism on GPGPUs
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.70
Ming-Hsiang Huang, Wuu Yang
{"title":"Partial Flattening: A Compilation Technique for Irregular Nested Parallelism on GPGPUs","authors":"Ming-Hsiang Huang, Wuu Yang","doi":"10.1109/ICPP.2016.70","DOIUrl":"https://doi.org/10.1109/ICPP.2016.70","url":null,"abstract":"Supporting irregular nested parallelism on modern GPUs requires much effort. One should distribute the parallel tasks evenly while preserving reasonable memory usage. Moreover, the task distribution should also fit the thread hierarchy of the underlying GPU to fully exploit its computing power. We propose partial flattening, an automatic code transformation which translates annotated C programs to CUDA kernels. Thread blocks are treated as flat SIMT processors. Iterations are dynamically organized into batches. Batches are executed in a sequential (depth-first) order. A kernel is treated as multiple independent SIMT processors with an additional task-stealing mechanism. Partial flattening allows easy expression of nested parallelism and synchronization by annotating nested parallel loops or parallel-recursive calls, while preserving reasonable memory usage by the depth-first execution order. Our 2-level task distribution scheme does not need special hardware support, and fits well with the CUDA thread hierarchy. Experiments show that partial flattening outperforms NESL significantly in most benchmarks, and obtains 2.15x and 67x speedup over CUDA dynamic parallelism in Quicksort and the Bron-Kerbosch algorithm, respectively.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115055839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Resilient Application Co-scheduling with Processor Redistribution
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.21
A. Benoit, L. Pottier, Y. Robert
{"title":"Resilient Application Co-scheduling with Processor Redistribution","authors":"A. Benoit, L. Pottier, Y. Robert","doi":"10.1109/ICPP.2016.21","DOIUrl":"https://doi.org/10.1109/ICPP.2016.21","url":null,"abstract":"Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted to frequent failures, and resilience techniques must be employed to ensure the completion of large applications. Indeed, failures may create severe imbalance between applications, and significantly degrade performance. In this paper, we propose to redistribute the resources assigned to each application upon the striking of failures, in order to minimize the expected completion time of a set of co-scheduled applications. First, we introduce a formal model and establish complexity results. When no redistribution is allowed, we can minimize the expected completion time in polynomial time, while the problem becomes NP-complete with redistributions, even in a fault-free context. Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122823454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
PCAF: Scalable, High Precision k-NN Search Using Principal Component Analysis Based Filtering
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.79
Huan Feng, D. Eyers, S. Mills, Yongwei Wu, Zhiyi Huang
{"title":"PCAF: Scalable, High Precision k-NN Search Using Principal Component Analysis Based Filtering","authors":"Huan Feng, D. Eyers, S. Mills, Yongwei Wu, Zhiyi Huang","doi":"10.1109/ICPP.2016.79","DOIUrl":"https://doi.org/10.1109/ICPP.2016.79","url":null,"abstract":"Approximate k Nearest Neighbours (AkNN) search is widely used in domains such as computer vision and machine learning. However, AkNN search in high dimensional datasets does not work well on multicore platforms. It scales poorly due to its large memory footprint. Current parallel AkNN search using space subdivision for filtering helps reduce the memory footprint, but leads to loss of precision. We propose a new data filtering method -- PCAF -- for parallel AkNN search based on principal components analysis. PCAF improves on previous methods by demonstrating sustained, high scalability for a wide range of high dimensional datasets on both Intel and AMD multicore platforms. Moreover, PCAF maintains high precision in terms of the AkNN search results.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130394725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Efficient Virtual Network Embedding for Variable Size Virtual Machines in Fat-Tree Data Centers
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.8
Jun Duan, Yuanyuan Yang
{"title":"Efficient Virtual Network Embedding for Variable Size Virtual Machines in Fat-Tree Data Centers","authors":"Jun Duan, Yuanyuan Yang","doi":"10.1109/ICPP.2016.8","DOIUrl":"https://doi.org/10.1109/ICPP.2016.8","url":null,"abstract":"Network virtualization is the enabling technology for sharing resources on cloud. The efficiency of virtual network embedding determines the expense and revenue ratio of a data center. In this paper, we consider the virtual network embedding problem in fat-tree data centers. We design various schemes to embed Nonblocking Multicast Virtual Networks (NMVNs) which are dedicated to deliver premium experience to cloud users. In each NMVN, there is a free combination of virtual machines selected from variable sizes. The bottleneck of communications between these virtual machines is removed so that they can always send data at full bandwidth of their network interface, even if data is simultaneously sent to multiple destinations. In addition, the high performance of NMVNs is guaranteed at the wellcontrolled low network hardware cost. We design two embedding schemes for NMVNs, named Static NMVN Embedding (SNE) and Dynamic NMVN Embedding (DNE). Both schemes support the nonblocking properties for multicast. Besides, each of the two schemes has its unique features. The SNE scheme provides an interference-free solution, in the sense that a virtual network is not aware of the existence of other virtual networks during its lifetime. The DNE scheme has lower hardware cost than SNE and provides higher flexibility to cloud users by possible reconfigurations when necessary. Additionally, we show through theoretical analysis and simulations to validate that the overhead of DNE is minimal thus acceptable to most cloud applications.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130230699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3