2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers

PhiOpenSSL: Using the Xeon Phi Coprocessor for Efficient Cryptographic Calculations
Shun Yao, Dantong Yu
{"title":"PhiOpenSSL: Using the Xeon Phi Coprocessor for Efficient Cryptographic Calculations","authors":"Shun Yao, Dantong Yu","doi":"10.1109/IPDPS.2017.32","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.32","url":null,"abstract":"The Secure Sockets Layer (SSL) is the main protocol used to secure Internet traffic and cloud computing. It relies on the computation-intensive RSA cryptography, which primarily limits the throughput of the handshake process. In this paper, we design and implement an OpenSSL library, termed PhiOpenSSL, which targets the Intel Xeon Phi (KNC) coprocessor, and utilizes Intel Phi's SIMD and multi-threading capability to reduce the SSL computation latency. In particular, PhiOpenSSL vectorizes all big integer multiplications and Montgomery operations involved in RSA calculations and employs the Chinese Remainder Theorem and fixed-window exponentiation in its customized library. In an experiment involving the computation of Montgomery exponentiation, PhiOpenSSL was as much as 15.3 times faster than the two other reference libcrypto libraries, one from the Intel Many-core Platform Software Stack (MPSS) and the other from the default OpenSSL. Our RSA private key cryptography routines in PhiOpenSSL are 1.6-5.7 times faster than those in these two reference systems.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128274218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
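The PhiOpenSSL abstract names two classical techniques, the Chinese Remainder Theorem and fixed-window exponentiation. The sketch below illustrates both in plain Python for a toy RSA key. It is not PhiOpenSSL's code: it omits the Montgomery arithmetic and SIMD vectorization the abstract describes, and the function names and the window width `w=4` are assumptions made for illustration.

```python
def fixed_window_pow(base, exp, mod, w=4):
    """Compute base**exp % mod using a fixed window of w exponent bits."""
    # Precompute base**0 .. base**(2**w - 1) modulo mod.
    table = [1 % mod]
    for _ in range((1 << w) - 1):
        table.append(table[-1] * base % mod)
    # Split the exponent into w-bit digits (least significant first).
    digits = []
    e = exp
    while e > 0:
        digits.append(e & ((1 << w) - 1))
        e >>= w
    result = 1 % mod
    # Process digits from most significant to least significant:
    # w squarings, then one table multiplication per digit.
    for d in reversed(digits):
        for _ in range(w):
            result = result * result % mod
        result = result * table[d] % mod
    return result


def rsa_private_crt(c, p, q, d):
    """RSA private-key operation c**d mod (p*q) via CRT (Garner's formula)."""
    dp, dq = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)  # modular inverse; needs Python 3.8+
    m1 = fixed_window_pow(c % p, dp, p)
    m2 = fixed_window_pow(c % q, dq, q)
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q
```

The CRT split replaces one exponentiation modulo n with two exponentiations modulo the half-size primes, which is the usual reason it speeds up private-key operations.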
Automatic Collapsing of Non-Rectangular Loops
P. Clauss, Ervin Altintas, M. Kuhn
{"title":"Automatic Collapsing of Non-Rectangular Loops","authors":"P. Clauss, Ervin Altintas, M. Kuhn","doi":"10.1109/IPDPS.2017.34","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.34","url":null,"abstract":"Loop collapsing is a well-known loop transformation which combines some loops that are perfectly nested into one single loop. It makes it possible to take advantage of the whole amount of parallelism exhibited by the collapsed loops, and provides a perfect load balancing of iterations among the parallel threads. However, in the current implementations of this loop optimization, such as those of the OpenMP language, automatic loop collapsing is limited to loops with constant loop bounds that define rectangular iteration spaces, although load imbalance is a particularly crucial issue with non-rectangular loops. The OpenMP language addresses load balance mostly through dynamic runtime scheduling of the parallel threads. Nevertheless, this runtime schedule introduces some unavoidable execution-time overhead, while preventing the exploitation of the entire parallelism of all the parallel loops. In this paper, we propose a technique to automatically collapse any perfectly nested loops defining non-rectangular iteration spaces, whose bounds are linear functions of the loop iterators. Such spaces may be triangular, tetrahedral, trapezoidal, rhomboidal or parallelepiped. Our solution is based on original mathematical results addressing the inversion of a multi-variate polynomial that defines a ranking of the integer points contained in a convex polyhedron. 
We show on a set of non-rectangular loop nests that our technique makes it possible to generate parallel OpenMP codes that outperform the original parallel loop nests, parallelized using either the “static” or the “dynamic” option of the OpenMP schedule clause.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130328646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
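To make the idea of inverting a ranking polynomial concrete, here is a minimal sketch for the simplest non-rectangular case, a triangular nest `for i in range(n): for j in range(i + 1)`. The rank of iteration (i, j) is k = i(i+1)/2 + j, and inverting it recovers (i, j) from a single collapsed index k. This hand-derived special case only illustrates the principle; it is not the paper's general method, and the function names are hypothetical.

```python
from math import isqrt


def unrank_triangular(k):
    """Map a collapsed index k back to (i, j) with 0 <= j <= i."""
    # i is the largest integer with i*(i+1)/2 <= k; isqrt keeps this exact.
    i = (isqrt(8 * k + 1) - 1) // 2
    j = k - i * (i + 1) // 2
    return i, j


def collapsed_iterations(n):
    """Enumerate the triangular iteration space through one flat loop."""
    total = n * (n + 1) // 2  # number of integer points in the triangle
    return [unrank_triangular(k) for k in range(total)]
```

Because the flat index range `range(total)` can be split evenly, each thread would receive the same number of iterations regardless of the triangular shape.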
Elastic Data Compression with Improved Performance and Space Efficiency for Flash-Based Storage Systems
Bo Mao, Hong Jiang, Suzhen Wu, Yaodong Yang, Zaifa Xi
{"title":"Elastic Data Compression with Improved Performance and Space Efficiency for Flash-Based Storage Systems","authors":"Bo Mao, Hong Jiang, Suzhen Wu, Yaodong Yang, Zaifa Xi","doi":"10.1109/IPDPS.2017.64","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.64","url":null,"abstract":"Data compression has become a commodity feature for space efficiency and reliability in flash-based storage systems by reducing write traffic and space capacity demand. However, it introduces noticeable processing overheads on the critical I/O path, which degrades the system performance significantly. Existing data compression schemes for flash-based storage systems use fixed compression algorithms for all the incoming write data, failing to recognize and exploit the significant diversity in compressibility and access patterns of data and missing an opportunity to improve the system performance, the space efficiency or both. To achieve a reasonable trade-off between these two important design objectives, in this paper we introduce an Elastic Data Compression scheme, called EDC, which exploits the data compressibility and access intensity characteristics by judiciously matching data of different compressibility with different compression algorithms while leveraging the access idleness. Specifically, for compressible data blocks EDC exploits the compression diversity of the workload, and employs algorithms of higher compression rate in periods of lower system utilization and algorithms of lower compression rate in periods of higher system utilization. For non-compressible (or very lowly compressible) data blocks, it will write them through to the flash storage directly without any compression. The experiments conducted on our lightweight prototype implementation of the EDC system show that EDC saves storage space by up to 38.7%, with an average of 33.7%. 
In addition, it significantly outperforms the fixed compression schemes in the I/O performance measure by up to 61.4%, with an average of 36.7%.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133859746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
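The selection policy described in the EDC abstract (skip incompressible blocks, use a lighter algorithm when the system is busy and a stronger one when it is idle) can be sketched as follows. Everything specific here is an assumption for illustration: the choice of zlib level 1 and LZMA as the "lower" and "higher" compression algorithms, the probe-based compressibility estimate, the thresholds, and the function name. The abstract does not state which algorithms or thresholds EDC uses.

```python
import lzma
import zlib


def edc_compress(block, utilization, busy_threshold=0.5, min_saving=0.1):
    """Pick a compression method for one block; returns (method, payload)."""
    # Cheap probe: a fast zlib pass estimates how compressible the block is.
    probe = zlib.compress(block, 1)
    if len(probe) > len(block) * (1 - min_saving):
        # Non-compressible (or barely compressible): write through unchanged.
        return "none", block
    if utilization >= busy_threshold:
        # Busy period: keep the fast, lower-ratio result.
        return "zlib-1", probe
    # Idle period: spend more CPU time for a higher compression ratio.
    return "lzma", lzma.compress(block)
```

A caller would store the returned method tag alongside the payload so that the read path knows which decompressor to apply.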
Directive-Based Partitioning and Pipelining for Graphics Processing Units
Xuewen Cui, T. Scogland, B. Supinski, Wu-chun Feng
{"title":"Directive-Based Partitioning and Pipelining for Graphics Processing Units","authors":"Xuewen Cui, T. Scogland, B. Supinski, Wu-chun Feng","doi":"10.1109/IPDPS.2017.96","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.96","url":null,"abstract":"The community needs simpler mechanisms to access the performance available in accelerators, such as GPUs, FPGAs, and APUs, due to their increasing use in state-of-the-art supercomputers. Programming models like CUDA, OpenMP, OpenACC and OpenCL can efficiently offload compute-intensive workloads to these devices. By default these models naively offload computation without overlapping it with communication (copying data to or from the device). Achieving performance can require extensive refactoring and hand-tuning to apply optimizations such as pipelining. Further, users must manually partition the dataset whenever its size is larger than device memory, which can be especially difficult when the device memory size is not exposed to the user. We propose a directive-based partitioning and pipelining extension for accelerators appropriate for either OpenMP or OpenACC. Its interface supports overlap of data transfers and kernel computation without explicit user splitting of data. It can map data to a pre-allocated device buffer and automate memory-constrained array indexing and sub-task scheduling. We evaluate a prototype implementation with four different applications. 
The experimental results show that our approach can reduce memory usage by 52% to 97% while delivering a 1.41× to 1.65× speedup over the naive offload model.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134339175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling
V. Bonifaci, Gianlorenzo D'angelo, A. Marchetti-Spaccamela
{"title":"Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling","authors":"V. Bonifaci, Gianlorenzo D'angelo, A. Marchetti-Spaccamela","doi":"10.1109/IPDPS.2017.22","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.22","url":null,"abstract":"We propose a model for scheduling jobs in a parallel machine setting that takes into account the cost of migrations by assuming that the processing time of a job may depend on the specific set of machines among which the job is migrated. For the makespan minimization objective, the model generalizes classical scheduling problems such as unrelated parallel machine scheduling, as well as novel ones such as semi-partitioned and clustered scheduling. In the case of a hierarchical family of machines, we derive a compact integer linear programming formulation of the problem and leverage its fractional relaxation to obtain a polynomial-time 2-approximation algorithm. Extensions that incorporate memory capacity constraints are also discussed.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134110015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms
Jie Wang, Xinfeng Xie, J. Cong
{"title":"Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms","authors":"Jie Wang, Xinfeng Xie, J. Cong","doi":"10.1109/IPDPS.2017.79","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.79","url":null,"abstract":"Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, it was the case that there is limited freedom for communication optimization on GPUs, as conventional GPUs only provide two types of methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced since the Kepler architecture on Nvidia GPUs, which enables threads within the same warp to directly exchange data in registers. This brought new performance optimization opportunities for algorithms with intensive inter-thread communication. In this work, we deploy register shuffle in the application domain of sequence alignment (or similarly, string matching), and conduct a quantitative analysis of the opportunities and limitations of using register shuffle. We select two sequence alignment algorithms, Smith-Waterman (SW) and Pairwise-Hidden-Markov-Model (PairHMM), from the widely used Genome Analysis Toolkit (GATK) as case studies. Compared to implementations using shared memory, we obtain a significant speed-up of 1.2× and 2.1× by using shuffle instructions for SW and PairHMM. Furthermore, we develop a performance model for analyzing the kernel performance based on the measured shuffle latency from a suite of microbenchmarks. 
Our model provides valuable insights for CUDA programmers into how to best use shuffle instructions for performance optimization.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130045694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
Power Efficient Sharing-Aware GPU Data Management
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.106
Abdulaziz Tabbakh, M. Annavaram, Xuehai Qian
{"title":"Power Efficient Sharing-Aware GPU Data Management","authors":"Abdulaziz Tabbakh, M. Annavaram, Xuehai Qian","doi":"10.1109/IPDPS.2017.106","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.106","url":null,"abstract":"The power consumed by memory system in GPUs is a significant fraction of the total chip power. As thread level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share considerable amount of data. However, the default GPU scheduling policy spreads these CTAs to different streaming multiprocessor cores (SM) in a round-robin fashion. Since each SM has a private L1 cache, the shared data among CTAs are replicated across L1 caches of different SMs. Data replication reduces the effective L1 cache size which in turn increases the data movement and power consumption. The goal of this paper is to reduce data movement and increase effective cache space in GPUs. We propose a sharing-aware CTA scheduler that attempts to assign CTAs with data sharing to the same SM to reduce redundant storage of data in private L1 caches across SMs. We further enhance the scheduler with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies private and shared data. Private blocks are given higher priority to stay longer in L1 cache, and shared blocks are given higher priority to stay longer in L2 cache. Essentially, this approach increases the lifetime of shared blocks and private blocks in different cache levels. 
The experimental results show that the proposed scheme reduces the off-chip traffic by 19% which translates to an average DRAM power reduction of 10% and performance improvement of 7%.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130197682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
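The replication effect described in this abstract can be illustrated with a toy model. The sketch assumes, purely for illustration, that CTA i touches data blocks i and i+1 (so neighboring CTAs share one block) and counts how many block copies end up in private L1 caches under two placements: round-robin, and a contiguous placement that keeps neighbors on the same SM. The access pattern, the contiguous policy, and all names are assumptions; the paper's scheduler is not reproduced here.

```python
def l1_copies(num_ctas, num_sms, assign):
    """Count block copies held across all private L1 caches."""
    resident = {sm: set() for sm in range(num_sms)}
    for cta in range(num_ctas):
        # Toy access pattern: CTA i reads blocks i and i + 1.
        resident[assign(cta)].update({cta, cta + 1})
    return sum(len(blocks) for blocks in resident.values())


def round_robin(num_sms):
    """Default-style placement: CTA i goes to SM i mod num_sms."""
    return lambda cta: cta % num_sms


def contiguous(num_ctas, num_sms):
    """Sharing-friendly placement: neighboring CTAs stay on one SM."""
    per_sm = -(-num_ctas // num_sms)  # ceiling division
    return lambda cta: cta // per_sm
```

With 8 CTAs on 2 SMs, round-robin places 16 block copies in L1 while the contiguous placement places 10, since only the block at the partition boundary is duplicated.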
Distributed Vehicle Routing Approximation
A. Krishnan, Mikhail Markov, Borzoo Bonakdarpour
{"title":"Distributed Vehicle Routing Approximation","authors":"A. Krishnan, Mikhail Markov, Borzoo Bonakdarpour","doi":"10.1109/IPDPS.2017.90","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.90","url":null,"abstract":"The classic vehicle routing problem (VRP) is generally concerned with the optimal design of routes by a fleet of vehicles to service a set of customers by minimizing the overall cost, usually the travel distance for the whole set of routes. Although the problem has been extensively studied in the context of operations research and optimization, there is little research on solving the VRP, where distributed vehicles need to compute their respective routes in a decentralized fashion. Our first contribution is a synchronous distributed approximation algorithm that solves the VRP. Using the duality theorem of linear programming, we show that the approximation ratio of our algorithm is O(n · ρ^(1/n) log(n + m)), where ρ is the maximum cost of travel or service in the input VRP instance, n is the size of the graph, and m is the number of vehicles. We report results of simulations and discuss implementation of our algorithm on a real fleet of unmanned aerial systems (UASs) that carry out a set of tasks.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129178154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight
Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, H. Fu, Fangfang Liu, L. Gan, Ping Xu, Wenjing Ma
{"title":"26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight","authors":"Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, H. Fu, Fangfang Liu, L. Gan, Ping Xu, Wenjing Ma","doi":"10.1109/IPDPS.2017.9","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.9","url":null,"abstract":"Stencil computation arises from a broad set of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Due to the memory bound nature, it is a challenging task to optimize stencil computation kernels on modern supercomputers with relatively high computing throughput whilst relatively low data-moving capability. This work serves as a demonstration on the details of the algorithms, implementations and optimizations of a real-world stencil computation in 3D nonhydrostatic atmospheric modeling on the newly announced Sunway TaihuLight supercomputer. At the algorithm level, we present a computation-communication overlapping technique to reduce the inter-process communication overhead, a locality-aware blocking method to fully exploit on-chip parallelism with enhanced data locality, and a collaborative data accessing scheme for sharing data among different threads. In addition, a variety of effective hardware specific implementation and optimization strategies on both the process- and thread-level, from the fine-grained data management to the data layout transformation, are developed to further improve the performance. Our experiments demonstrate that a single-process many-core speedup of as high as 170x can be achieved by using the proposed algorithm and optimization strategies. The code scales well to millions of cores in terms of strong scalability. 
And for the weak-scaling tests, the code can scale in a nearly ideal way to the full system scale of more than 10 million cores, sustaining 25.96 PFLOPS in double precision, which is 20% of the peak performance.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127954560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
Community Detection on the GPU
M. Naim, F. Manne, M. Halappanavar, Antonino Tumeo
{"title":"Community Detection on the GPU","authors":"M. Naim, F. Manne, M. Halappanavar, Antonino Tumeo","doi":"10.1109/IPDPS.2017.16","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.16","url":null,"abstract":"We present and evaluate a new GPU algorithm based on the Louvain method for community detection. Our algorithm is the first for this problem that parallelizes the access to individual edges. In this way we can fine tune the load balance when processing networks with nodes of highly varying degrees. This is achieved by scaling the number of threads assigned to each node according to its degree. Extensive experiments show that we obtain speedups up to a factor of 270 compared to the sequential algorithm. The algorithm consistently outperforms other recent shared memory implementations and is only one order of magnitude slower than the current fastest parallel Louvain method running on a Blue Gene/Q supercomputer using more than 500K threads.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129981241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23