2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) — Latest Publications

Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling
V. Bonifaci, Gianlorenzo D'Angelo, A. Marchetti-Spaccamela
DOI: 10.1109/IPDPS.2017.22 · Published: 2017-05-01
Abstract: We propose a model for scheduling jobs in a parallel machine setting that takes into account the cost of migrations by assuming that the processing time of a job may depend on the specific set of machines among which the job is migrated. For the makespan minimization objective, the model generalizes classical scheduling problems such as unrelated parallel machine scheduling, as well as novel ones such as semi-partitioned and clustered scheduling. In the case of a hierarchical family of machines, we derive a compact integer linear programming formulation of the problem and leverage its fractional relaxation to obtain a polynomial-time 2-approximation algorithm. Extensions that incorporate memory capacity constraints are also discussed.
Citations: 10
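The paper's LP-based algorithm is involved; as a point of reference, Graham's classic list-scheduling rule already achieves the same 2-approximation ratio in the plain identical-machines special case of this model. A minimal sketch (our illustration, not the paper's algorithm):

```python
import heapq

def greedy_makespan(jobs, m):
    """Graham's list scheduling: assign each job to the currently
    least-loaded machine. A classic 2-approximation for makespan on
    identical machines (illustration only; the paper's LP-rounding
    algorithm achieves this ratio in the richer hierarchical model)."""
    loads = [(0.0, i) for i in range(m)]  # (load, machine id) min-heap
    heapq.heapify(loads)
    assignment = [None] * len(jobs)
    for j, p in enumerate(jobs):
        load, i = heapq.heappop(loads)
        assignment[j] = i
        heapq.heappush(loads, (load + p, i))
    return max(load for load, _ in loads), assignment
```

For jobs `[3, 1, 4, 1, 5]` on two machines this yields makespan 9, within a factor 2 of the optimum 7.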
Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms
Jie Wang, Xinfeng Xie, J. Cong
DOI: 10.1109/IPDPS.2017.79 · Published: 2017-05-01
Abstract: Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provide two types of methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced since the Kepler architecture on Nvidia GPUs, which enables threads within the same warp to directly exchange data in registers. This brought new performance optimization opportunities for algorithms with intensive inter-thread communication. In this work, we deploy register shuffle in the application domain of sequence alignment (or similarly, string matching), and conduct a quantitative analysis of the opportunities and limitations of using register shuffle. We select two sequence alignment algorithms, Smith-Waterman (SW) and Pairwise-Hidden-Markov-Model (PairHMM), from the widely used Genome Analysis Toolkit (GATK) as case studies. Compared to implementations using shared memory, we obtain a significant speed-up of 1.2x and 2.1x by using shuffle instructions for SW and PairHMM. Furthermore, we develop a performance model for analyzing the kernel performance based on the measured shuffle latency from a suite of microbenchmarks. Our model provides valuable insights for CUDA programmers into how to best use shuffle instructions for performance optimization.
Citations: 17
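The warp-shuffle communication pattern the paper exploits can be modeled outside CUDA. Below is a Python emulation of a `__shfl_down_sync`-style tree reduction across a 32-lane warp; the out-of-range behavior (lanes past the end keep their own value) follows CUDA's semantics, but the paper's actual SW and PairHMM kernels are far more involved than this sum reduction:

```python
def shfl_down(regs, delta):
    """Model of CUDA's __shfl_down_sync: lane i reads lane i+delta's
    register; lanes past the end of the warp keep their own value."""
    n = len(regs)
    return [regs[i + delta] if i + delta < n else regs[i] for i in range(n)]

def warp_reduce_sum(regs):
    """Tree reduction across a warp using only register exchanges,
    i.e. no shared memory round trips. Assumes a power-of-two lane
    count (32 on real hardware); lane 0 ends up with the full sum."""
    delta = len(regs) // 2
    while delta >= 1:
        shifted = shfl_down(regs, delta)
        regs = [a + b for a, b in zip(regs, shifted)]
        delta //= 2
    return regs[0]
```

Each halving step exchanges values purely "in registers", which is why shuffle-based kernels avoid the shared-memory latency the paper measures against.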
Power Efficient Sharing-Aware GPU Data Management
Abdulaziz Tabbakh, M. Annavaram, Xuehai Qian
DOI: 10.1109/IPDPS.2017.106 · Published: 2017-05-01
Abstract: The power consumed by the memory system in GPUs is a significant fraction of the total chip power. As thread-level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share a considerable amount of data. However, the default GPU scheduling policy spreads these CTAs to different streaming multiprocessor cores (SMs) in a round-robin fashion. Since each SM has a private L1 cache, the shared data among CTAs are replicated across the L1 caches of different SMs. Data replication reduces the effective L1 cache size, which in turn increases data movement and power consumption. The goal of this paper is to reduce data movement and increase effective cache space in GPUs. We propose a sharing-aware CTA scheduler that attempts to assign CTAs with data sharing to the same SM to reduce redundant storage of data in private L1 caches across SMs. We further enhance the scheduler with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies private and shared data. Private blocks are given higher priority to stay longer in L1 cache, and shared blocks are given higher priority to stay longer in L2 cache. Essentially, this approach increases the lifetime of shared blocks and private blocks in different cache levels. The experimental results show that the proposed scheme reduces off-chip traffic by 19%, which translates to an average DRAM power reduction of 10% and a performance improvement of 7%.
Citations: 13
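The core scheduling idea can be sketched abstractly: place each CTA on the SM whose resident CTAs share the most data with it, so a shared block is cached once per SM rather than replicated. This greedy heuristic is our own illustration of the principle, not the paper's hardware scheduler:

```python
def sharing_aware_schedule(cta_blocks, num_sms, capacity):
    """Greedy sketch of sharing-aware CTA placement. cta_blocks maps
    CTA id -> set of data-block ids it touches; capacity bounds the
    CTAs per SM. Ties in sharing are broken toward the lighter SM."""
    sm_blocks = [set() for _ in range(num_sms)]  # blocks cached per SM
    sm_load = [0] * num_sms
    placement = {}
    for cta, blocks in cta_blocks.items():
        candidates = [s for s in range(num_sms) if sm_load[s] < capacity]
        best = max(candidates,
                   key=lambda s: (len(blocks & sm_blocks[s]), -sm_load[s]))
        placement[cta] = best
        sm_blocks[best] |= blocks
        sm_load[best] += 1
    return placement
```

With CTAs 0 and 1 sharing blocks {1, 2} and CTAs 2 and 3 sharing block {9}, the sharers land on the same SM instead of being round-robined apart.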
Distributed Vehicle Routing Approximation
A. Krishnan, Mikhail Markov, Borzoo Bonakdarpour
DOI: 10.1109/IPDPS.2017.90 · Published: 2017-05-01
Abstract: The classic vehicle routing problem (VRP) is generally concerned with the optimal design of routes by a fleet of vehicles to service a set of customers by minimizing the overall cost, usually the travel distance for the whole set of routes. Although the problem has been extensively studied in the context of operations research and optimization, there is little research on solving the VRP where distributed vehicles need to compute their respective routes in a decentralized fashion. Our first contribution is a synchronous distributed approximation algorithm that solves the VRP. Using the duality theorem of linear programming, we show that the approximation ratio of our algorithm is O(n · (ρ)^(1/n) log(n + m)), where ρ is the maximum cost of travel or service in the input VRP instance, n is the size of the graph, and m is the number of vehicles. We report results of simulations and discuss implementation of our algorithm on a real fleet of unmanned aerial systems (UASs) that carry out a set of tasks.
Citations: 2
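For contrast with the paper's distributed LP-based algorithm, a toy centralized baseline makes the problem shape concrete. The nearest-neighbor construction below is a standard textbook heuristic, included only to illustrate what a VRP solution looks like; it has none of the paper's decentralization or approximation guarantees:

```python
def nearest_neighbor_routes(depot, customers, dist, num_vehicles):
    """Toy centralized VRP baseline: round-robin over vehicles, each
    extending its route to the nearest still-unserved customer, then
    all vehicles return to the depot. Illustration only."""
    unserved = set(customers)
    routes = [[depot] for _ in range(num_vehicles)]
    v = 0
    while unserved:
        here = routes[v][-1]
        nxt = min(unserved, key=lambda c: dist(here, c))
        routes[v].append(nxt)
        unserved.remove(nxt)
        v = (v + 1) % num_vehicles
    return [r + [depot] for r in routes]
```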
26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight
Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, H. Fu, Fangfang Liu, L. Gan, Ping Xu, Wenjing Ma
DOI: 10.1109/IPDPS.2017.9 · Published: 2017-05-01
Abstract: Stencil computation arises from a broad set of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Due to their memory-bound nature, it is a challenging task to optimize stencil computation kernels on modern supercomputers with relatively high computing throughput but relatively low data-moving capability. This work serves as a demonstration of the details of the algorithms, implementations, and optimizations of a real-world stencil computation in 3D nonhydrostatic atmospheric modeling on the newly announced Sunway TaihuLight supercomputer. At the algorithm level, we present a computation-communication overlapping technique to reduce the inter-process communication overhead, a locality-aware blocking method to fully exploit on-chip parallelism with enhanced data locality, and a collaborative data accessing scheme for sharing data among different threads. In addition, a variety of effective hardware-specific implementation and optimization strategies at both the process and thread level, from fine-grained data management to data layout transformation, are developed to further improve the performance. Our experiments demonstrate that a single-process many-core speedup of as high as 170x can be achieved using the proposed algorithm and optimization strategies. The code scales well to millions of cores in terms of strong scalability. And for the weak-scaling tests, the code can scale in a nearly ideal way to the full system scale of more than 10 million cores, sustaining 25.96 PFLOPS in double precision, which is 20% of the peak performance.
Citations: 25
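The locality-aware blocking idea, reduced to a 2D toy: sweep the grid interior in tiles so each tile's working set stays resident in fast memory. This is a generic cache-blocking sketch of ours, not the paper's 3D kernel, which is hand-tuned for the SW26010's scratchpad memories:

```python
def blocked_stencil_2d(grid, bi, bj):
    """Cache-blocked 5-point averaging stencil (Jacobi-style: reads the
    old grid, writes a fresh copy). Blocking changes the traversal
    order for locality, not the numerical result."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary cells are left unchanged
    for ii in range(1, n - 1, bi):
        for jj in range(1, m - 1, bj):
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, m - 1)):
                    out[i][j] = (grid[i][j] + grid[i-1][j] + grid[i+1][j]
                                 + grid[i][j-1] + grid[i][j+1]) / 5.0
    return out
```

Because only the loop order changes, any tile shape produces bitwise-identical output to the unblocked sweep.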
Community Detection on the GPU
M. Naim, F. Manne, M. Halappanavar, Antonino Tumeo
DOI: 10.1109/IPDPS.2017.16 · Published: 2017-05-01
Abstract: We present and evaluate a new GPU algorithm based on the Louvain method for community detection. Our algorithm is the first for this problem that parallelizes the access to individual edges. In this way we can fine-tune the load balance when processing networks with nodes of highly varying degrees. This is achieved by scaling the number of threads assigned to each node according to its degree. Extensive experiments show that we obtain speedups up to a factor of 270 compared to the sequential algorithm. The algorithm consistently outperforms other recent shared-memory implementations and is only one order of magnitude slower than the current fastest parallel Louvain method running on a Blue Gene/Q supercomputer using more than 500K threads.
Citations: 23
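The load-balancing trick (scaling threads per node with degree) can be sketched as a binning step. The power-of-two group sizes capped at a warp are our assumption for illustration; the paper's exact binning scheme may differ:

```python
def threads_per_node(degrees, warp_size=32):
    """Assign each node the smallest power-of-two thread group, up to
    one warp, large enough to cover its degree, so that edge accesses
    are parallelized without leaving most lanes idle on low-degree
    nodes (degrees maps node id -> degree)."""
    groups = {}
    for node, deg in degrees.items():
        t = 1
        while t < deg and t < warp_size:
            t *= 2
        groups[node] = t
    return groups
```

A degree-1 node gets a single thread, a degree-5 node gets 8, and anything at or above warp width saturates at 32, which is the kind of degree-proportional assignment the abstract describes.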
Characterizing and Modeling Power and Energy for Extreme-Scale In-Situ Visualization
Vignesh Adhinarayanan, Wu-chun Feng, D. Rogers, J. Ahrens, S. Pakin
DOI: 10.1109/IPDPS.2017.113 · Published: 2017-05-01
Abstract: Plans for exascale computing have identified power and energy as looming problems for simulations running at that scale. In particular, writing to disk all the data generated by these simulations is becoming prohibitively expensive due to the energy consumption of the supercomputer while it idles waiting for data to be written to permanent storage. In addition, the power cost of data movement is also steadily increasing. A solution to this problem is to write only a small fraction of the data generated while still maintaining the cognitive fidelity of the visualization. With domain scientists increasingly amenable toward adopting an in-situ framework that can identify and extract valuable data from extremely large simulation results and write them to permanent storage as compact images, a large-scale simulation will commit to disk a reduced dataset of data extracts that will be much smaller than the raw results, resulting in savings in both power and energy. The goal of this paper is two-fold: (i) to understand the role of in-situ techniques in combating power and energy issues of extreme-scale visualization and (ii) to create a model for performance, power, energy, and storage to facilitate what-if analysis. Our experiments on a specially instrumented, dedicated 150-node cluster show that while it is difficult to achieve power savings in practice using in-situ techniques, applications can achieve significant energy savings due to shorter write times for in-situ visualization. We present a characterization of power and energy for in-situ visualization; an application-aware, architecture-specific methodology for modeling and analysis of such in-situ workflows; and results that uncover indirect power savings in visualization workflows for high-performance computing (HPC).
Citations: 7
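The abstract's key distinction (little power saving, yet significant energy saving) follows from energy being power integrated over time: in-situ visualization shrinks the I/O-wait phase rather than the power draw. A back-of-the-envelope model, with entirely hypothetical parameters and none of the paper's fitted coefficients:

```python
def run_energy(sim_time, write_time, power_compute, power_idle):
    """Two-phase energy model: the machine draws power_compute while
    simulating and power_idle while waiting on writes to storage.
    Energy = sum of (power * duration) over the phases."""
    return power_compute * sim_time + power_idle * write_time

def energy_savings(sim_time, write_raw, write_insitu, power_compute, power_idle):
    """Energy saved by in-situ output comes entirely from the shorter
    write phase; instantaneous power is unchanged in this model."""
    return (run_energy(sim_time, write_raw, power_compute, power_idle)
            - run_energy(sim_time, write_insitu, power_compute, power_idle))
```

With a 100 s compute phase at 200 W and idle draw of 100 W, cutting the write phase from 50 s to 5 s saves 100 W x 45 s = 4500 J even though peak power never drops, mirroring the paper's qualitative finding.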
Accommodating Thread-Level Heterogeneity in Coupled Parallel Applications
S. Gutierrez, K. Davis, D. Arnold, R. Baker, R. Robey, P. McCormick, Daniel Holladay, J. Dahl, R. Zerr, Florian Weik, Christoph Junghans
DOI: 10.1109/IPDPS.2017.13 · Published: 2017-05-01
Abstract: Hybrid parallel programming models that combine message passing and multithreading (MP+MT) are becoming more popular, extending the basic message passing (MP) model that uses single-threaded processes for both inter- and intra-node parallelism. A consequence is that coupled parallel applications increasingly comprise MP libraries together with MP+MT libraries with differing preferred degrees of threading, resulting in thread-level heterogeneity. Retroactively matching threading levels between independently developed and maintained libraries is difficult; the challenge is exacerbated because contemporary parallel job launchers provide only static resource-binding policies over entire application executions. A standard approach for accommodating thread-level heterogeneity is to under-subscribe compute resources such that the library with the highest degree of threading per process has one processing element per thread. This results in libraries with fewer threads per process utilizing only a fraction of the available compute resources. We present and evaluate a novel approach for accommodating thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application's execution by providing programmable facilities to dynamically reconfigure runtime environments for compute phases with differing threading factors and memory affinities. We show that our approach can improve overall application performance by up to 5.8x in real-world production codes. Furthermore, the practicality and utility of our approach has been demonstrated by continuous production use for over one year, and by more recent incorporation into a number of production codes.
Citations: 12
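The core arithmetic of the reconfiguration idea is simple: instead of statically under-subscribing for the most-threaded library, derive a per-phase process count so that processes x threads fills every core. A miniature sketch (phase names and the layout policy are hypothetical; the paper's facilities also handle memory affinity, which this omits):

```python
def phase_layout(cores_per_node, phases):
    """For each compute phase, pick processes-per-node so that
    procs_per_node * threads_per_proc == cores_per_node, keeping all
    cores busy in both MP (1-thread) and MP+MT (many-thread) phases.
    phases maps phase name -> preferred threads per process."""
    layout = {}
    for name, threads in phases.items():
        if cores_per_node % threads:
            raise ValueError(
                f"{threads} threads per process do not evenly divide "
                f"{cores_per_node} cores")
        layout[name] = {"procs_per_node": cores_per_node // threads,
                        "threads_per_proc": threads}
    return layout
```

On a 16-core node, a single-threaded MP solver phase runs 16 processes while an 8-thread physics phase runs 2, with no cores idle in either; static under-subscription would instead pin the MP phase to 2 processes as well.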
Data Centric Performance Measurement Techniques for Chapel Programs
Hui Zhang, J. Hollingsworth
DOI: 10.1109/IPDPS.2017.37 · Published: 2017-05-01
Abstract: Chapel is an emerging PGAS (Partitioned Global Address Space) language whose design goal is to make parallel programming more productive and generally accessible. To date, the implementation effort has focused primarily on correctness over performance. We present a performance measurement technique for Chapel, and the idea is also applicable to other PGAS models. The unique feature of our tool is that it associates the performance statistics not with code regions (functions), but with the variables (including heap-allocated, static, and local variables) in the source code. Unlike code-centric methods, this data-centric analysis capability exposes new optimization opportunities that are useful in resolving data locality problems. This paper introduces our idea and implementations of the approach with three benchmarks. We also include a case study optimizing benchmarks based on the information from our tool. The optimized versions improved performance by a factor of 1.4x for LULESH, 2.3x for MiniMD, and 2.1x for CLOMP with simple modifications to the source code.
Citations: 7
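The data-centric idea is to attribute a sampled memory access to the source-level variable whose allocation it falls in, rather than attributing the sampling PC to a function. A simplified sketch of that attribution step (the variable table layout is our assumption, not the tool's internals):

```python
import bisect

def build_var_map(variables):
    """variables: list of (name, start_address, size) for tracked
    allocations. Returns a sorted lookup structure keyed by start."""
    ranges = sorted(variables, key=lambda v: v[1])
    starts = [start for _, start, _ in ranges]
    return starts, ranges

def attribute(sample_addr, starts, ranges):
    """Map one sampled address to the variable whose [start, start+size)
    range contains it, or 'unknown' if it hits untracked memory."""
    i = bisect.bisect_right(starts, sample_addr) - 1
    if i >= 0:
        name, start, size = ranges[i]
        if start <= sample_addr < start + size:
            return name
    return "unknown"
```

Aggregating sample counts per returned name yields exactly the kind of per-variable profile the abstract describes, which can then point at the array causing poor locality.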
Memory Compression Techniques for Network Address Management in MPI
Yanfei Guo, C. Archer, M. Blocksome, Scott Parker, Wesley Bland, Kenneth Raffenetti, P. Balaji
DOI: 10.1109/IPDPS.2017.18 · Published: 2017-05-01
Abstract: MPI allows applications to treat processes as a logical collection of integer ranks for each MPI communicator, while internally translating these logical ranks into actual network addresses. In current MPI implementations, the management and lookup of such network addresses use memory sizes that are proportional to the number of processes in each communicator. In this paper, we propose a new mechanism, called AV-Rankmap, for managing such translation. AV-Rankmap takes advantage of logical patterns in rank-address mapping that most applications naturally tend to have, and it exploits the fact that some parts of network address structures are naturally more performance-critical than others. It uses this information to compress the memory used for network address management. We demonstrate that AV-Rankmap can achieve performance similar to or better than that of other MPI implementations while using significantly less memory.
Citations: 8
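One way the "logical patterns in rank-address mapping" could be exploited: when the rank-to-address map is an affine sequence, store two integers instead of one table entry per rank. The encoding below is our illustration of the idea, not MPICH's actual AV-Rankmap structures:

```python
def compress_rankmap(addresses):
    """If rank -> address follows base + rank * stride, keep just
    (base, stride): O(1) memory per communicator instead of O(P).
    Otherwise fall back to the dense table."""
    if len(addresses) >= 2:
        stride = addresses[1] - addresses[0]
        if all(addresses[r] == addresses[0] + r * stride
               for r in range(len(addresses))):
            return ("strided", addresses[0], stride)
    return ("table", list(addresses))

def lookup(compressed, rank):
    """Translate a logical rank back to its network address."""
    if compressed[0] == "strided":
        _, base, stride = compressed
        return base + rank * stride
    return compressed[1][rank]
```

A communicator laid out contiguously across nodes compresses to two integers; an irregular mapping (e.g. after a communicator split) keeps the dense table, so lookups stay correct either way.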