2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)最新文献

筛选
英文 中文
Localized Fault Recovery for Nested Fork-Join Programs 嵌套Fork-Join程序的局部故障恢复
Gokcen Kestor, S. Krishnamoorthy, Wenjing Ma
{"title":"Localized Fault Recovery for Nested Fork-Join Programs","authors":"Gokcen Kestor, S. Krishnamoorthy, Wenjing Ma","doi":"10.1109/IPDPS.2017.75","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.75","url":null,"abstract":"Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132822982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 18
Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage Papyrus的设计与实现:并行聚合持久存储
Jungwon Kim, Kittisak Sajjapongse, Seyong Lee, J. Vetter
{"title":"Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage","authors":"Jungwon Kim, Kittisak Sajjapongse, Seyong Lee, J. Vetter","doi":"10.1109/IPDPS.2017.72","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.72","url":null,"abstract":"A surprising development in recently announced HPC platforms is the addition of, sometimes massive amounts of, persistent (nonvolatile) memory (NVM) in order to increase memory capacity and compensate for plateauing I/O capabilities. However, there are no portable and scalable programming interfaces using aggregate NVM effectively. This paper introduces Papyrus: a new software system built to exploit emerging capability of NVM in HPC architectures. Papyrus (or Parallel Aggregate Persistent -YRU- Storage) is a novel programming system that provides features for scalable, aggregate, persistent memory in an extreme-scale system for typical HPC usage scenarios. Papyrus mainly consists of Papyrus Virtual File System (VFS) and Papyrus Template Container Library (TCL). Papyrus VFS provides a uniform aggregate NVM storage image across diverse NVM architectures. It enables Papyrus TCL to provide a portable and scalable high-level container programming interface whose data elements are distributed across multiple NVM nodes without requiring the user to handle complex communication, synchronization, replication, and consistency model. We evaluate Papyrus on two HPC systems, including UTK Beacon and NERSC Cori, using real NVM storage devices.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131848758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems 高度并行加速器系统中并行环路和数据的自动分布
Yonghong Yan, Jiawen Liu, K. Cameron, M. Umar
{"title":"HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems","authors":"Yonghong Yan, Jiawen Liu, K. Cameron, M. Umar","doi":"10.1109/IPDPS.2017.99","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.99","url":null,"abstract":"Heterogeneous computing systems, e.g., those with accelerators than the host CPUs, offer the accelerated performance for a variety of workloads. However, most parallel programming models require platform dependent, time-consuming hand-tuning efforts for collectively using all the resources in a system to achieve efficient results. In this work, we explore the use of OpenMP parallel language extensions to empower users with the ability to design applications that automatically and simultaneously leverage CPUs and accelerators to further optimize use of available resources. We believe such automation will be key to ensuring codes adapt to increases in the number and diversity of accelerator resources for future computing systems. The proposed system combines language extensions to OpenMP, load-balancing algorithms and heuristics, and a runtime system for loop distribution across heterogeneous processing elements. We demonstrate the effectiveness of our automated approach to program on systems with multiple CPUs, GPUs, and MICs.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114423383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Application Level Reordering of Remote Direct Memory Access Operations 远程直接内存访问操作的应用程序级重新排序
W. Lavrijsen, Costin Iancu
{"title":"Application Level Reordering of Remote Direct Memory Access Operations","authors":"W. Lavrijsen, Costin Iancu","doi":"10.1109/IPDPS.2017.98","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.98","url":null,"abstract":"We present methods for the effective application level reordering of non-blocking RDMA operations. We supplement out-of-order hardware delivery mechanisms with heuristics to account for the CPU side overhead of communication and for differences in network latency: a runtime scheduler takes into account message sizes, destination and concurrency and reorders operations to improve overall communication throughput. Results are validated on InfiniBand and Cray Aries networks, for SPMD and hybrid (SPMD+OpenMP) programming models. We show up to 5! potential speedup, with 30-50% more typical, for synthetic message patterns in microbenchmarks. We also obtain up to 33% improvement in the communication stages in application settings. While the design space is complex, the resulting scheduler is simple, both internally and at the application level interfaces. It also provides performance portability across networks and programming models. We believe these techniques can be easily retrofitted within any application or runtime framework that uses one-sided communication, e.g. using GASNet, MPI 3.0 RMA or low level APIs such as IBVerbs.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116174080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
E^2MC: Entropy Encoding Based Memory Compression for GPUs 基于熵编码的gpu内存压缩
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.101
S. Lal, J. Lucas, B. Juurlink
{"title":"E^2MC: Entropy Encoding Based Memory Compression for GPUs","authors":"S. Lal, J. Lucas, B. Juurlink","doi":"10.1109/IPDPS.2017.101","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.101","url":null,"abstract":"Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Unfortunately, off-chip memory bandwidth is growing slower than the number of cores and has become a performance bottleneck. Thus, optimizations of effective memory bandwidth play a significant role for scaling the performance of GPUs. Memory compression is a promising approach for improving memory bandwidth which can translate into higher performance and energy efficiency. However, compression is not free and its challenges need to be addressed, otherwise the benefits of compression may be offset by its overhead. We propose an entropy encoding based memory compression (E2MC) technique for GPUs, which is based on the well-known Huffman encoding. We study the feasibility of entropy encoding for GPUs and show that it achieves higher compression ratios than state-of-the-art GPU compression techniques. Furthermore, we address the key challenges of probability estimation, choosing an appropriate symbol length for encoding, and decompression with low latency. The average compression ratio of E2MC is 53% higher than the state of the art. This translates into an average speedup of 20% compared to no compression and 8% higher compared to the state of the art. Energy consumption and energy-delayproduct are reduced by 13% and 27%, respectively. Moreover, the compression ratio achieved by E2MC is close to the optimal compression ratio given by Shannon’s source coding theorem.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123194738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
A Scalable System Architecture to Addressing the Next Generation of Predictive Simulation Workflows with Coupled Compute and Data Intensive Applications 一个可扩展的系统架构,以解决与耦合计算和数据密集型应用程序的下一代预测仿真工作流
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.129
M. Seager
{"title":"A Scalable System Architecture to Addressing the Next Generation of Predictive Simulation Workflows with Coupled Compute and Data Intensive Applications","authors":"M. Seager","doi":"10.1109/IPDPS.2017.129","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.129","url":null,"abstract":"Trends in the emerging digital economy are pushing the virtual representation of products and services. Creating these digital twins requires a combination of real time data ingestion, simulation of physical products under real world conditions, service delivery optimization and data analytics as well as ML/DL anomaly detection and decision making. Quantification of Uncertainty in the simulations will also be a compute and data intensive workflow that will drive the simulation improvement cycle. Future high-end computing systems designs need to comprehend these types of complex workflows and provide a flexible framework for optimizing the design and operations under dynamic load conditions for them.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124850288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Clustering Throughput Optimization on the GPU GPU上的集群吞吐量优化
M. Gowanlock, C. Rude, D. M. Blair, Justin D. Li, V. Pankratius
{"title":"Clustering Throughput Optimization on the GPU","authors":"M. Gowanlock, C. Rude, D. M. Blair, Justin D. Li, V. Pankratius","doi":"10.1109/IPDPS.2017.17","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.17","url":null,"abstract":"Large datasets in astronomy and geoscience often require clustering and visualizations of phenomena at different densities and scales in order to generate scientific insight. We examine the problem of maximizing clustering throughput for concurrent dataset clustering in spatial dimensions. We introduce a novel hybrid approach that uses GPUs in conjunction with multicore CPUs for algorithmic throughput optimizations. The key idea is to exploit the fast memory on the GPU for index searches and optimize I/O transfers in such a way that the low-bandwidth host-GPU bottleneck does not have a significant negative performance impact. To achieve this, we derive two distinct GPU kernels that exploit grid-based indexing schemes to improve clustering performance. To obviate limited GPU memory and enable large dataset clustering, our method is complemented by an efficient batching scheme for transfers between the host and GPU accelerator. This scheme is robust with respect to both sparse and dense data distributions and intelligently avoids buffer overflows that would otherwise degrade performance, all while minimizing the number of data transfers between the host and GPU. We evaluate our approaches on ionospheric total electron content datasets as well as intermediate-redshift galaxies from the Sloan Digital Sky Survey. Our hybrid approach yields a speedup of up to 50x over the sequential implementation on one of the experimental scenarios, which is respectable for I/O intensive clustering.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127835295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers MetaKV:用于分布式突发缓冲区元数据管理的键值存储
Teng Wang, A. Moody, Yue Zhu, K. Mohror, Kento Sato, T. Islam, Weikuan Yu
{"title":"MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers","authors":"Teng Wang, A. Moody, Yue Zhu, K. Mohror, Kento Sato, T. Islam, Weikuan Yu","doi":"10.1109/IPDPS.2017.39","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.39","url":null,"abstract":"Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. Their aggregate storage bandwidth grows linearly with system node count. However, although scientific applications can achieve scalable write bandwidth by having each process write to its node-local burst buffer, metadata challenges remain formidable, especially for files shared across many processes. This is due to the need to track and organize file segments across the distributed burst buffers in a global index. Because this global index can be accessed concurrently by thousands or more processes in a scientific application, the scalability of metadata management is a severe performance-limiting factor. In this paper, we propose MetaKV: a key-value store that provides fast and scalable metadata management for HPC metadata workloads on distributed burst buffers. MetaKV complements the functionality of an existing key-value store with specialized metadata services that efficiently handle bursty and concurrent metadata workloads: compressed storage management, supervised block clustering, and log-ring based collective message reduction. Our experiments demonstrate that MetaKV outperforms the state-of-the-art key-value stores by a significant margin. It improves put and get metadata operations by as much as 2.66× and 6.29×, respectively, and the benefits of MetaKV increase with increasing metadata workload demand.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129041877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 17
PUNAS: A Parallel Ungapped-Alignment-Featured Seed Verification Algorithm for Next-Generation Sequencing Read Alignment 新一代测序读比对的并行unapped -Alignment种子验证算法
Yuandong Chan, Kai Xu, Haidong Lan, Weiguo Liu, Yongchao Liu, B. Schmidt
{"title":"PUNAS: A Parallel Ungapped-Alignment-Featured Seed Verification Algorithm for Next-Generation Sequencing Read Alignment","authors":"Yuandong Chan, Kai Xu, Haidong Lan, Weiguo Liu, Yongchao Liu, B. Schmidt","doi":"10.1109/IPDPS.2017.35","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.35","url":null,"abstract":"The progress of next-generation sequencing has a major impact on medical and genomic research. This technology can now produce billions of short DNA fragments (reads) in a single run. One of the most demanding computational problems used by almost every sequencing pipeline is short-read alignment; i.e. determining where each fragment originated from in the original genome. Most current solutions are based on a seed-and-extend approach, where promising candidate regions (seeds) are first identified and subsequently extended in order to verify whether a full high-scoring alignment actually exists in the vicinity of each seed. Seed verification is the main bottleneck in many state-of-the-art aligners and thus finding fast solutions is of high importance. We present a parallel ungapped-alignment-featured seed verification (PUNAS) algorithm, a fast filter for effectively removing the majority of false positive seeds, thus significantly accelerating the short-read alignment process. PUNAS is based on bit-parallelism and takes advantage of SIMD vector units of modern microprocessors. Our implementation employs a vectorize-and-scale approach supporting multi-core CPUs and many-core Knights Landing (KNL)-based Xeon Phi processors. Performance evaluation reveals that PUNAS is over three orders-of-magnitude faster than seed verification with the Smith-Waterman algorithm and around one order-of-magnitude faster than seed verification with the banded version of Myers bit-vector algorithm. Using a single thread it achieves a speedup of up to 7.3, 27.1, and 11.6 compared to the shifted Hamming distance filter on a SSE, AVX2, and AVX-512 based CPU/KNL, respectively. The speed of our framework further scales almost linearly with the number of cores. PUNAS is open-source software available at https://github.com/Xu-Kai/PUNASfilter.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130497644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Addressing Performance Heterogeneity in MapReduce Clusters with Elastic Tasks 使用弹性任务解决MapReduce集群的性能异构问题
Wei Chen, J. Rao, Xiaobo Zhou
{"title":"Addressing Performance Heterogeneity in MapReduce Clusters with Elastic Tasks","authors":"Wei Chen, J. Rao, Xiaobo Zhou","doi":"10.1109/IPDPS.2017.28","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.28","url":null,"abstract":"MapReduce applications, which require access to a large number of computing nodes, are commonly deployed in heterogeneous environments. The performance discrepancy between individual nodes in a heterogeneous cluster present significant challenges to attain good performance in MapReduce jobs. MapReduce implementations designed and optimized for homogeneous environments perform poorly on heterogeneous clusters. We attribute suboptimal performance in heterogeneous clusters to significant load imbalance between map tasks. We identify two MapReduce designs that hinder load balancing: (1) static binding between mappers and their data makes it difficult to exploit data redundancy for load balancing; (2) uniform map sizes is not optimal for nodes with heterogeneous performance. To address these issues, we propose FlexMap, a user-transparent approach that dynamically provisions map tasks to match distinct machine capacity in heterogeneous environments. We implemented FlexMap in Hadoop-2.6.0. Experimental results show that it reduces job completion time by as much as 40% compared to stock Hadoop and 30% to SkewTune.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132444552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信