SC14: International Conference for High Performance Computing, Networking, Storage and Analysis — Latest Publications

DISC: A Domain-Interaction Based Programming Model with Support for Heterogeneous Execution
Authors: Mehmet Can Kurt, G. Agrawal
DOI: 10.1109/SC.2014.76
Abstract: Several emerging trends point to increasing heterogeneity among nodes and/or cores in HPC systems. Existing programming models, especially for distributed-memory execution, have typically been designed to facilitate high performance on homogeneous systems. This paper describes a programming model and an associated runtime system we have developed to address this need. The main concepts in the programming model are a domain and the interactions between domain elements. We explain how stencil computations, unstructured grid computations, and molecular dynamics applications can be expressed using these simple concepts, and show how inter-process communication can be handled efficiently at runtime from knowledge of the domain interactions alone, for different types of applications. We then develop techniques for the runtime system to automatically partition and re-partition the work among heterogeneous processors or nodes.
Citations: 5
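The core idea of the model — the programmer declares a domain and which of its elements interact, and the runtime derives the communication from that declaration — can be illustrated with a toy halo-exchange calculation. All names below are hypothetical illustrations, not the paper's actual API:

```python
# Illustrative only: in a domain-interaction style model, a 1-D stencil
# interaction of radius r implies that each process must receive the r
# elements on either side of its contiguous sub-domain.

def ghost_cells_needed(my_range, interaction_radius, domain_size):
    """Given this process's sub-domain [lo, hi) and a stencil interaction
    radius, return the remote element indices it must receive."""
    lo, hi = my_range
    left = list(range(max(0, lo - interaction_radius), lo))
    right = list(range(hi, min(domain_size, hi + interaction_radius)))
    return left + right

# A radius-1 stencil on a domain of 12 elements, sub-domain [4, 8):
# the runtime can infer that indices 3 and 8 must be communicated in.
assert ghost_cells_needed((4, 8), 1, 12) == [3, 8]
```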
CYPRESS: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression
Authors: Jidong Zhai, Jianfei Hu, Xiongchao Tang, Xiaosong Ma, Wenguang Chen
DOI: 10.1109/SC.2014.17
Abstract: Communication traces are increasingly important, both for performance analysis and optimization of parallel applications and for designing next-generation HPC systems. Meanwhile, problem sizes and execution scales on supercomputers keep growing, producing prohibitive volumes of communication traces. Existing dynamic compression methods reduce trace size, but their compression overhead grows large with job scale. We propose a hybrid static-dynamic method that leverages information acquired through static analysis to make dynamic trace compression more effective and efficient. Our scheme, Cypress, extracts a program communication-structure tree at compile time using inter-procedural analysis. This tree naturally captures crucial iterative features such as loop structure, allowing subsequent runtime compression to "fill in" event details into the known communication template in a top-down manner. Results show that Cypress reduces intra-process and inter-process compression overhead by up to 5x and 9x, respectively, over state-of-the-art dynamic methods, while introducing only low compile-time overhead.
Citations: 23
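The "top-down" idea — a loop template known from static analysis into which runtime compression merely fills per-iteration details — can be sketched in miniature. This is an illustrative toy, not Cypress's actual implementation:

```python
# Hypothetical sketch: if static analysis tells us the loop body is
# body_len events long, the runtime can store one event-kind template
# plus per-iteration parameters instead of the full flat event stream.

def compress_with_template(events, body_len):
    """Fold a flat (op, arg) event stream into (template, params),
    given the statically known loop-body length."""
    assert len(events) % body_len == 0
    iters = [events[i:i + body_len] for i in range(0, len(events), body_len)]
    template = [op for op, _ in iters[0]]            # event kinds repeat each iteration
    params = [[arg for _, arg in it] for it in iters]  # only varying args kept per iteration
    return template, params

# A toy trace: 3 iterations of a Send/Recv loop with varying peers.
trace = [("Send", 1), ("Recv", 1), ("Send", 2), ("Recv", 2), ("Send", 3), ("Recv", 3)]
template, params = compress_with_template(trace, body_len=2)
# template == ["Send", "Recv"]; params == [[1, 1], [2, 2], [3, 3]]
```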
Fail-in-Place Network Design: Interaction Between Topology, Routing Algorithm and Failures
Authors: Jens Domke, T. Hoefler, S. Matsuoka
DOI: 10.1109/SC.2014.54
Abstract: The growing system size of high-performance computers results in a steady decrease of the mean time between failures. Exchanging network components often requires whole-system downtime, which increases the cost of failures. In this work, we study a fail-in-place strategy in which broken network elements remain untouched. We show that a fail-in-place strategy is feasible for today's networks, that the degradation is manageable, and we provide design guidelines. Our network failure simulation tool chain allows system designers to extrapolate performance degradation from expected failure rates, and it can be used to evaluate the current state of a system. In a case study of real-world HPC systems, we analyze the performance degradation over a system's lifetime under the assumption that faulty network components are not repaired, which results in a recommendation to change the routing algorithm in use to improve both network performance and the fail-in-place characteristics.
Citations: 31
Compiler Techniques for Massively Scalable Implicit Task Parallelism
Authors: Timothy G. Armstrong, J. Wozniak, M. Wilde, Ian T. Foster
DOI: 10.1109/SC.2014.30
Abstract: Swift/T is a high-level language for writing concise, deterministic scripts that compose serial or parallel codes implemented in lower-level programming models into large-scale parallel applications. It executes using a data-driven task-parallel execution model that is capable of orchestrating millions of concurrently executing asynchronous tasks on homogeneous or heterogeneous resources. Producing code that executes efficiently at this scale requires sophisticated compiler transformations: poorly optimized code inhibits scaling through excessive synchronization and communication. We present a comprehensive set of compiler techniques for data-driven task parallelism, including novel compiler optimizations and intermediate representations. We report application benchmark studies, including unbalanced tree search and simulated annealing, and demonstrate that our techniques greatly reduce communication overhead and enable extreme scalability, distributing up to 612 million dynamically load-balanced tasks per second at scales of up to 262,144 cores without explicit parallelism, synchronization, or load balancing in application code.
Citations: 50
Nonblocking Epochs in MPI One-Sided Communication
Authors: Judicael A. Zounmevo, Xin Zhao, P. Balaji, W. Gropp, A. Afsahi
DOI: 10.1109/SC.2014.44
Abstract: The synchronization model of the MPI one-sided communication paradigm can lead to serialization and latency propagation. For instance, a process can propagate non-RMA communication-related latencies to remote peers waiting in their respective epoch-closing routines in matching epochs. In this work, we discuss six latency issues that were documented for MPI-2.0 and show how they evolved in MPI-3.0. Then, we propose entirely nonblocking RMA synchronizations that allow processes to avoid waiting even in epoch-closing routines. The proposal provides contention avoidance in communication patterns that require back-to-back RMA epochs. It also fixes the latency-propagation issues. Moreover, it allows the MPI progress engine to orchestrate aggressive schedules that cut down the overall completion time of sets of epochs without introducing memory-consistency hazards. Our test results show noticeable performance improvements for a lower-upper matrix decomposition as well as an application pattern that performs massive atomic updates.
Citations: 6
A Computation- and Communication-Optimal Parallel Direct 3-Body Algorithm
Authors: Penporn Koanantakool, K. Yelick
DOI: 10.1109/SC.2014.35
Abstract: Traditional particle simulation methods calculate pairwise potentials, but some problems require 3-body potentials computed over triplets of particles. A direct calculation of 3-body interactions involves O(n³) interactions and, in a nested-loop formulation, significant redundant computation. In this paper we explore algorithms for 3-body computations that simultaneously optimize three criteria: computation minimization through symmetries, communication optimality, and load balancing. We present a new 3-body algorithm that is both communication- and computation-optimal. Its optional replication factor c saves a factor of c³ in latency (number of messages) and c² in bandwidth (volume), with bounded load imbalance. We also consider the k-body case and discuss an algorithm that is optimal if there is a cutoff distance of less than 1/3 of the domain. The 3-body algorithm demonstrates 99% efficiency on tens of thousands of cores, showing strong-scaling properties with order-of-magnitude speedups over the naïve algorithm.
Citations: 11
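The redundancy that symmetry exploitation removes is easy to see in a toy count: a naive triple loop visits every ordered triplet of distinct particles, roughly six times the number of unordered triplets. This illustrates only the counting argument, not the paper's communication-optimal algorithm:

```python
from itertools import combinations

def triplet_count_naive(n):
    """Ordered triplets of distinct particles, as a naive triple loop
    would visit them: n*(n-1)*(n-2) iterations."""
    return sum(1 for i in range(n) for j in range(n) for k in range(n)
               if i != j and j != k and i != k)

def triplet_count_symmetric(n):
    """Each unordered triplet {i, j, k} visited once via i < j < k:
    n*(n-1)*(n-2)/6 iterations."""
    return sum(1 for _ in combinations(range(n), 3))

# The symmetric enumeration does exactly 6x less work.
assert triplet_count_naive(8) == 6 * triplet_count_symmetric(8)  # 336 == 6 * 56
```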
Two-Choice Randomized Dynamic I/O Scheduler for Object Storage Systems
Authors: Dong Dai, Yong Chen, D. Kimpe, R. Ross
DOI: 10.1109/SC.2014.57
Abstract: Object storage is considered a promising solution for next-generation (exascale) high-performance computing platforms because of its flexible and high-performance object interface. However, delivering high burst-write throughput remains a critical challenge. Although deploying more storage servers can potentially provide higher throughput, doing so can be ineffective because burst-write throughput can be limited by a small number of stragglers (storage servers that are occasionally slower than others). In this paper, we propose a two-choice randomized dynamic I/O scheduler that schedules concurrent burst-write operations in a balanced way to avoid stragglers and hence achieve high throughput. The contributions of this study are threefold. First, we propose a two-choice randomized dynamic I/O scheduler with collaborative probe and preassign strategies. Second, we design and implement a redirect table and metadata maintainer to address the metadata-management challenge introduced by dynamic I/O scheduling. Third, we evaluate the proposed scheduler with both simulation tests and experiments on an HPC cluster. The evaluation results confirm the scalability and performance benefits of the proposed I/O scheduler.
Citations: 18
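The scheduler builds on the classic "power of two choices" principle: probing two randomly chosen servers and sending the request to the less-loaded one keeps load far more balanced than a single random choice. A minimal sketch of that underlying principle (not the paper's collaborative probe/preassign protocol):

```python
import random

def assign_two_choice(loads, rng):
    """Probe two distinct randomly chosen servers and send the request
    to the less-loaded one (the 'power of two choices' rule)."""
    a, b = rng.sample(range(len(loads)), 2)
    target = a if loads[a] <= loads[b] else b
    loads[target] += 1
    return target

# Compare purely random placement against two-choice placement:
# 10,000 requests over 100 servers.
rng = random.Random(0)
single = [0] * 100
double = [0] * 100
for _ in range(10_000):
    single[rng.randrange(100)] += 1
    assign_two_choice(double, rng)
# With two choices the maximum server load stays much closer to the
# mean load of 100 than with single random placement.
```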
The DRIHM Project: A Flexible Approach to Integrate HPC, Grid and Cloud Resources for Hydro-Meteorological Research
Authors: D. D'Agostino, A. Clematis, A. Galizia, A. Quarati, E. Danovaro, Luca Roverelli, Gabriele Zereik, D. Kranzlmüller, Michael Schiffers, N. Felde, Christian Straube, Olivier Caumont, E. Richard, L. Garrote, Quillon Harpham, H.R.A. Jagers, V. Dimitrijevic, L. Dekic, Elisabetta Fiori, F. Delogu, A. Parodi
DOI: 10.1109/SC.2014.49
Abstract: The Distributed Research Infrastructure for Hydro-Meteorology (DRIHM) project focuses on the development of an e-Science infrastructure to provide end-to-end hydro-meteorological research (HMR) services (models, data, and post-processing tools) by exploiting HPC, Grid, and Cloud facilities. In particular, the DRIHM infrastructure supports the execution and analysis of high-resolution simulations through workflows composed of heterogeneous HMR models in a scalable and interoperable way, while hiding the low-level complexities. This contribution gives insight into the best practices adopted to satisfy the requirements of an emerging multidisciplinary scientific community of earth and atmospheric scientists. To this end, DRIHM supplies innovative services leveraging high-performance and distributed computing resources. Hydro-meteorological requirements shape this IT infrastructure through an iterative "learning-by-doing" approach that permits tight interaction between the application community and computer scientists, leading to a flexible, extensible, and interoperable framework.
Citations: 23
Fast Parallel Computation of Longest Common Prefixes
Author: Julian Shun
DOI: 10.1109/SC.2014.37
Abstract: Suffix arrays and the corresponding longest common prefix (LCP) array have wide applications in bioinformatics, information retrieval, and data compression. In this work, we propose and theoretically analyze new parallel algorithms for computing the LCP array given the suffix array as input. Most of our algorithms have a work and depth (parallel time) complexity related to the LCP values of the input. We also present a slight variation of Kärkkäinen and Sanders' skew algorithm that requires linear work and poly-logarithmic depth in the worst case. We present a comprehensive experimental study of our parallel algorithms along with existing parallel and sequential LCP algorithms. On a variety of real-world and artificial strings, we show that on a 40-core shared-memory machine our fastest algorithm is up to 2.3 times faster than the fastest existing parallel algorithm, and up to 21.8 times faster than the fastest sequential LCP algorithm.
Citations: 21
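For context, the standard sequential baseline for computing the LCP array from a suffix array is the linear-time scan of Kasai et al. (2001); a compact version is shown below (the paper's parallel algorithms are not reproduced here):

```python
def lcp_from_suffix_array(s, sa):
    """Kasai et al.'s linear-time LCP construction: lcp[i] is the length of
    the longest common prefix of the suffixes at sa[i-1] and sa[i]."""
    n = len(s)
    rank = [0] * n
    for i, suf in enumerate(sa):
        rank[suf] = i
    lcp = [0] * n
    h = 0  # invariant: lcp with the predecessor suffix shrinks by at most 1
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp

# "banana": suffixes in sorted order are a, ana, anana, banana, na, nana,
# so the suffix array is [5, 3, 1, 0, 4, 2] and the LCP array follows.
assert lcp_from_suffix_array("banana", [5, 3, 1, 0, 4, 2]) == [0, 1, 3, 0, 0, 2]
```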
Lattice QCD with Domain Decomposition on Intel® Xeon Phi Co-Processors
Authors: S. Heybrock, B. Joó, Dhiraj D. Kalamkar, M. Smelyanskiy, K. Vaidyanathan, T. Wettig, P. Dubey
DOI: 10.1109/SC.2014.11
Abstract: The gap between the cost of moving data and the cost of computing continues to grow, making it ever harder to design iterative solvers for extreme-scale architectures. This problem can be alleviated by alternative algorithms that reduce the amount of data movement. We investigate this in the context of Lattice Quantum Chromodynamics and implement such an alternative solver algorithm, based on domain decomposition, on Intel® Xeon Phi co-processor (KNC) clusters. We demonstrate close-to-linear on-chip scaling to all 60 cores of the KNC. With a mix of single and half precision, the domain-decomposition method sustains 400-500 Gflop/s per chip. Compared to an optimized KNC implementation of a standard solver [1], our full multi-node domain-decomposition solver strong-scales to more nodes and reduces the time-to-solution by a factor of 5.
Citations: 36