Latest publications from SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer
D. E. Shaw, J. P. Grossman, Joseph A. Bank, Brannon Batson, J. A. Butts, Jack C. Chao, Martin M. Deneroff, R. Dror, Amos Even, Christopher H. Fenton, Anthony Forte, Joseph Gagliardo, Gennette Gill, Brian Greskamp, R. Ho, D. Ierardi, Lev Iserovich, J. Kuskin, Richard H. Larson, T. Layman, L. Lee, Adam K. Lerer, Chester Li, Daniel Killebrew, Kenneth M. Mackenzie, Shark Yeuk-Hai Mok, Mark A. Moraes, Rolf Mueller, Lawrence J. Nociolo, Jon L. Peticolas, Terry Quan, D. Ramot, J. Salmon, D. Scarpazza, U. Schafer, Naseer Siddique, Christopher W. Snyder, Jochen Spengler, P. T. P. Tang, Michael Theobald, Horia Toma, Brian Towles, B. Vitale, Stanley C. Wang, C. Young
{"title":"Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer","authors":"D. E. Shaw, J. P. Grossman, Joseph A. Bank, Brannon Batson, J. A. Butts, Jack C. Chao, Martin M. Deneroff, R. Dror, Amos Even, Christopher H. Fenton, Anthony Forte, Joseph Gagliardo, Gennette Gill, Brian Greskamp, R. Ho, D. Ierardi, Lev Iserovich, J. Kuskin, Richard H. Larson, T. Layman, L. Lee, Adam K. Lerer, Chester Li, Daniel Killebrew, Kenneth M. Mackenzie, Shark Yeuk-Hai Mok, Mark A. Moraes, Rolf Mueller, Lawrence J. Nociolo, Jon L. Peticolas, Terry Quan, D. Ramot, J. Salmon, D. Scarpazza, U. Schafer, Naseer Siddique, Christopher W. Snyder, Jochen Spengler, P. T. P. Tang, Michael Theobald, Horia Toma, Brian Towles, B. Vitale, Stanley C. Wang, C. Young","doi":"10.1109/SC.2014.9","DOIUrl":"https://doi.org/10.1109/SC.2014.9","url":null,"abstract":"Anton 2 is a second-generation special-purpose supercomputer for molecular dynamics simulations that achieves significant gains in performance, programmability, and capacity compared to its predecessor, Anton 1. The architecture of Anton 2 is tailored for fine-grained event-driven operation, which improves performance by increasing the overlap of computation with communication, and also allows a wider range of algorithms to run efficiently, enabling many new software-based optimizations. A 512-node Anton 2 machine, currently in operation, is up to ten times faster than Anton 1 with the same number of nodes, greatly expanding the reach of all-atom bio molecular simulations. Anton 2 is the first platform to achieve simulation rates of multiple microseconds of physical time per day for systems with millions of atoms. Demonstrating strong scaling, the machine simulates a standard 23,558-atom benchmark system at a rate of 85 μs/day -- 180 times faster than any commodity hardware platform or general-purpose supercomputer.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125620113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 455
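The headline figures above translate into an easy-to-picture rate; a minimal back-of-the-envelope conversion in Python, using only the numbers stated in the abstract (85 μs/day and the 180x speedup claim):

```python
# Unit conversion of the abstract's stated figures only; no assumptions about
# timestep, node count, or system size beyond what the abstract reports.
SECONDS_PER_DAY = 86_400
anton2_us_per_day = 85.0                         # 23,558-atom benchmark on Anton 2
anton2_ns_per_wallclock_s = anton2_us_per_day * 1_000 / SECONDS_PER_DAY
commodity_us_per_day = anton2_us_per_day / 180   # inverting the "180 times faster" claim
print(f"Anton 2: ~{anton2_ns_per_wallclock_s:.2f} ns of physical time per wall-clock second")
print(f"Fastest commodity/general-purpose platform: ~{commodity_us_per_day:.2f} us/day")
```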
Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications
Arash Ashari, N. Sedaghati, John Eisenlohr, S. Parthasarathy, P. Sadayappan
{"title":"Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications","authors":"Arash Ashari, N. Sedaghati, John Eisenlohr, S. Parthasarathy, P. Sadayappan","doi":"10.1109/SC.2014.69","DOIUrl":"https://doi.org/10.1109/SC.2014.69","url":null,"abstract":"Sparse matrix-vector multiplication (SpMV) is a widely used computational kernel. The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. However, the alternative representations typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized for applications requiring many iterative invocations of SpMV that use the same matrix, it is not always feasible -- for instance when analyzing large dynamically evolving graphs. This paper presents ACSR, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups (bins) which have a similar number of non-zero elements. Further, for rows in bins that span a wide range of non zero counts, dynamic parallelism is leveraged. A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads. A CUDA implementation of ACSR is shown to outperform SpMV implementations in the NVIDIA CUSP and cuSPARSE libraries on a set of sparse matrices representing power-law graphs. We also demonstrate the use of ACSR for the analysis of dynamic graphs, where the improvement over extant approaches is even higher.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115404468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 135
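A minimal sketch of the row-binning idea behind ACSR, written in plain Python with SciPy rather than CUDA; the bin edges and the `bin_rows_by_nnz`/`spmv_binned` helpers are illustrative names and choices, not part of the published implementation, and the dynamic-parallelism path for very wide rows is not modeled.

```python
# Sketch: group CSR rows into bins by non-zero count, then process each bin
# together (on a GPU each bin would map to a kernel launch with a uniform
# work assignment, which reduces thread divergence).
import numpy as np
from scipy.sparse import random as sparse_random

def bin_rows_by_nnz(csr, bin_edges=(0, 4, 16, 64, np.inf)):
    """Return a list of row-index arrays, one per bin of similar nnz."""
    nnz_per_row = np.diff(csr.indptr)
    return [np.where((nnz_per_row >= lo) & (nnz_per_row < hi))[0]
            for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

def spmv_binned(csr, x):
    """Reference SpMV that visits rows bin by bin, with no matrix conversion."""
    y = np.zeros(csr.shape[0])
    for rows in bin_rows_by_nnz(csr):
        for r in rows:                      # on a GPU: one thread/warp per row
            lo, hi = csr.indptr[r], csr.indptr[r + 1]
            y[r] = csr.data[lo:hi] @ x[csr.indices[lo:hi]]
    return y

A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
x = np.ones(1000)
assert np.allclose(spmv_binned(A, x), A @ x)
```

Because the grouping needs only a pass over `indptr`, the preprocessing cost stays negligible, which is the property the abstract emphasizes over alternative formats.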
Slim Fly: A Cost Effective Low-Diameter Network Topology
Maciej Besta, T. Hoefler
{"title":"Slim Fly: A Cost Effective Low-Diameter Network Topology","authors":"Maciej Besta, T. Hoefler","doi":"10.1109/SC.2014.34","DOIUrl":"https://doi.org/10.1109/SC.2014.34","url":null,"abstract":"We introduce a high-performance cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. Slim Fly is based on graphs that approximate the solution to the degree-diameter problem. We analyze Slim Fly and compare it to both traditional and state-of the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centres as well as a detailed cost and power model. Slim Fly enables constructing cost effective and highly resilient data enter and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128830911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 243
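The degree-diameter connection mentioned in the abstract can be made concrete with a small NetworkX check; the Petersen graph below is a stand-in Moore graph (degree 3, diameter 2), not the larger graph family Slim Fly actually builds on, and `moore_bound` is a helper defined here purely for illustration.

```python
# Sketch: check how close a router-level graph comes to the Moore bound,
# the degree-diameter limit that Slim Fly's construction approaches.
import networkx as nx

def moore_bound(degree, diameter):
    """Maximum number of vertices for a graph of the given degree and diameter."""
    return 1 + degree * sum((degree - 1) ** i for i in range(diameter))

g = nx.petersen_graph()                 # a true Moore graph: degree 3, diameter 2
d = max(deg for _, deg in g.degree())
diam = nx.diameter(g)
print(f"nodes={g.number_of_nodes()} degree={d} diameter={diam} "
      f"Moore bound={moore_bound(d, diam)}")
# A Moore graph meets the bound exactly: 10 == 1 + 3*(1 + 2).
```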
Parallel Bayesian Network Structure Learning for Genome-Scale Gene Networks
Sanchit Misra, Md. Vasimuddin, K. Pamnany, Sriram P. Chockalingam, Yong Dong, Min Xie, M. Aluru, S. Aluru
{"title":"Parallel Bayesian Network Structure Learning for Genome-Scale Gene Networks","authors":"Sanchit Misra, Md. Vasimuddin, K. Pamnany, Sriram P. Chockalingam, Yong Dong, Min Xie, M. Aluru, S. Aluru","doi":"10.1109/SC.2014.43","DOIUrl":"https://doi.org/10.1109/SC.2014.43","url":null,"abstract":"Learning Bayesian networks is NP-hard. Even with recent progress in heuristic and parallel algorithms, modeling capabilities still fall short of the scale of the problems encountered. In this paper, we present a massively parallel method for Bayesian network structure learning, and demonstrate its capability by constructing genome-scale gene networks of the model plant Arabidopsis thaliana from over 168.5 million gene expression values. We report strong scaling efficiency of 75% and demonstrate scaling to 1.57 million cores of the Tianhe-2 supercomputer. Our results constitute three and five orders of magnitude increase over previously published results in the scale of data analyzed and computations performed, respectively. We achieve this through algorithmic innovations, using efficient techniques to distribute work across all compute nodes, all available processors and coprocessors on each node, all available threads on each processor and coprocessor, and vectorization techniques to maximize single thread performance.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"195 S556","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132905261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
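A toy sketch of the work decomposition implied by the abstract: scoring candidate parent sets is embarrassingly parallel, so the candidate list can be partitioned statically across ranks and the best family per variable combined afterwards. The synthetic data, the two-parent restriction, and the `family_log_likelihood` scorer are illustrative assumptions, not the paper's algorithm or scoring function.

```python
# Sketch: candidate parent-set evaluations are independent, so each rank
# scores a round-robin share of the candidates for a given variable.
from itertools import combinations
import numpy as np

def family_log_likelihood(data, child, parents):
    """Maximum-likelihood log-likelihood of a discrete child given a parent set."""
    cols = data[:, list(parents) + [child]]
    joint_cfg, joint_n = np.unique(cols, axis=0, return_counts=True)
    parent_cfg, parent_n = np.unique(cols[:, :-1], axis=0, return_counts=True)
    parent_count = {tuple(c): n for c, n in zip(parent_cfg, parent_n)}
    return sum(n * np.log(n / parent_count[tuple(c[:-1])])
               for c, n in zip(joint_cfg, joint_n))

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(500, 6))           # 500 samples, 6 binary "genes"
child = 5
candidates = list(combinations(range(5), 2))       # all 2-parent candidate sets
rank, n_ranks = 0, 4                               # pretend we are rank 0 of 4
my_share = candidates[rank::n_ranks]               # static round-robin partition
best = max(my_share, key=lambda ps: family_log_likelihood(data, child, ps))
print(f"rank {rank}: best parent set for gene {child} is {best}")
```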
Understanding the Effects of Communication and Coordination on Checkpointing at Scale
Kurt B. Ferreira, Patrick M. Widener, Scott Levy, D. Arnold, T. Hoefler
{"title":"Understanding the Effects of Communication and Coordination on Checkpointing at Scale","authors":"Kurt B. Ferreira, Patrick M. Widener, Scott Levy, D. Arnold, T. Hoefler","doi":"10.1109/SC.2014.77","DOIUrl":"https://doi.org/10.1109/SC.2014.77","url":null,"abstract":"Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid check pointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated check pointing has focused on optimizing message log volumes, local check pointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated check pointing protocols designed to reduce log volumes to significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated check pointing and enables users and system administrators to fine-tune the check pointing scheme to the application and system characteristics.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132020395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
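A toy, discrete-round model of the cascading-delay effect the abstract describes; all costs, probabilities, and the ring communication pattern are invented for illustration and are unrelated to the paper's simulator.

```python
# Sketch: each rank does fixed-cost compute per round, occasionally pauses for
# an uncoordinated local checkpoint, then must wait for its ring neighbour's
# message, so one rank's delay propagates to others round after round.
import random

def simulate(n_ranks=64, rounds=1000, compute=1.0, ckpt_cost=0.5, ckpt_prob=0.01, seed=0):
    rng = random.Random(seed)
    t = [0.0] * n_ranks                     # local clock of each rank
    for _ in range(rounds):
        # local work, plus an occasional uncoordinated checkpoint
        t = [ti + compute + (ckpt_cost if rng.random() < ckpt_prob else 0.0)
             for ti in t]
        # receive from the left neighbour in a ring: cannot proceed before the
        # sender has finished its round, so delays couple neighbouring ranks
        t = [max(t[i], t[(i - 1) % n_ranks]) for i in range(n_ranks)]
    return max(t)

ideal = 1000 * 1.0                          # no checkpoints, no waiting
print(f"makespan with local checkpoints: {simulate():.1f} (ideal {ideal:.1f})")
```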
MC-Checker: Detecting Memory Consistency Errors in MPI One-Sided Applications
Zhezhe Chen, James Dinan, Zhen Tang, P. Balaji, Hua Zhong, Jun Wei, Tao Huang, Feng Qin
{"title":"MC-Checker: Detecting Memory Consistency Errors in MPI One-Sided Applications","authors":"Zhezhe Chen, James Dinan, Zhen Tang, P. Balaji, Hua Zhong, Jun Wei, Tao Huang, Feng Qin","doi":"10.1109/SC.2014.46","DOIUrl":"https://doi.org/10.1109/SC.2014.46","url":null,"abstract":"One-sided communication decouples data movement and synchronization by providing support for asynchronous reads and updates of distributed shared data. While such interfaces can be extremely efficient, they also impose challenges in properly performing asynchronous accesses to shared data. This paper presents MC-Checker, a new tool that detects memory consistency errors in MPI one-sided applications. MCChecker first performs online instrumentation and captures relevant dynamic events, such as one-sided communications and load/store operations. MC-Checker then performs analysis to detect memory consistency errors. When found, errors are reported along with useful diagnostic information. Experiments indicate that MC-Checker is effective at detecting and diagnosing memory consistency bugs in MPI one-sided applications, with low overhead, ranging from 24.6% to 71.1%, with an average of 45.2%.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132992324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
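The essence of the dynamic analysis can be pictured as a conflict check over accesses recorded within one synchronization epoch; the `Access` record and the epoch handling below are simplified illustrations, not MC-Checker's actual instrumentation or data structures.

```python
# Sketch: within one epoch, flag pairs of accesses to a window that touch
# overlapping bytes, where at least one access is a write and they come from
# different sides (a remote one-sided op vs. a local load/store).
from dataclasses import dataclass

@dataclass
class Access:
    origin: str      # "remote" (one-sided op) or "local" (load/store)
    kind: str        # "read" or "write"
    lo: int          # first byte offset in the window
    hi: int          # one past the last byte

def conflicts(epoch_accesses):
    """Return pairs of unsynchronized, conflicting accesses within an epoch."""
    found = []
    for i, a in enumerate(epoch_accesses):
        for b in epoch_accesses[i + 1:]:
            overlap = a.lo < b.hi and b.lo < a.hi
            has_write = "write" in (a.kind, b.kind)
            different_sides = a.origin != b.origin
            if overlap and has_write and different_sides:
                found.append((a, b))
    return found

epoch = [Access("remote", "write", 0, 8),    # e.g. an MPI_Put into bytes [0, 8)
         Access("local",  "read",  4, 12),   # local load of bytes [4, 12)
         Access("local",  "write", 32, 40)]  # non-overlapping local store
for a, b in conflicts(epoch):
    print("memory consistency error:", a, "conflicts with", b)
```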
Microbank: Architecting Through-Silicon Interposer-Based Main Memory Systems
Y. Son, O. Seongil, Hyunggyun Yang, Daejin Jung, Jung Ho Ahn, John Kim, Jangwoo Kim, Jae W. Lee
{"title":"Microbank: Architecting Through-Silicon Interposer-Based Main Memory Systems","authors":"Y. Son, O. Seongil, Hyunggyun Yang, Daejin Jung, Jung Ho Ahn, John Kim, Jangwoo Kim, Jae W. Lee","doi":"10.1109/SC.2014.91","DOIUrl":"https://doi.org/10.1109/SC.2014.91","url":null,"abstract":"Through-Silicon Interposer (TSI) has recently been proposed to provide high memory bandwidth and improve energy efficiency of the main memory system. However, the impact of TSI on main memory system architecture has not been well explored. While TSI improves the I/O energy efficiency, we show that it results in an unbalanced memory system design in terms of energy efficiency as the core DRAM dominates overall energy consumption. To balance and enhance the energy efficiency of a TSI-based memory system, we propose μbank, a novel DRAM device organization in which each bank is partitioned into multiple smaller banks (or μbanks) that operate independently like conventional banks with minimal area overhead. The μbank organization significantly increases the amount of bank-level parallelism to improve the performance and energy efficiency of the TSI-based memory system. The massive number of μbanks reduces bank conflicts, hence simplifying the memory system design. We evaluated a sophisticated prediction-based DRAM page-management policy, which can improve performance by up to 20.5% in a conventional memory system without μbanks. However, a μbank-based design does not require such a complex page-management policy and a simple open-page policy is often sufficient -- achieving within 5% of a perfect predictor. Our proposed μbank-based memory system improves the IPC and system energy-delay product by 1.62× and 4.80×, respectively, for memory-intensive SPEC 2006 benchmarks on average, over the baseline DDR3-based memory system.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"5 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113961714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
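A small Monte Carlo sketch of why partitioning each bank into independently operating μbanks reduces bank conflicts: with more independent banks, concurrent requests are less likely to collide. The request counts and bank counts below are illustrative, not taken from the paper's evaluation.

```python
# Sketch: probability that any two of R outstanding random requests map to
# the same (micro)bank, as the number of independent banks grows.
import random

def conflict_rate(n_banks, n_requests=16, trials=20000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        banks = [rng.randrange(n_banks) for _ in range(n_requests)]
        hits += len(banks) != len(set(banks))   # some pair shares a bank
    return hits / trials

for banks in (8, 64, 512):    # e.g. 8 conventional banks vs. 64x more microbanks
    print(f"{banks:4d} banks: conflict probability {conflict_rate(banks):.3f}")
```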
A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters
M. Noack, Florian Wende, T. Steinke, F. Cordes
{"title":"A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters","authors":"M. Noack, Florian Wende, T. Steinke, F. Cordes","doi":"10.1109/SC.2014.22","DOIUrl":"https://doi.org/10.1109/SC.2014.22","url":null,"abstract":"Standard offload programming models for the Xeon Phi, e.g. Intel LEO and OpenMP 4.0, are restricted to a single compute node and hence a limited number of coprocessors. Scaling applications across a Xeon Phi cluster/supercomputer thus requires hybrid programming approaches, usually MPI+X. In this work, we present a framework based on heterogeneous active messages (HAM-Offload) that provides the means to offload work to local and remote (co)processors using a unified offload API. Since HAM-Offload provides similar primitives as current local offload frameworks, existing applications can be easily ported to overcome the single-node limitation while keeping the convenient offload programming model. We demonstrate the effectiveness of the framework by using it to enable a real-world application from the field of molecular dynamics to use multiple local and remote Xeon Phis. The evaluation shows good scaling behavior. Compared with LEO, performance is equal for large offloads and significantly better for small offloads.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123865643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
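A toy illustration of the "same call, different target" idea: one offload interface dispatching work to either of two executors. Both targets here live in a single Python process; the real framework uses heterogeneous active messages over MPI for the remote case, which this sketch does not model, and the `Offload` class and target names are invented for illustration.

```python
# Sketch: a unified offload call whose target ("local" or "remote") is just a
# parameter, so the call site does not change when work moves off-node.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def kernel(n):
    """Some offloadable work: sum of squares up to n."""
    return sum(i * i for i in range(n))

class Offload:
    def __init__(self):
        self.targets = {"local_phi": ThreadPoolExecutor(max_workers=4),
                        "remote_phi": ProcessPoolExecutor(max_workers=4)}

    def async_call(self, target, fn, *args):
        """Same call syntax regardless of where the work runs."""
        return self.targets[target].submit(fn, *args)

if __name__ == "__main__":
    off = Offload()
    futures = [off.async_call(t, kernel, 200_000) for t in ("local_phi", "remote_phi")]
    print([f.result() for f in futures])
    for ex in off.targets.values():
        ex.shutdown()
```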
Scaling MapReduce Vertically and Horizontally
I. El-Helw, Rutger F. H. Hofman, H. Bal
{"title":"Scaling MapReduce Vertically and Horizontally","authors":"I. El-Helw, Rutger F. H. Hofman, H. Bal","doi":"10.1109/SC.2014.48","DOIUrl":"https://doi.org/10.1109/SC.2014.48","url":null,"abstract":"Glass wing is a MapReduce framework that uses OpenCL to exploit multi-core CPUs and accelerators. However, compute device capabilities may vary significantly and require targeted optimization. Similarly, the availability of resources such as memory, storage and interconnects can severely impact overall job performance. In this paper, we present and analyze how MapReduce applications can improve their horizontal and vertical scalability by using a well controlled mixture of coarse- and fine-grained parallelism. Specifically, we discuss the Glass wing pipeline and its ability to overlap computation, communication, memory transfers and disk access. Additionally, we show how Glass wing can adapt to the distinct capabilities of a variety of compute devices by employing fine-grained parallelism. We experimentally evaluated the performance of five MapReduce applications and show that Glass wing outperforms Hadoop on a 64-node multi-core CPU cluster by factors between 1.2 and 4, and factors from 20 to 30 on a 23-node GPU cluster. Similarly, we show that Glass wing is at least 1.5 times faster than GPMR on the GPU cluster.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129018709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
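A minimal word-count sketch of the coarse-grained half of the design, with chunks mapped in parallel and then reduced; Glasswing's actual contribution (overlapping computation, communication, memory transfers and disk access, and OpenCL-based fine-grained parallelism inside each chunk) is not modeled by this pure-Python toy.

```python
# Sketch: coarse-grained MapReduce, where input chunks are mapped on a worker
# pool and the partial results are merged in a reduce phase.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map phase: count words in one chunk of input."""
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

def wordcount(lines, n_workers=4, chunk_size=2):
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(n_workers) as pool:
        partials = pool.map(map_chunk, chunks)     # coarse-grained parallelism
    result = Counter()
    for p in partials:                             # reduce phase
        result.update(p)
    return result

if __name__ == "__main__":
    text = ["the quick brown fox", "jumps over the lazy dog",
            "the dog barks", "the fox runs"]
    print(wordcount(text).most_common(3))
```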
High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation
T. Peterka, D. Morozov, C. L. Phillips
{"title":"High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation","authors":"T. Peterka, D. Morozov, C. L. Phillips","doi":"10.1109/SC.2014.86","DOIUrl":"https://doi.org/10.1109/SC.2014.86","url":null,"abstract":"Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization, but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the sub domains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129292329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
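One way to picture the neighbor-point decision described in the abstract: after a purely local Delaunay pass, any cell whose circumscribed ball reaches outside the owned block might still change once remote points arrive, so its points must be exchanged with neighboring subdomains. The 2D SciPy sketch below illustrates that test under those stated assumptions; it is not the paper's distributed algorithm, and the helper names are invented here.

```python
# Sketch: mark local points whose Delaunay cells have circumcircles that poke
# outside the owned block; those cells are not yet final and their points
# would need to be exchanged with neighbouring subdomains.
import numpy as np
from scipy.spatial import Delaunay

def circumcircle(a, b, c):
    """Circumcenter and circumradius of the triangle (a, b, c) in 2D."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax*ax + ay*ay) * (by - cy) + (bx*bx + by*by) * (cy - ay)
          + (cx*cx + cy*cy) * (ay - by)) / d
    uy = ((ax*ax + ay*ay) * (cx - bx) + (bx*bx + by*by) * (ax - cx)
          + (cx*cx + cy*cy) * (bx - ax)) / d
    center = np.array([ux, uy])
    return center, float(np.linalg.norm(center - np.asarray(a)))

def points_to_exchange(points, block_min, block_max):
    """Indices of local points whose Delaunay cells may depend on remote points."""
    tri = Delaunay(points)
    unresolved = set()
    for simplex in tri.simplices:
        center, r = circumcircle(*points[simplex])
        if np.any(center - r < block_min) or np.any(center + r > block_max):
            unresolved.update(simplex.tolist())
    return sorted(unresolved)

rng = np.random.default_rng(2)
local = rng.uniform(0.0, 1.0, size=(200, 2))       # points owned by this rank
ghosts = points_to_exchange(local, np.array([0.0, 0.0]), np.array([1.0, 1.0]))
print(f"{len(ghosts)} of {len(local)} local points must be sent to neighbours")
```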