D. E. Shaw, J. P. Grossman, Joseph A. Bank, Brannon Batson, J. A. Butts, Jack C. Chao, Martin M. Deneroff, R. Dror, Amos Even, Christopher H. Fenton, Anthony Forte, Joseph Gagliardo, Gennette Gill, Brian Greskamp, R. Ho, D. Ierardi, Lev Iserovich, J. Kuskin, Richard H. Larson, T. Layman, L. Lee, Adam K. Lerer, Chester Li, Daniel Killebrew, Kenneth M. Mackenzie, Shark Yeuk-Hai Mok, Mark A. Moraes, Rolf Mueller, Lawrence J. Nociolo, Jon L. Peticolas, Terry Quan, D. Ramot, J. Salmon, D. Scarpazza, U. Schafer, Naseer Siddique, Christopher W. Snyder, Jochen Spengler, P. T. P. Tang, Michael Theobald, Horia Toma, Brian Towles, B. Vitale, Stanley C. Wang, C. Young. "Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer." SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, November 16, 2014. DOI: https://doi.org/10.1109/SC.2014.9
Abstract: Anton 2 is a second-generation special-purpose supercomputer for molecular dynamics simulations that achieves significant gains in performance, programmability, and capacity compared to its predecessor, Anton 1. The architecture of Anton 2 is tailored for fine-grained event-driven operation, which improves performance by increasing the overlap of computation with communication, and also allows a wider range of algorithms to run efficiently, enabling many new software-based optimizations. A 512-node Anton 2 machine, currently in operation, is up to ten times faster than Anton 1 with the same number of nodes, greatly expanding the reach of all-atom biomolecular simulations. Anton 2 is the first platform to achieve simulation rates of multiple microseconds of physical time per day for systems with millions of atoms. Demonstrating strong scaling, the machine simulates a standard 23,558-atom benchmark system at a rate of 85 μs/day -- 180 times faster than any commodity hardware platform or general-purpose supercomputer.

Arash Ashari, N. Sedaghati, John Eisenlohr, S. Parthasarathy, P. Sadayappan. "Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications." SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, November 16, 2014. DOI: https://doi.org/10.1109/SC.2014.69
Abstract: Sparse matrix-vector multiplication (SpMV) is a widely used computational kernel. The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. However, the alternative representations typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized for applications requiring many iterative invocations of SpMV that use the same matrix, it is not always feasible -- for instance when analyzing large, dynamically evolving graphs. This paper presents ACSR, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups (bins) which have a similar number of non-zero elements. Further, for rows in bins that span a wide range of non-zero counts, dynamic parallelism is leveraged. A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads. A CUDA implementation of ACSR is shown to outperform SpMV implementations in the NVIDIA CUSP and cuSPARSE libraries on a set of sparse matrices representing power-law graphs. We also demonstrate the use of ACSR for the analysis of dynamic graphs, where the improvement over extant approaches is even higher.

{"title":"Slim Fly: A Cost Effective Low-Diameter Network Topology","authors":"Maciej Besta, T. Hoefler","doi":"10.1109/SC.2014.34","DOIUrl":"https://doi.org/10.1109/SC.2014.34","url":null,"abstract":"We introduce a high-performance cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. Slim Fly is based on graphs that approximate the solution to the degree-diameter problem. We analyze Slim Fly and compare it to both traditional and state-of the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centres as well as a detailed cost and power model. Slim Fly enables constructing cost effective and highly resilient data enter and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128830911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sanchit Misra, Md. Vasimuddin, K. Pamnany, Sriram P. Chockalingam, Yong Dong, Min Xie, M. Aluru, S. Aluru. "Parallel Bayesian Network Structure Learning for Genome-Scale Gene Networks." SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, November 16, 2014. DOI: https://doi.org/10.1109/SC.2014.43
Abstract: Learning Bayesian networks is NP-hard. Even with recent progress in heuristic and parallel algorithms, modeling capabilities still fall short of the scale of the problems encountered. In this paper, we present a massively parallel method for Bayesian network structure learning, and demonstrate its capability by constructing genome-scale gene networks of the model plant Arabidopsis thaliana from over 168.5 million gene expression values. We report strong scaling efficiency of 75% and demonstrate scaling to 1.57 million cores of the Tianhe-2 supercomputer. Our results represent increases of three and five orders of magnitude over previously published results in the scale of data analyzed and computations performed, respectively. We achieve this through algorithmic innovations, using efficient techniques to distribute work across all compute nodes, all available processors and coprocessors on each node, and all available threads on each processor and coprocessor, together with vectorization techniques to maximize single-thread performance.

Kurt B. Ferreira, Patrick M. Widener, Scott Levy, D. Arnold, T. Hoefler. "Understanding the Effects of Communication and Coordination on Checkpointing at Scale." SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, November 16, 2014. DOI: https://doi.org/10.1109/SC.2014.77
Abstract: Fault tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes, causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols designed to reduce log volumes so that these synchronization overheads are significantly reduced at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing and enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.

Zhezhe Chen, James Dinan, Zhen Tang, P. Balaji, Hua Zhong, Jun Wei, Tao Huang, Feng Qin. "MC-Checker: Detecting Memory Consistency Errors in MPI One-Sided Applications." SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, November 16, 2014. DOI: https://doi.org/10.1109/SC.2014.46
Abstract: One-sided communication decouples data movement and synchronization by providing support for asynchronous reads and updates of distributed shared data. While such interfaces can be extremely efficient, they also impose challenges in properly performing asynchronous accesses to shared data. This paper presents MC-Checker, a new tool that detects memory consistency errors in MPI one-sided applications. MC-Checker first performs online instrumentation and captures relevant dynamic events, such as one-sided communications and load/store operations. MC-Checker then performs analysis to detect memory consistency errors. When found, errors are reported along with useful diagnostic information. Experiments indicate that MC-Checker is effective at detecting and diagnosing memory consistency bugs in MPI one-sided applications, with low overhead, ranging from 24.6% to 71.1%, with an average of 45.2%.

Y. Son, O. Seongil, Hyunggyun Yang, Daejin Jung, Jung Ho Ahn, John Kim, Jangwoo Kim, Jae W. Lee. "Microbank: Architecting Through-Silicon Interposer-Based Main Memory Systems." SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, November 16, 2014. DOI: https://doi.org/10.1109/SC.2014.91
Abstract: Through-Silicon Interposer (TSI) has recently been proposed to provide high memory bandwidth and improve energy efficiency of the main memory system. However, the impact of TSI on main memory system architecture has not been well explored. While TSI improves the I/O energy efficiency, we show that it results in an unbalanced memory system design in terms of energy efficiency as the core DRAM dominates overall energy consumption. To balance and enhance the energy efficiency of a TSI-based memory system, we propose μbank, a novel DRAM device organization in which each bank is partitioned into multiple smaller banks (or μbanks) that operate independently like conventional banks with minimal area overhead. The μbank organization significantly increases the amount of bank-level parallelism to improve the performance and energy efficiency of the TSI-based memory system. The massive number of μbanks reduces bank conflicts, hence simplifying the memory system design. We evaluated a sophisticated prediction-based DRAM page-management policy, which can improve performance by up to 20.5% in a conventional memory system without μbanks. However, a μbank-based design does not require such a complex page-management policy and a simple open-page policy is often sufficient -- achieving within 5% of a perfect predictor. Our proposed μbank-based memory system improves the IPC and system energy-delay product by 1.62× and 4.80×, respectively, for memory-intensive SPEC 2006 benchmarks on average, over the baseline DDR3-based memory system.

{"title":"A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters","authors":"M. Noack, Florian Wende, T. Steinke, F. Cordes","doi":"10.1109/SC.2014.22","DOIUrl":"https://doi.org/10.1109/SC.2014.22","url":null,"abstract":"Standard offload programming models for the Xeon Phi, e.g. Intel LEO and OpenMP 4.0, are restricted to a single compute node and hence a limited number of coprocessors. Scaling applications across a Xeon Phi cluster/supercomputer thus requires hybrid programming approaches, usually MPI+X. In this work, we present a framework based on heterogeneous active messages (HAM-Offload) that provides the means to offload work to local and remote (co)processors using a unified offload API. Since HAM-Offload provides similar primitives as current local offload frameworks, existing applications can be easily ported to overcome the single-node limitation while keeping the convenient offload programming model. We demonstrate the effectiveness of the framework by using it to enable a real-world application from the field of molecular dynamics to use multiple local and remote Xeon Phis. The evaluation shows good scaling behavior. Compared with LEO, performance is equal for large offloads and significantly better for small offloads.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123865643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling MapReduce Vertically and Horizontally","authors":"I. El-Helw, Rutger F. H. Hofman, H. Bal","doi":"10.1109/SC.2014.48","DOIUrl":"https://doi.org/10.1109/SC.2014.48","url":null,"abstract":"Glass wing is a MapReduce framework that uses OpenCL to exploit multi-core CPUs and accelerators. However, compute device capabilities may vary significantly and require targeted optimization. Similarly, the availability of resources such as memory, storage and interconnects can severely impact overall job performance. In this paper, we present and analyze how MapReduce applications can improve their horizontal and vertical scalability by using a well controlled mixture of coarse- and fine-grained parallelism. Specifically, we discuss the Glass wing pipeline and its ability to overlap computation, communication, memory transfers and disk access. Additionally, we show how Glass wing can adapt to the distinct capabilities of a variety of compute devices by employing fine-grained parallelism. We experimentally evaluated the performance of five MapReduce applications and show that Glass wing outperforms Hadoop on a 64-node multi-core CPU cluster by factors between 1.2 and 4, and factors from 20 to 30 on a 23-node GPU cluster. Similarly, we show that Glass wing is at least 1.5 times faster than GPMR on the GPU cluster.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129018709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation","authors":"T. Peterka, D. Morozov, C. L. Phillips","doi":"10.1109/SC.2014.86","DOIUrl":"https://doi.org/10.1109/SC.2014.86","url":null,"abstract":"Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization, but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the sub domains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129292329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}