SC14: International Conference for High Performance Computing, Networking, Storage and Analysis: Latest Papers

Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format
J. Greathouse, Mayank Daga
{"title":"Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format","authors":"J. Greathouse, Mayank Daga","doi":"10.1109/SC.2014.68","DOIUrl":"https://doi.org/10.1109/SC.2014.68","url":null,"abstract":"The performance of sparse matrix vector multiplication (SpMV) is important to computational scientists. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs) has poor performance due to irregular memory access patterns, load imbalance, and reduced parallelism. This has led researchers to propose new storage formats. Unfortunately, dynamically transforming CSR into these formats has significant runtime and storage overheads. We propose a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs. Our implementation addresses the aforementioned challenges by (i) efficiently accessing DRAM by streaming data into the local scratchpad memory and (ii) dynamically assigning different numbers of rows to each parallel GPU compute unit. CSR-Adaptive achieves an average speedup of 14.7× over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128710529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 182
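The abstract above assumes familiarity with the CSR layout, so a minimal sketch may help. This is the plain row-by-row CSR SpMV that the paper treats as a baseline, not the CSR-Adaptive algorithm itself. The function name `csr_spmv`, the array names `row_ptr`, `col_idx`, and `values`, and the 3x3 matrix are illustrative assumptions, not taken from the paper.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr has one entry per row plus a final end marker;
    col_idx and values hold the nonzeros row by row.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        # The nonzeros of this row sit at positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Reads of x are indexed through col_idx, hence the irregular access pattern.
            y[row] += values[k] * x[col_idx[k]]
    return y


# Hypothetical 3x3 example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)
```

Even in this tiny case the rows hold different numbers of nonzeros (2, 1, and 2). A one-row-per-thread GPU mapping of this loop inherits that imbalance, which is the problem the abstract says CSR-Adaptive addresses by assigning different numbers of rows to each compute unit.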
Scaling the Power Wall: A Path to Exascale
Oreste Villa, Daniel R. Johnson, Mike O'Connor, Evgeny Bolotin, D. Nellans, J. Luitjens, Nikolai Sakharnykh, Peng Wang, P. Micikevicius, Anthony Scudiero, S. Keckler, W. Dally
{"title":"Scaling the Power Wall: A Path to Exascale","authors":"Oreste Villa, Daniel R. Johnson, Mike O'Connor, Evgeny Bolotin, D. Nellans, J. Luitjens, Nikolai Sakharnykh, Peng Wang, P. Micikevicius, Anthony Scudiero, S. Keckler, W. Dally","doi":"10.1109/SC.2014.73","DOIUrl":"https://doi.org/10.1109/SC.2014.73","url":null,"abstract":"Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020 and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signalling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124135014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 126
pTatin3D: High-Performance Methods for Long-Term Lithospheric Dynamics
D. May, Jed Brown, L. Pourhiet
{"title":"pTatin3D: High-Performance Methods for Long-Term Lithospheric Dynamics","authors":"D. May, Jed Brown, L. Pourhiet","doi":"10.1109/SC.2014.28","DOIUrl":"https://doi.org/10.1109/SC.2014.28","url":null,"abstract":"Simulations of long-term lithospheric deformation involve post-failure analysis of high-contrast brittle materials driven by buoyancy and processes at the free surface. Geodynamic phenomena such as subduction and continental rifting take place over million-year time scales and thus require efficient solution methods. We present pTatin3D, a geodynamics modeling package utilising the material-point-method for tracking material composition, combined with a multigrid finite-element method to solve heterogeneous, incompressible visco-plastic Stokes problems. Here we analyze the performance and algorithmic tradeoffs of pTatin3D's multigrid preconditioner. Our matrix-free geometric multigrid preconditioner trades flops for memory bandwidth to produce a time-to-solution > 2× faster than the best available methods utilising stored matrices (plagued by memory bandwidth limitations), exploits local element structure to achieve weak scaling at 30% of FPU peak on Cray XC-30, has improved dynamic range due to smaller memory footprint, and has more consistent timing and better intra-node scalability due to reduced memory-bus and cache pressure.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129142533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 62
A User-Friendly Approach for Tuning Parallel File Operations
R. McLay, D. James, Si Liu, J. Cazes, W. Barth
{"title":"A User-Friendly Approach for Tuning Parallel File Operations","authors":"R. McLay, D. James, Si Liu, J. Cazes, W. Barth","doi":"10.1109/SC.2014.24","DOIUrl":"https://doi.org/10.1109/SC.2014.24","url":null,"abstract":"The Lustre file system provides high aggregated I/O bandwidth and is in widespread use throughout the HPC community. Here we report on work (1) developing a model for understanding collective parallel MPI write operations on Lustre, and (2) producing a library that optimizes parallel write performance in a user-friendly way. We note that a system's default stripe count is rarely a good choice for parallel I/O, and that performance depends on a delicate balance between the number of stripes and the actual (not requested) number of collective writers. Unfortunate combinations of these parameters may degrade performance considerably. For the programmer, however, it's all about the stripe count: an informed choice of this single parameter allows MPI to assign writers in a way that achieves near-optimal performance. We offer recommendations for those who wish to tune performance manually and describe the easy-to-use T3PIO library that manages the tuning automatically.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129367909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
In-Situ Feature Extraction of Large Scale Combustion Simulations Using Segmented Merge Trees
Aaditya G. Landge, Valerio Pascucci, A. Gyulassy, Janine Bennett, H. Kolla, Jacqueline H. Chen, P. Bremer
{"title":"In-Situ Feature Extraction of Large Scale Combustion Simulations Using Segmented Merge Trees","authors":"Aaditya G. Landge, Valerio Pascucci, A. Gyulassy, Janine Bennett, H. Kolla, Jacqueline H. Chen, P. Bremer","doi":"10.1109/SC.2014.88","DOIUrl":"https://doi.org/10.1109/SC.2014.88","url":null,"abstract":"The ever increasing amount of data generated by scientific simulations coupled with system I/O constraints are fueling a need for in-situ analysis techniques. Of particular interest are approaches that produce reduced data representations while maintaining the ability to redefine, extract, and study features in a post-process to obtain scientific insights. This paper presents two variants of in-situ feature extraction techniques using segmented merge trees, which encode a wide range of threshold based features. The first approach is a fast, low communication cost technique that generates an exact solution but has limited scalability. The second is a scalable, local approximation that nevertheless is guaranteed to correctly extract all features up to a predefined size. We demonstrate both variants using some of the largest combustion simulations available on leadership class supercomputers. Our approach allows state-of-the-art, feature-based analysis to be performed in-situ at significantly higher frequency than currently possible and with negligible impact on the overall simulation runtime.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"139 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134366363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 66
Structure Slicing: Extending Logical Regions with Fields
Michael A. Bauer, Sean Treichler, Elliott Slaughter, A. Aiken
{"title":"Structure Slicing: Extending Logical Regions with Fields","authors":"Michael A. Bauer, Sean Treichler, Elliott Slaughter, A. Aiken","doi":"10.1109/SC.2014.74","DOIUrl":"https://doi.org/10.1109/SC.2014.74","url":null,"abstract":"Applications on modern supercomputers are increasingly limited by the cost of data movement, but mainstream programming systems have few abstractions for describing the structure of a program's data. Consequently, the burden of managing data movement, placement, and layout currently falls primarily upon the programmer. To address this problem we previously proposed a data model based on logical regions and described Legion, a programming system incorporating logical regions. In this paper, we present structure slicing, which incorporates fields into the logical region data model. We show that structure slicing enables Legion to automatically infer task parallelism from field non-interference, decouple the specification of data usage from layout, and reduce the overall amount of data moved. We demonstrate that structure slicing enables both strong and weak scaling of three Legion applications including S3D, a production combustion simulation that uses logical regions with thousands of fields, with speedups of up to 3.68X over a vectorized CPU-only Fortran implementation and 1.88X over an independently hand-tuned OpenACC code.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"36 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131327594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 27
An Image-Based Approach to Extreme Scale in Situ Visualization and Analysis
J. Ahrens, S. Jourdain, P. O’leary, J. Patchett, D. Rogers, M. Petersen
{"title":"An Image-Based Approach to Extreme Scale in Situ Visualization and Analysis","authors":"J. Ahrens, S. Jourdain, P. O’leary, J. Patchett, D. Rogers, M. Petersen","doi":"10.1109/SC.2014.40","DOIUrl":"https://doi.org/10.1109/SC.2014.40","url":null,"abstract":"Extreme scale scientific simulations are leading a charge to exascale computation, and data analytics runs the risk of being a bottleneck to scientific discovery. Due to power and I/O constraints, we expect in situ visualization and analysis will be a critical component of these workflows. Options for extreme scale data analysis are often presented as a stark contrast: write large files to disk for interactive, exploratory analysis, or perform in situ analysis to save detailed data about phenomena that a scientist knows about in advance. We present a novel framework for a third option - a highly interactive, image-based approach that promotes exploration of simulation results, and is easily accessed through extensions to widely used open source tools. This in situ approach supports interactive exploration of a wide range of results, while still significantly reducing data movement and storage.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129849558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 191
Metascalable Quantum Molecular Dynamics Simulations of Hydrogen-on-Demand
K. Nomura, R. Kalia, A. Nakano, P. Vashishta, K. Shimamura, F. Shimojo, Manaschai Kunaseth, P. Messina, N. A. Romero
{"title":"Metascalable Quantum Molecular Dynamics Simulations of Hydrogen-on-Demand","authors":"K. Nomura, R. Kalia, A. Nakano, P. Vashishta, K. Shimamura, F. Shimojo, Manaschai Kunaseth, P. Messina, N. A. Romero","doi":"10.1109/SC.2014.59","DOIUrl":"https://doi.org/10.1109/SC.2014.59","url":null,"abstract":"We enabled an unprecedented scale of quantum molecular dynamics simulations through algorithmic innovations. A new lean divide-and-conquer density functional theory algorithm significantly reduces the prefactor of the O(N) computational cost based on complexity and error analyses. A globally scalable and locally fast solver hybridizes a global real-space multigrid with local plane-wave bases. The resulting weak-scaling parallel efficiency was 0.984 on 786,432 IBM Blue Gene/Q cores for a 50.3 million-atom (39.8 trillion degrees-of-freedom) system. The time-to-solution was 60 times less than the previous state of the art, owing to enhanced strong scaling by hierarchical band-space domain decomposition and high floating-point performance (50.5% of the peak). Production simulation involving 16,661 atoms for 21,140 time steps (or 129,208 self-consistent-field iterations) revealed a novel nanostructural design for on-demand hydrogen production from water, advancing renewable energy technologies. This metascalable (or \"design once, scale on new architectures\") algorithm is used for broader applications within a recently proposed divide-conquer-recombine paradigm.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128849376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
RAHTM: Routing Algorithm Aware Hierarchical Task Mapping
Ahmed H. Abdel-Gawad, Mithuna Thottethodi, A. Bhatele
{"title":"RAHTM: Routing Algorithm Aware Hierarchical Task Mapping","authors":"Ahmed H. Abdel-Gawad, Mithuna Thottethodi, A. Bhatele","doi":"10.1109/SC.2014.32","DOIUrl":"https://doi.org/10.1109/SC.2014.32","url":null,"abstract":"The mapping of MPI processes to compute nodes on a supercomputer can have a significant impact on communication performance. For high performance computing (HPC) applications with iterative communication, rich offline analysis of such communication can improve performance by optimizing the mapping. Unfortunately, current practices for at-scale HPC consider only the communication graph and network topology in solving this problem. We propose Routing Algorithm aware Hierarchical Task Mapping (RAHTM) which leverages the knowledge of the routing algorithm to improve task mapping. RAHTM achieves high quality mappings by combining (1) a divide-and-conquer strategy to achieve scalability, (2) a limited search of mappings, and (3) a linear programming based routing-aware approach to evaluate possible mappings in the search space. RAHTM achieves 20% reduction in the communication time and 9% reduction in the overall execution time for three communication-heavy benchmarks scaled up to 16,384 processes on a Blue Gene/Q platform.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134106765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Finding Constant from Change: Revisiting Network Performance Aware Optimizations on IaaS Clouds
Yifan Gong, Bingsheng He, Dan Li
{"title":"Finding Constant from Change: Revisiting Network Performance Aware Optimizations on IaaS Clouds","authors":"Yifan Gong, Bingsheng He, Dan Li","doi":"10.1109/SC.2014.85","DOIUrl":"https://doi.org/10.1109/SC.2014.85","url":null,"abstract":"Network performance aware optimizations have long been an effective approach to optimizing distributed applications on traditional network environments. However, the assumptions of network topology or direct use of several measurements of pair-wise network performance for optimizations are no longer valid on IaaS clouds. Virtualization hides network topology from users, and direct use of network performance measurements may not represent long-term performance. To enable existing network performance aware optimizations on IaaS clouds, we propose to decouple constant component from dynamic network performance while minimizing the difference by a mathematical method called RPCA (Robust Principal Component Analysis). We use the constant component to guide network performance aware optimizations and demonstrate the efficiency of our approach by adopting network aware optimizations for collective communications of MPI and generic topology mapping as well as two real-world applications, N-body and conjugate gradient (CG). Our experiments on Amazon EC2 and simulations demonstrate significant performance improvement on guiding the optimizations.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125721396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19