Latest Publications from SC17: International Conference for High Performance Computing, Networking, Storage and Analysis

Distributed Southwell: An Iterative Method with Low Communication Costs
Jordi Wolfson-Pou, Edmond Chow
{"title":"Distributed Southwell: An Iterative Method with Low Communication Costs","authors":"Jordi Wolfson-Pou, Edmond Chow","doi":"10.1145/3126908.3126966","DOIUrl":"https://doi.org/10.1145/3126908.3126966","url":null,"abstract":"We present a new algorithm, the Distributed Southwell method, as a competitor to Block Jacobi for preconditioning and multi-grid smoothing. It is based on the Southwell iterative method, which is sequential, where only the equation with the largest residual is relaxed per iteration. The Parallel Southwell method extends this idea by relaxing equation i if it has the largest residual among all the equations coupled to variable i. Since communication is required for processes to exchange residuals, this method in distributed memory can be expensive. Distributed Southwell uses a novel scheme to reduce this communication of residuals while avoiding deadlock. Using test problems from the SuiteSparse Matrix Collection, we show that Distributed Southwell requires less communication to reach the same accuracy when compared to Parallel Southwell. Additionally, we show that the convergence of Distributed Southwell does not degrade like that of Block Jacobi when the number of processes is increased.","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131163515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
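
To make the starting point concrete, here is a minimal NumPy sketch of the classical sequential Southwell iteration the paper builds on: at each step, only the equation whose residual has the largest magnitude is relaxed. This illustrates the baseline method, not the authors' distributed algorithm.

```python
import numpy as np

def southwell(A, b, x0, tol=1e-8, max_steps=100000):
    """Sequential Southwell: relax only the equation whose residual
    currently has the largest magnitude (sketch of the baseline the
    paper parallelizes and distributes)."""
    x = x0.copy()
    r = b - A @ x
    for _ in range(max_steps):
        i = int(np.argmax(np.abs(r)))     # equation with largest residual
        if abs(r[i]) < tol:
            break
        delta = r[i] / A[i, i]            # relax equation i
        x[i] += delta
        r -= delta * A[:, i]              # incremental residual update
    return x
```

After relaxing equation i, its residual becomes exactly zero, and the rank-one residual update keeps the per-step cost proportional to the nonzeros in column i. In a distributed setting it is the tracking of the largest residual among neighbors, not the relaxation itself, that drives the communication cost the paper attacks.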
Tessellating Stencils
Liang Yuan, Yunquan Zhang, Peng Guo, Shan Huang
{"title":"Tessellating Stencils","authors":"Liang Yuan, Yunquan Zhang, Peng Guo, Shan Huang","doi":"10.1145/3126908.3126920","DOIUrl":"https://doi.org/10.1145/3126908.3126920","url":null,"abstract":"Stencil computations represent a very common class of nested loops in scientific and engineering applications. The exhaustively studied tiling is one of the most powerful transformation techniques to explore the data locality and parallelism. Unlike previous work, which mostly blocks the iteration space of a stencil directly, this paper proposes a novel two-level tessellation scheme. A set of blocks are designed to tessellate the spatial space in various ways. The blocks can be processed in parallel without redundant computation. This corresponds to extending them along the time dimension and can form a tessellation of the iteration space. Experimental results show that our code performs up to 12% better than the existing highly concurrent schemes for the 3d27p stencil.","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131918929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
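
For orientation, the sketch below shows the kind of computation being optimized: a spatially tiled 2-D 5-point Jacobi stencil. The paper targets harder cases such as 3d27p, and its two-level tessellation additionally extends tiles in time without redundant work; the tile size and the 5-point kernel here are illustrative choices only.

```python
import numpy as np

def jacobi_5pt_tiled(u, steps, tile=64):
    """Spatially tiled 2-D 5-point Jacobi sweeps: the plain spatial
    blocking that tessellation schemes improve on."""
    n, m = u.shape
    for _ in range(steps):
        v = u.copy()
        for i0 in range(1, n - 1, tile):          # tiles are independent
            for j0 in range(1, m - 1, tile):      # and could run in parallel
                i1, j1 = min(i0 + tile, n - 1), min(j0 + tile, m - 1)
                v[i0:i1, j0:j1] = 0.25 * (
                    u[i0-1:i1-1, j0:j1] + u[i0+1:i1+1, j0:j1] +
                    u[i0:i1, j0-1:j1-1] + u[i0:i1, j0+1:j1+1])
        u = v
    return u
```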
Scientific User Behavior and Data-Sharing Trends in a Petascale File System
Seung-Hwan Lim, Hyogi Sim, Raghul Gunasekaran, Sudharshan S. Vazhkudai
{"title":"Scientific User Behavior and Data-Sharing Trends in a Petascale File System","authors":"Seung-Hwan Lim, Hyogi Sim, Raghul Gunasekaran, Sudharshan S. Vazhkudai","doi":"10.1145/3126908.3126924","DOIUrl":"https://doi.org/10.1145/3126908.3126924","url":null,"abstract":"The Oak Rrdge Leadership Computing Facility (OLCF) runs the No. 4 supercomputer in the world, supported by a petascale file system, to facilitate scientific discovery. In this paper, using the daily file system metadata snapshots collected over 500 days, we have studied the behavioral trends of 1,362 active users and 380 projects across 35 science domains. In particular, we have analyzed both individual and collective behavior of users and projects, highlighting needs from individual communities and the overall requirements to operate the file system. We have analyzed the metadata across three dimensions, namely (i) the projects’ file generation and usage trends, using quantitative file system-centric metrics, (ii) scientific user behavior on the file system, and (iii) the data sharing trends of users and projects. To the best of our knowledge, our work is the first of its kind to provide comprehensive insights on user behavior from multiple science domains through metadata analysis of a large-scale shared file system. We envision that this OLCF case study will provide valuable insights for the design, operation, and management of storage systems at scale, and also encourage other HPC centers to undertake similar such efforts.CCS CONCEPTS•Software and its engineering →File systems management; •Information systems →Distributed StOrage; •General and reference →Measurement;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"395 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133465709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
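
A study like this reduces to aggregating snapshot records along the dimensions of interest. The sketch below shows the flavor of such an analysis for one dimension, per-project file counts, capacity, and distinct owners, from a single daily snapshot; the CSV schema (project, owner, size_bytes columns) is a hypothetical stand-in, not the OLCF snapshot format.

```python
import csv
from collections import defaultdict

def project_usage(snapshot_csv):
    """Aggregate per-project file count, capacity, and distinct owners
    from one daily metadata snapshot (illustrative; schema assumed)."""
    stats = defaultdict(lambda: {"files": 0, "bytes": 0, "owners": set()})
    with open(snapshot_csv, newline="") as f:
        for row in csv.DictReader(f):
            s = stats[row["project"]]
            s["files"] += 1
            s["bytes"] += int(row["size_bytes"])
            s["owners"].add(row["owner"])
    return stats
```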
Transactional NVM Cache with High Performance and Crash Consistency
Q. Wei, Chundong Wang, Cheng Chen, Yechao Yang, Jun Yang, Mingdi Xue
{"title":"Transactional NVM Cache with High Performance and Crash Consistency","authors":"Q. Wei, Chundong Wang, Cheng Chen, Yechao Yang, Jun Yang, Mingdi Xue","doi":"10.1145/3126908.3126940","DOIUrl":"https://doi.org/10.1145/3126908.3126940","url":null,"abstract":"The byte-addressable non-volatile memory (NVM) is new promising storage medium. Compared to NAND flash memory, the next-generation NVM not only preserves the durability of stored data but has much shorter access latencies. An architect can utilize the fast and persistent NVM as an external disk cache. Regarding the system’s crash consistency, a prevalent journaling file system needs to run atop an NVM disk cache. However, the performance is severely impaired by redundant efforts in achieving crash consistency in both file system and disk cache. Therefore, we propose a new mechanism called transactional NVM disk cache (Tinca). In brief, Tinca jointly guarantees consistency of file system and disk cache and removes the performance penalty of file system journaling with a lightweight transaction scheme. Evaluations confirm that Tinca significantly outperforms state-of-the-art design by up to $2.5 times$ in local and cluster tests without causing any inconsistency issue. CCS CONCEPTS • Information systems $rightarrow$ Storage class memory; • Software andits engineering $rightarrow$ Consistency; File systems management; Operating systems;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128474326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
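
The core discipline of a lightweight transaction scheme is to persist intent before applying updates in place, so a crash at any point leaves either the old or the new state recoverable. The sketch below imitates that discipline with a redo log; it uses file fsync as a stand-in for NVM cache-line flushes and fences, and is not Tinca's actual protocol.

```python
import json
import os

def commit(updates, log_path, data_path):
    """Redo-log transaction sketch. `updates` maps byte offsets to
    bytes objects; a complete, parseable log file acts as the commit
    marker that recovery would check before replaying."""
    # 1. Persist intent durably before touching the data.
    with open(log_path, "w") as log:
        json.dump({str(off): buf.hex() for off, buf in updates.items()}, log)
        log.flush()
        os.fsync(log.fileno())
    # 2. Apply in place; a crash here is repaired by replaying the log.
    with open(data_path, "r+b") as data:
        for off, buf in updates.items():
            data.seek(off)
            data.write(buf)
        data.flush()
        os.fsync(data.fileno())
    # 3. Retire the transaction.
    os.remove(log_path)
```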
Optimizing Geometric Multigrid Method Computation using a DSL Approach
Vinay Vasista, Kumudha Narasimhan, Siddharth Bhat, Uday Bondhugula
{"title":"Optimizing Geometric Multigrid Method Computation using a DSL Approach","authors":"Vinay Vasista, Kumudha Narasimhan, Siddharth Bhat, Uday Bondhugula","doi":"10.1145/3126908.3126968","DOIUrl":"https://doi.org/10.1145/3126908.3126968","url":null,"abstract":"The Geometric Multigrid (GMG) method is widely used in numerical analysis to accelerate the convergence of partial differential equations solvers using a hierarchy of grid discretizations. Multiple grid sizes and recursive expression of multigrid cycles make the task of program optimization tedious. A high-level language that aids domain experts for GMG with effective optimization and parallelization support is thus valuable. We demonstrate how high performance can be achieved along with enhanced programmability for GMG, with new language/optimization support in the PolyMage DSL framework. We compare our approach with (a) hand-optimized code, (b) hand-optimized code in conjunction with polyhedral optimization techniques, and (c) the existing PolyMage optimizer adapted to multigrid. We use benchmarks varying in multigrid cycle structure and smoothing steps for evaluation. On a 24-core Intel Xeon Haswell multicore system, our automatically optimized codes achieve a mean improvement of 3. 2x over straightforward parallelization, and 1. 31x over the PolyMage optimizer.CCS CONCEPTS• Software and its engineering →Compilers;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116851954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
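
The recursive cycle structure such a DSL must optimize looks like the V-cycle below, written as a plain NumPy sketch for a 1-D Poisson problem (weighted-Jacobi smoothing, injection restriction, linear interpolation; grid size a power of two). It illustrates why hand-optimizing across grid levels is tedious, not how PolyMage generates code.

```python
import numpy as np

def v_cycle(u, f, nu=3):
    """One V-cycle for -u'' = f on [0,1] with Dirichlet ends;
    u and f are float arrays on n+1 points, n a power of two."""
    n = u.size - 1
    h2 = 1.0 / n**2
    if n <= 2:                              # coarsest level: direct solve
        if n == 2:
            u[1] = 0.5 * (u[0] + u[2] + h2 * f[1])
        return u
    def smooth(u, sweeps):                  # weighted Jacobi, weight 2/3
        for _ in range(sweeps):
            u[1:-1] += (1/3) * (h2 * f[1:-1] + u[:-2] + u[2:] - 2 * u[1:-1])
        return u
    u = smooth(u, nu)                       # pre-smooth
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / h2
    ec = v_cycle(np.zeros(n // 2 + 1), r[::2].copy(), nu)  # coarse correction
    u[::2] += ec                            # prolongate: coincident points
    u[1:-1:2] += 0.5 * (ec[:-1] + ec[1:])   # ... plus linear interpolation
    return smooth(u, nu)                    # post-smooth
```

Calling v_cycle repeatedly drives the residual down; a DSL such as PolyMage can fuse and tile the smoothing and grid-transfer operators across these levels, which is exactly the tedium the paper automates.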
Low Communication FMM-Accelerated FFT on GPUs
C. Cecka
{"title":"Low Communication FMM-Accelerated FFT on GPUs","authors":"C. Cecka","doi":"10.1145/3126908.3126919","DOIUrl":"https://doi.org/10.1145/3126908.3126919","url":null,"abstract":"Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1. 3$times$ and 2. 2$times$ against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129938113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
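
The three transposes come from the classic four-step decomposition of a 1-D FFT into batched smaller FFTs plus twiddle factors; in a distributed setting, each reshuffle between row and column ownership is an all-to-all. The NumPy sketch below reproduces that decomposition on one node so the communication pattern is visible; it is background for the paper, not its FMM reformulation.

```python
import numpy as np

def four_step_fft(x, rows, cols):
    """1-D FFT of length rows*cols via the four-step method.
    Distributed versions transpose around each batched-FFT phase,
    costing three all-to-alls in total."""
    X = x.reshape(rows, cols)
    X = np.fft.fft(X, axis=0)              # phase 1: column FFTs
    X = X * np.exp(-2j * np.pi *           # phase 2: twiddle factors
                   np.outer(np.arange(rows), np.arange(cols)) / x.size)
    X = np.fft.fft(X, axis=1)              # phase 3: row FFTs
    return X.T.reshape(-1)                 # phase 4: transpose to output order

x = np.random.rand(1024) + 1j * np.random.rand(1024)
assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))
```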
Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1
Kenneth Raffenetti, A. Amer, Lena Oden, C. Archer, Wesley Bland, H. Fujita, Yanfei Guo, T. Janjusic, D. Durnov, M. Blocksome, Min Si, Sangmin Seo, Akhil Langer, G. Zheng, Masamichi Takagi, Paul K. Coffman, Jithin Jose, S. Sur, Alexander Sannikov, Sergey Oblomov, Michael Chuvelev, M. Hatanaka, Xin Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, P. Balaji
{"title":"Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1","authors":"Kenneth Raffenetti, A. Amer, Lena Oden, C. Archer, Wesley Bland, H. Fujita, Yanfei Guo, T. Janjusic, D. Durnov, M. Blocksome, Min Si, Sangmin Seo, Akhil Langer, G. Zheng, Masamichi Takagi, Paul K. Coffman, Jithin Jose, S. Sur, Alexander Sannikov, Sergey Oblomov, Michael Chuvelev, M. Hatanaka, Xin Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, P. Balaji","doi":"10.1145/3126908.3126963","DOIUrl":"https://doi.org/10.1145/3126908.3126963","url":null,"abstract":"This paper provides an in-depth analysis of the software overheads in the MPI performance-critical path and exposes mandatory performance overheads that are unavoidable based on the MPI-3.1 specification. We first present a highly optimized implementation of the MPI-3.1 standard in which the communication stack-all the way from the application to the low-level network communication API-takes only a few tens of instructions. We carefully study these instructions and analyze the root cause of the overheads based on specific requirements from the MPI standard that are unavoidable under the current MPI standard. We recommend potential changes to the MPI standard that can minimize these overheads. Our experimental results on a variety of network architectures and applications demonstrate significant benefits from our proposed changes. CCS CONCEPTS • Computing methodologies $rightarrow$ Concurrent algorithms; Massively parallel algorithms;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123800493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
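
Per-call software overhead is exactly what a small-message ping-pong exposes, since wire time is negligible for an 8-byte payload. A minimal microbenchmark of that kind, assuming mpi4py is installed and the script is launched with exactly two ranks (e.g., mpiexec -n 2 python pingpong.py):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
assert comm.Get_size() == 2, "run with exactly two ranks"
rank = comm.Get_rank()
buf = np.zeros(8, dtype=np.uint8)        # 8-byte payload
reps = 100_000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1, tag=0)
        comm.Recv([buf, MPI.BYTE], source=1, tag=0)
    else:
        comm.Recv([buf, MPI.BYTE], source=0, tag=0)
        comm.Send([buf, MPI.BYTE], dest=0, tag=0)
one_way = (MPI.Wtime() - t0) / (2 * reps)
if rank == 0:
    print(f"small-message one-way latency: {one_way * 1e6:.2f} us")
```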
Gravel: Fine-Grain GPU-Initiated Network Messages
Marc S. Orr, Shuai Che, Bradford M. Beckmann, M. Oskin, S. Reinhardt, D. Wood
{"title":"Gravel: Fine-Grain GPU-Initiated Network Messages","authors":"Marc S. Orr, Shuai Che, Bradford M. Beckmann, M. Oskin, S. Reinhardt, D. Wood","doi":"10.1145/3126908.3126914","DOIUrl":"https://doi.org/10.1145/3126908.3126914","url":null,"abstract":"Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces programmers to manually route messages through the host CPU. Other models allow GPU-initiated communication, but are inefficient for small messages. To enable fine-grain PGAS-style communication between threads executing on different GPUs, we introduce Gravel. GPU-initiated messages are offloaded through a GPU-efficient concurrent queue to an aggregator (implemented with CPU threads), which combines messages targeting to the same destination. Gravel leverages diverged work-group-level semantics to amortize synchronization across the GPU’s data-parallel lanes. Using Gravel, we can distribute six applications, each with frequent small messages, across a cluster of eight GPU-accelerated nodes. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show Gravel is more programmable and usually performs better than prior GPU networking models. CCS CONCEPTS Computer methodologies→Massively parallel algorithms;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127123525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
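
The aggregation step is the heart of the design: many producers push small (destination, payload) records through a shared queue, and a host-side thread coalesces them into per-destination batches so the network sees a few large messages instead of many tiny ones. A CPU-only Python sketch of that pattern (the real system uses a GPU-efficient queue and multiple CPU aggregator threads):

```python
import threading
from queue import Queue
from collections import defaultdict

def aggregator(q, send, batch_size=64):
    """Drain (dest, payload) records and coalesce them per destination;
    producers signal shutdown by enqueueing None."""
    batches = defaultdict(list)
    while True:
        item = q.get()
        if item is None:
            break
        dest, payload = item
        batches[dest].append(payload)
        if len(batches[dest]) >= batch_size:
            send(dest, batches.pop(dest))   # one large send, not many small
    for dest, msgs in batches.items():      # flush partial batches at shutdown
        send(dest, msgs)

q = Queue()
sent = []
t = threading.Thread(target=aggregator, args=(q, lambda d, m: sent.append((d, m))))
t.start()
for i in range(256):
    q.put((i % 4, f"msg{i}"))               # 256 fine-grain messages, 4 destinations
q.put(None)
t.join()
print(len(sent), "batched sends instead of 256 small ones")
```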
Experimental and Analytical Study of Xeon Phi Reliability
Daniel Oliveira, L. Pilla, Nathan Debardeleben, S. Blanchard, H. Quinn, I. Koren, P. Navaux, P. Rech
{"title":"Experimental and Analytical Study of Xeon Phi Reliability","authors":"Daniel Oliveira, L. Pilla, Nathan Debardeleben, S. Blanchard, H. Quinn, I. Koren, P. Navaux, P. Rech","doi":"10.1145/3126908.3126960","DOIUrl":"https://doi.org/10.1145/3126908.3126960","url":null,"abstract":"We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon Phi processors based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of Xeon Phi, we quantify Silent Data Corruption (SDCs) by correlating the distribution of corrupted elements in the output to the application’s characteristics. We evaluate the benefits of imprecise computing for reducing the programs’ error rate. For example, for HotSpot a 0.5% tolerance in the output value reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by different criticalities. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than the remaining portions. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions.CCS CONCEPTS• Computer systems organization → Parallel architectures; • Hardware → Fault tolerance;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115477890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
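
High-level fault injection and tolerance-aware SDC classification can both be stated in a few lines. The sketch below flips one random bit in a contiguous float64 output array and classifies the run against a golden copy, with a relative tolerance standing in for the paper's imprecise-computing criterion; it illustrates this class of methodology, not the authors' injector.

```python
import numpy as np

def inject_bitflip(arr, rng):
    """Flip one random bit in a contiguous float64 array (a crude
    high-level transient-fault model)."""
    bits = arr.view(np.uint64).reshape(-1)
    i = rng.integers(bits.size)
    bits[i] ^= np.uint64(1) << np.uint64(rng.integers(64))

def classify(golden, observed, rel_tol=0.005):
    """Label a faulty run: deviations within the tolerance (e.g. 0.5%)
    are treated as acceptable, larger ones as SDCs."""
    return "tolerable" if np.allclose(observed, golden, rtol=rel_tol) else "SDC"

rng = np.random.default_rng(0)
golden = np.linspace(0.0, 1.0, 1000)
faulty = golden.copy()
inject_bitflip(faulty, rng)
print(classify(golden, faulty))
```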
18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios
H. Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang, Xiaofei Chen
{"title":"18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios","authors":"H. Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang, Xiaofei Chen","doi":"10.1145/3126908.3126910","DOIUrl":"https://doi.org/10.1145/3126908.3126910","url":null,"abstract":"This paper reports our large-scale nonlinear earthquake simulation software on Sunway TaihuLight. Our innovations include: (1) a customized parallelization scheme that employs the 10 million cores efficiently at both the process and the thread levels; (2) an elaborate memory scheme that integrates on-chip halo exchange through register communcation, optimized blocking configuration guided by an analytic model, and coalesced DMA access with array fusion; (3) on-the-fly compression that doubles the maximum problem size and further improves the performance by 24%. With these innovations to remove the memory constraints of Sunway TaihuLight, our software achieves over 15% of the system’s peak, better than the 11.8% efficiency achieved by a similar software running on Titan, whose byte to flop ratio is 5 times better than TaihuLight. The extreme cases demonstrate a sustained performance of over 18.9 Pflops, enabling the simulation of Tangshan earthquake as an 18-Hz scenario with an 8-meter resolution.","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116880416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 96
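
Halo exchange is the communication pattern behind innovations (1) and (2): each subdomain sends its boundary layer to its neighbors and receives ghost cells back every time step. A 1-D mpi4py sketch of the pattern (the production code is 3-D, performs the exchange on-chip via register communication, and overlaps it with computation):

```python
import numpy as np
from mpi4py import MPI

def exchange_halos(u, comm):
    """Swap one-cell halos with left/right neighbors in a 1-D domain
    decomposition; u[0] and u[-1] are ghost cells, u[1] and u[-2]
    are the owned boundary values."""
    rank, size = comm.Get_rank(), comm.Get_size()
    reqs = []
    if rank > 0:
        reqs.append(comm.Isend(u[1:2], dest=rank - 1))
        reqs.append(comm.Irecv(u[0:1], source=rank - 1))
    if rank < size - 1:
        reqs.append(comm.Isend(u[-2:-1], dest=rank + 1))
        reqs.append(comm.Irecv(u[-1:], source=rank + 1))
    MPI.Request.Waitall(reqs)
```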