{"title":"Distributed Southwell: An Iterative Method with Low Communication Costs","authors":"Jordi Wolfson-Pou, Edmond Chow","doi":"10.1145/3126908.3126966","DOIUrl":"https://doi.org/10.1145/3126908.3126966","url":null,"abstract":"We present a new algorithm, the Distributed Southwell method, as a competitor to Block Jacobi for preconditioning and multi-grid smoothing. It is based on the Southwell iterative method, which is sequential, where only the equation with the largest residual is relaxed per iteration. The Parallel Southwell method extends this idea by relaxing equation i if it has the largest residual among all the equations coupled to variable i. Since communication is required for processes to exchange residuals, this method in distributed memory can be expensive. Distributed Southwell uses a novel scheme to reduce this communication of residuals while avoiding deadlock. Using test problems from the SuiteSparse Matrix Collection, we show that Distributed Southwell requires less communication to reach the same accuracy when compared to Parallel Southwell. Additionally, we show that the convergence of Distributed Southwell does not degrade like that of Block Jacobi when the number of processes is increased.","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131163515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tessellating Stencils","authors":"Liang Yuan, Yunquan Zhang, Peng Guo, Shan Huang","doi":"10.1145/3126908.3126920","DOIUrl":"https://doi.org/10.1145/3126908.3126920","url":null,"abstract":"Stencil computations represent a very common class of nested loops in scientific and engineering applications. The exhaustively studied tiling is one of the most powerful transformation techniques to explore the data locality and parallelism. Unlike previous work, which mostly blocks the iteration space of a stencil directly, this paper proposes a novel two-level tessellation scheme. A set of blocks are designed to tessellate the spatial space in various ways. The blocks can be processed in parallel without redundant computation. This corresponds to extending them along the time dimension and can form a tessellation of the iteration space. Experimental results show that our code performs up to 12% better than the existing highly concurrent schemes for the 3d27p stencil.","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131918929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seung-Hwan Lim, Hyogi Sim, Raghul Gunasekaran, Sudharshan S. Vazhkudai
{"title":"Scientific User Behavior and Data-Sharing Trends in a Petascale File System","authors":"Seung-Hwan Lim, Hyogi Sim, Raghul Gunasekaran, Sudharshan S. Vazhkudai","doi":"10.1145/3126908.3126924","DOIUrl":"https://doi.org/10.1145/3126908.3126924","url":null,"abstract":"The Oak Rrdge Leadership Computing Facility (OLCF) runs the No. 4 supercomputer in the world, supported by a petascale file system, to facilitate scientific discovery. In this paper, using the daily file system metadata snapshots collected over 500 days, we have studied the behavioral trends of 1,362 active users and 380 projects across 35 science domains. In particular, we have analyzed both individual and collective behavior of users and projects, highlighting needs from individual communities and the overall requirements to operate the file system. We have analyzed the metadata across three dimensions, namely (i) the projects’ file generation and usage trends, using quantitative file system-centric metrics, (ii) scientific user behavior on the file system, and (iii) the data sharing trends of users and projects. To the best of our knowledge, our work is the first of its kind to provide comprehensive insights on user behavior from multiple science domains through metadata analysis of a large-scale shared file system. We envision that this OLCF case study will provide valuable insights for the design, operation, and management of storage systems at scale, and also encourage other HPC centers to undertake similar such efforts.CCS CONCEPTS•Software and its engineering →File systems management; •Information systems →Distributed StOrage; •General and reference →Measurement;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"395 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133465709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transactional NVM Cache with High Performance and Crash Consistency","authors":"Q. Wei, Chundong Wang, Cheng Chen, Yechao Yang, Jun Yang, Mingdi Xue","doi":"10.1145/3126908.3126940","DOIUrl":"https://doi.org/10.1145/3126908.3126940","url":null,"abstract":"The byte-addressable non-volatile memory (NVM) is new promising storage medium. Compared to NAND flash memory, the next-generation NVM not only preserves the durability of stored data but has much shorter access latencies. An architect can utilize the fast and persistent NVM as an external disk cache. Regarding the system’s crash consistency, a prevalent journaling file system needs to run atop an NVM disk cache. However, the performance is severely impaired by redundant efforts in achieving crash consistency in both file system and disk cache. Therefore, we propose a new mechanism called transactional NVM disk cache (Tinca). In brief, Tinca jointly guarantees consistency of file system and disk cache and removes the performance penalty of file system journaling with a lightweight transaction scheme. Evaluations confirm that Tinca significantly outperforms state-of-the-art design by up to $2.5 times$ in local and cluster tests without causing any inconsistency issue. CCS CONCEPTS • Information systems $rightarrow$ Storage class memory; • Software andits engineering $rightarrow$ Consistency; File systems management; Operating systems;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128474326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Geometric Multigrid Method Computation using a DSL Approach","authors":"Vinay Vasista, Kumudha Narasimhan, Siddharth Bhat, Uday Bondhugula","doi":"10.1145/3126908.3126968","DOIUrl":"https://doi.org/10.1145/3126908.3126968","url":null,"abstract":"The Geometric Multigrid (GMG) method is widely used in numerical analysis to accelerate the convergence of partial differential equations solvers using a hierarchy of grid discretizations. Multiple grid sizes and recursive expression of multigrid cycles make the task of program optimization tedious. A high-level language that aids domain experts for GMG with effective optimization and parallelization support is thus valuable. We demonstrate how high performance can be achieved along with enhanced programmability for GMG, with new language/optimization support in the PolyMage DSL framework. We compare our approach with (a) hand-optimized code, (b) hand-optimized code in conjunction with polyhedral optimization techniques, and (c) the existing PolyMage optimizer adapted to multigrid. We use benchmarks varying in multigrid cycle structure and smoothing steps for evaluation. On a 24-core Intel Xeon Haswell multicore system, our automatically optimized codes achieve a mean improvement of 3. 2x over straightforward parallelization, and 1. 31x over the PolyMage optimizer.CCS CONCEPTS• Software and its engineering →Compilers;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116851954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low Communication FMM-Accelerated FFT on GPUs","authors":"C. Cecka","doi":"10.1145/3126908.3126919","DOIUrl":"https://doi.org/10.1145/3126908.3126919","url":null,"abstract":"Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1. 3$times$ and 2. 2$times$ against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129938113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kenneth Raffenetti, A. Amer, Lena Oden, C. Archer, Wesley Bland, H. Fujita, Yanfei Guo, T. Janjusic, D. Durnov, M. Blocksome, Min Si, Sangmin Seo, Akhil Langer, G. Zheng, Masamichi Takagi, Paul K. Coffman, Jithin Jose, S. Sur, Alexander Sannikov, Sergey Oblomov, Michael Chuvelev, M. Hatanaka, Xin Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, P. Balaji
{"title":"Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1","authors":"Kenneth Raffenetti, A. Amer, Lena Oden, C. Archer, Wesley Bland, H. Fujita, Yanfei Guo, T. Janjusic, D. Durnov, M. Blocksome, Min Si, Sangmin Seo, Akhil Langer, G. Zheng, Masamichi Takagi, Paul K. Coffman, Jithin Jose, S. Sur, Alexander Sannikov, Sergey Oblomov, Michael Chuvelev, M. Hatanaka, Xin Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, P. Balaji","doi":"10.1145/3126908.3126963","DOIUrl":"https://doi.org/10.1145/3126908.3126963","url":null,"abstract":"This paper provides an in-depth analysis of the software overheads in the MPI performance-critical path and exposes mandatory performance overheads that are unavoidable based on the MPI-3.1 specification. We first present a highly optimized implementation of the MPI-3.1 standard in which the communication stack-all the way from the application to the low-level network communication API-takes only a few tens of instructions. We carefully study these instructions and analyze the root cause of the overheads based on specific requirements from the MPI standard that are unavoidable under the current MPI standard. We recommend potential changes to the MPI standard that can minimize these overheads. Our experimental results on a variety of network architectures and applications demonstrate significant benefits from our proposed changes. CCS CONCEPTS • Computing methodologies $rightarrow$ Concurrent algorithms; Massively parallel algorithms;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123800493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marc S. Orr, Shuai Che, Bradford M. Beckmann, M. Oskin, S. Reinhardt, D. Wood
{"title":"Gravel: Fine-Grain GPU-Initiated Network Messages","authors":"Marc S. Orr, Shuai Che, Bradford M. Beckmann, M. Oskin, S. Reinhardt, D. Wood","doi":"10.1145/3126908.3126914","DOIUrl":"https://doi.org/10.1145/3126908.3126914","url":null,"abstract":"Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces programmers to manually route messages through the host CPU. Other models allow GPU-initiated communication, but are inefficient for small messages. To enable fine-grain PGAS-style communication between threads executing on different GPUs, we introduce Gravel. GPU-initiated messages are offloaded through a GPU-efficient concurrent queue to an aggregator (implemented with CPU threads), which combines messages targeting to the same destination. Gravel leverages diverged work-group-level semantics to amortize synchronization across the GPU’s data-parallel lanes. Using Gravel, we can distribute six applications, each with frequent small messages, across a cluster of eight GPU-accelerated nodes. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show Gravel is more programmable and usually performs better than prior GPU networking models. CCS CONCEPTS Computer methodologies→Massively parallel algorithms;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127123525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Oliveira, L. Pilla, Nathan Debardeleben, S. Blanchard, H. Quinn, I. Koren, P. Navaux, P. Rech
{"title":"Experimental and Analytical Study of Xeon Phi Reliability","authors":"Daniel Oliveira, L. Pilla, Nathan Debardeleben, S. Blanchard, H. Quinn, I. Koren, P. Navaux, P. Rech","doi":"10.1145/3126908.3126960","DOIUrl":"https://doi.org/10.1145/3126908.3126960","url":null,"abstract":"We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon Phi processors based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of Xeon Phi, we quantify Silent Data Corruption (SDCs) by correlating the distribution of corrupted elements in the output to the application’s characteristics. We evaluate the benefits of imprecise computing for reducing the programs’ error rate. For example, for HotSpot a 0.5% tolerance in the output value reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by different criticalities. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than the remaining portions. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions.CCS CONCEPTS• Computer systems organization → Parallel architectures; • Hardware → Fault tolerance;","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115477890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios","authors":"H. Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang, Xiaofei Chen","doi":"10.1145/3126908.3126910","DOIUrl":"https://doi.org/10.1145/3126908.3126910","url":null,"abstract":"This paper reports our large-scale nonlinear earthquake simulation software on Sunway TaihuLight. Our innovations include: (1) a customized parallelization scheme that employs the 10 million cores efficiently at both the process and the thread levels; (2) an elaborate memory scheme that integrates on-chip halo exchange through register communcation, optimized blocking configuration guided by an analytic model, and coalesced DMA access with array fusion; (3) on-the-fly compression that doubles the maximum problem size and further improves the performance by 24%. With these innovations to remove the memory constraints of Sunway TaihuLight, our software achieves over 15% of the system’s peak, better than the 11.8% efficiency achieved by a similar software running on Titan, whose byte to flop ratio is 5 times better than TaihuLight. The extreme cases demonstrate a sustained performance of over 18.9 Pflops, enabling the simulation of Tangshan earthquake as an 18-Hz scenario with an 8-meter resolution.","PeriodicalId":204241,"journal":{"name":"SC17: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116880416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}