SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, latest publications (page 2)

Distributed Southwell: An Iterative Method with Low Communication Costs

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126966

Jordi Wolfson-Pou, Edmond Chow

Citations: 7

Tessellating Stencils

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126920

Liang Yuan, Yunquan Zhang, Peng Guo, Shan Huang

Citations: 14

Scientific User Behavior and Data-Sharing Trends in a Petascale File System

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126924

Seung-Hwan Lim, Hyogi Sim, Raghul Gunasekaran, Sudharshan S. Vazhkudai

Abstract: The Oak Ridge Leadership Computing Facility (OLCF) runs the No. 4 supercomputer in the world, supported by a petascale file system, to facilitate scientific discovery. In this paper, using the daily file system metadata snapshots collected over 500 days, we have studied the behavioral trends of 1,362 active users and 380 projects across 35 science domains. In particular, we have analyzed both individual and collective behavior of users and projects, highlighting needs from individual communities and the overall requirements to operate the file system. We have analyzed the metadata across three dimensions, namely (i) the projects' file generation and usage trends, using quantitative file system-centric metrics, (ii) scientific user behavior on the file system, and (iii) the data sharing trends of users and projects. To the best of our knowledge, our work is the first of its kind to provide comprehensive insights on user behavior from multiple science domains through metadata analysis of a large-scale shared file system. We envision that this OLCF case study will provide valuable insights for the design, operation, and management of storage systems at scale, and also encourage other HPC centers to undertake similar efforts.

CCS Concepts: Software and its engineering → File systems management; Information systems → Distributed storage; General and reference → Measurement.

Citations: 19

Transactional NVM Cache with High Performance and Crash Consistency

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126940

Q. Wei, Chundong Wang, Cheng Chen, Yechao Yang, Jun Yang, Mingdi Xue

Abstract: Byte-addressable non-volatile memory (NVM) is a promising new storage medium. Compared to NAND flash memory, the next-generation NVM not only preserves the durability of stored data but also has much shorter access latencies. An architect can utilize the fast and persistent NVM as an external disk cache. To ensure the system's crash consistency, a prevalent journaling file system needs to run atop an NVM disk cache. However, performance is severely impaired by redundant efforts to achieve crash consistency in both the file system and the disk cache. Therefore, we propose a new mechanism called transactional NVM disk cache (Tinca). In brief, Tinca jointly guarantees consistency of the file system and the disk cache and removes the performance penalty of file system journaling with a lightweight transaction scheme. Evaluations confirm that Tinca outperforms the state-of-the-art design by up to 2.5x in local and cluster tests without causing any inconsistency issue.

CCS Concepts: Information systems → Storage class memory; Software and its engineering → Consistency; File systems management; Operating systems.

Citations: 16

Optimizing Geometric Multigrid Method Computation using a DSL Approach

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126968

Vinay Vasista, Kumudha Narasimhan, Siddharth Bhat, Uday Bondhugula

Abstract: The Geometric Multigrid (GMG) method is widely used in numerical analysis to accelerate the convergence of partial differential equation solvers using a hierarchy of grid discretizations. Multiple grid sizes and the recursive expression of multigrid cycles make the task of program optimization tedious. A high-level language that aids domain experts for GMG with effective optimization and parallelization support is thus valuable. We demonstrate how high performance can be achieved along with enhanced programmability for GMG, with new language/optimization support in the PolyMage DSL framework. We compare our approach with (a) hand-optimized code, (b) hand-optimized code in conjunction with polyhedral optimization techniques, and (c) the existing PolyMage optimizer adapted to multigrid. We use benchmarks varying in multigrid cycle structure and smoothing steps for evaluation. On a 24-core Intel Xeon Haswell multicore system, our automatically optimized codes achieve a mean improvement of 3.2x over straightforward parallelization, and 1.31x over the PolyMage optimizer.

CCS Concepts: Software and its engineering → Compilers.

Citations: 6

Low Communication FMM-Accelerated FFT on GPUs

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126919

C. Cecka

Citations: 8

Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126963

Kenneth Raffenetti, A. Amer, Lena Oden, C. Archer, Wesley Bland, H. Fujita, Yanfei Guo, T. Janjusic, D. Durnov, M. Blocksome, Min Si, Sangmin Seo, Akhil Langer, G. Zheng, Masamichi Takagi, Paul K. Coffman, Jithin Jose, S. Sur, Alexander Sannikov, Sergey Oblomov, Michael Chuvelev, M. Hatanaka, Xin Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, P. Balaji

Abstract: This paper provides an in-depth analysis of the software overheads in the MPI performance-critical path and exposes mandatory performance overheads that are unavoidable based on the MPI-3.1 specification. We first present a highly optimized implementation of the MPI-3.1 standard in which the communication stack, all the way from the application to the low-level network communication API, takes only a few tens of instructions. We carefully study these instructions and analyze the root cause of the overheads based on specific requirements from the MPI standard that are unavoidable under the current MPI standard. We recommend potential changes to the MPI standard that can minimize these overheads. Our experimental results on a variety of network architectures and applications demonstrate significant benefits from our proposed changes.

CCS Concepts: Computing methodologies → Concurrent algorithms; Massively parallel algorithms.

Citations: 24

Gravel: Fine-Grain GPU-Initiated Network Messages

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126914

Marc S. Orr, Shuai Che, Bradford M. Beckmann, M. Oskin, S. Reinhardt, D. Wood

Abstract: Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces programmers to manually route messages through the host CPU. Other models allow GPU-initiated communication, but are inefficient for small messages. To enable fine-grain PGAS-style communication between threads executing on different GPUs, we introduce Gravel. GPU-initiated messages are offloaded through a GPU-efficient concurrent queue to an aggregator (implemented with CPU threads), which combines messages targeting the same destination. Gravel leverages diverged work-group-level semantics to amortize synchronization across the GPU's data-parallel lanes. Using Gravel, we can distribute six applications, each with frequent small messages, across a cluster of eight GPU-accelerated nodes. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show Gravel is more programmable and usually performs better than prior GPU networking models.

CCS Concepts: Computing methodologies → Massively parallel algorithms.

Citations: 6
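The Gravel abstract above describes an aggregator that combines small messages headed for the same destination. The following is only an illustrative sketch of that general idea in plain Python, not Gravel's implementation: the function name `aggregate`, the `batch_size` parameter, and the flush policy are all assumptions made for the example, and nothing here models the GPU-side queue.

```python
from collections import defaultdict


def aggregate(messages, batch_size):
    """Group (destination, payload) pairs into per-destination batches.

    Hypothetical helper: a batch is emitted as soon as a destination has
    accumulated batch_size payloads; leftovers are flushed at the end.
    """
    pending = defaultdict(list)
    batches = []
    for dest, payload in messages:
        pending[dest].append(payload)
        if len(pending[dest]) == batch_size:
            # Full batch: send it as one larger message.
            batches.append((dest, pending.pop(dest)))
    # Flush partially filled batches so no message is lost.
    for dest, payloads in pending.items():
        batches.append((dest, payloads))
    return batches


# Five small messages to two destinations (made-up data).
small_messages = [(1, "a"), (2, "b"), (1, "c"), (1, "d"), (2, "e")]
batches = aggregate(small_messages, batch_size=2)
print(batches)  # [(1, ['a', 'c']), (2, ['b', 'e']), (1, ['d'])]
```

In this toy run the five small messages become three sends, which is the kind of amortization the abstract attributes to aggregation.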

Experimental and Analytical Study of Xeon Phi Reliability

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126960

Daniel Oliveira, L. Pilla, Nathan Debardeleben, S. Blanchard, H. Quinn, I. Koren, P. Navaux, P. Rech

Abstract: We present an in-depth analysis of transient fault effects on HPC applications in Intel Xeon Phi processors based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of Xeon Phi, we quantify Silent Data Corruptions (SDCs) by correlating the distribution of corrupted elements in the output to the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rate. For example, for HotSpot a 0.5% tolerance in the output value reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by different criticalities. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than the remaining portions. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions.

CCS Concepts: Computer systems organization → Parallel architectures; Hardware → Fault tolerance.

Citations: 49
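The Xeon Phi abstract above mentions accepting outputs within a 0.5% tolerance. A minimal sketch of what such a tolerance check could look like follows; the function `count_mismatches` and the sample values are hypothetical and are not taken from the paper or its benchmarks.

```python
def count_mismatches(golden, observed, rel_tol=0.0):
    """Count output elements whose relative error exceeds rel_tol.

    With rel_tol=0.0 any difference from the golden output counts as
    an error; a positive rel_tol accepts small deviations.
    """
    mismatches = 0
    for g, o in zip(golden, observed):
        if abs(o - g) > rel_tol * abs(g):
            mismatches += 1
    return mismatches


# Made-up reference and faulty outputs for illustration only.
golden = [100.0, 200.0, 300.0, 400.0]
observed = [100.0, 200.5, 300.0, 440.0]

strict = count_mismatches(golden, observed)                  # exact comparison
relaxed = count_mismatches(golden, observed, rel_tol=0.005)  # 0.5% tolerance
print(strict, relaxed)  # 2 1
```

Here the 0.25% deviation in the second element is tolerated under the relaxed check, while the 10% deviation in the fourth still counts as an error.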

18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2017-11-12 DOI: 10.1145/3126908.3126910

H. Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang, Xiaofei Chen

Abstract: This paper reports our large-scale nonlinear earthquake simulation software on Sunway TaihuLight. Our innovations include: (1) a customized parallelization scheme that employs the 10 million cores efficiently at both the process and the thread levels; (2) an elaborate memory scheme that integrates on-chip halo exchange through register communication, optimized blocking configuration guided by an analytic model, and coalesced DMA access with array fusion; (3) on-the-fly compression that doubles the maximum problem size and further improves the performance by 24%. With these innovations to remove the memory constraints of Sunway TaihuLight, our software achieves over 15% of the system's peak, better than the 11.8% efficiency achieved by similar software running on Titan, whose byte-to-flop ratio is 5 times better than TaihuLight's. The extreme cases demonstrate a sustained performance of over 18.9 Pflops, enabling the simulation of the Tangshan earthquake as an 18-Hz scenario with an 8-meter resolution.

Citations: 96