"Sierra Center of Excellence: Lessons learned"
J. P. Dahm, D. F. Richards, A. Black, A. D. Bertsch, L. Grinberg, I. Karlin, S. Kokkila-Schumacher, E. A. León, J. R. Neely, R. Pankajakshan, and O. Pearce
IBM Journal of Research and Development, 2019-12-20. DOI: https://doi.org/10.1147/JRD.2019.2961069

Abstract: The introduction of heterogeneous GPU computing with the Sierra architecture represented a significant shift in direction for computational science at Lawrence Livermore National Laboratory (LLNL) and therefore required significant preparation. Over the last five years, the Sierra Center of Excellence (CoE) has brought employees with specific expertise from IBM and NVIDIA together with LLNL in a concentrated effort to prepare applications, system software, and tools for the Sierra supercomputer. This article shares the process we applied in the CoE and documents lessons learned during the collaboration, in the hope that others will be able to learn from both our successes and our intermediate setbacks. We describe what we have found to work in managing such a collaboration, as well as best practices for algorithms and source code, system configuration and software stack, tools, and application performance.

{"title":"Transformation of application enablement tools on CORAL systems","authors":"S. Maerean;E. K. Lee;H.-F. Wen;I-H. Chung","doi":"10.1147/JRD.2019.2960246","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960246","url":null,"abstract":"The CORAL project exhibits an important shift in the computational paradigm from homogeneous to heterogeneous computing, where applications run on both the CPU and the accelerator (e.g., GPU). Existing applications optimized to run only on the CPU have to be rewritten to adopt accelerators and retuned to achieve optimal performance. The shift in the computational paradigm requires application development tools (e.g., compilers, performance profilers and tracers, and debuggers) change to better assist users. The CORAL project places a strong emphasis on open-source tools to create a collaborative environment in the tools community. In this article, we discuss the collaboration efforts and corresponding challenges to meet the CORAL requirements on tools and detail three of the challenges that required the most involvement. A usage scenario is provided to show how the tools may help users adopt the new computation environment and understand their application execution and the data flow at scale.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960246","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Call for Code: Developers tackle natural disasters with software","authors":"D. Krook;S. Malaika","doi":"10.1147/JRD.2019.2960241","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960241","url":null,"abstract":"Natural disasters are increasing as highlighted in many reports including the Borgen Project. In 2018, David Clark Cause as creator and IBM as founding partner, in partnership with the United Nations Human Rights Office, the American Red Cross International Team, and The Linux Foundation, issued a “Call for Code” to developers to create robust projects that prepare communities for natural disasters and help them respond more quickly in their aftermath. This article covers the steps and tools used to engage with developers, the results from the first of five competitions to be run by the Call for Code Global Initiative over five years, and how the winners were selected. Insights from the mobilization of 100,000 developers toward this cause are described, as well as the lessons learned from running large-scale hackathons.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960241","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49980047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A unique approach to corporate disaster philanthropy focused on delivering technology and expertise"
R. E. Curzon, P. Curotto, M. Evason, A. Failla, P. Kusterer, A. Ogawa, J. Paraszczak, and S. Raghavan
IBM Journal of Research and Development, 2019-12-17. DOI: https://doi.org/10.1147/JRD.2019.2960244

Abstract: The role of corporations and their corporate social responsibility (CSR)-related response to disasters in support of their communities has not been extensively documented; thus, this article attempts to explain the role that one corporation, IBM, has played in disaster response and how it has used IBM and open-source technologies to deal with a broad range of disasters. These technologies range from advanced seismic monitoring and flood management to predicting and improving refugee flows. The article outlines various principles that have guided IBM in shaping its disaster response and provides some insights into various sources of useful data and applications that can be used in these critical situations. It also details one example of an emerging technology that is being used in these efforts.

{"title":"The CORAL supercomputer systems","authors":"W. A. Hanson","doi":"10.1147/JRD.2019.2960220","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960220","url":null,"abstract":"In 2014, the U.S. Department of Energy (DoE) initiated a multiyear collaboration between Oak Ridge National Laboratory (ORNL), Argonne National Laboratory, and Lawrence Livermore National Laboratory (LLNL), known as “CORAL,” the next major phase in the DoE's scientific computing roadmap. The IBM CORAL systems are based on a fundamentally new data-centric architecture, where compute power is embedded everywhere data resides, combining powerful central processing units (CPUs) with graphics processing units (GPUs) optimized for scientific computing and artificial intelligence workloads. The IBM CORAL systems were built on the combination of mature technologies: 9th-generation POWER CPU, 6th-generation NVIDIA GPU, and 5th-generation Mellanox InfiniBand. These systems are providing scientists with computing power to solve challenges in many research areas beyond previously possible. This article provides an overview of the system solutions deployed at ORNL and LLNL.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960220","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49978542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Porting a 3D seismic modeling code (SW4) to CORAL machines","authors":"R. Pankajakshan;P.-H. Lin;B. Sjögreen","doi":"10.1147/JRD.2019.2960218","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960218","url":null,"abstract":"Seismic waves fourth order (SW4) solves the seismic wave equations on Cartesian and curvilinear grids using large compute clusters with O (100,000) cores. This article discusses the porting of SW4 to run on the CORAL architecture using the RAJA performance portability abstraction layer. The performances of key kernels using RAJA and CUDA are compared to estimate the performance penalty of using the portability abstraction layer. Code changes required for efficiency on GPUs and minimizing time spent in Message Passing Interface (MPI) are discussed. This article describes a path for efficiently porting large code bases to GPU-based machines while avoiding the pitfalls of a new architecture in the early stages of its deployment. Current bottlenecks in the code are discussed along with possible architectural or software mitigations. SW4 runs 28× faster on one 4-GPU CORAL node than on a CTS-1 node (Dual Intel Xeon E5-2695 v4). SW4 is now in routine use on problems of unprecedented resolution (203 billion grid points) and scale on 1,200 nodes of Summit.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960218","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Hybrid CPU/GPU tasks optimized for concurrency in OpenMP"
A. E. Eichenberger, G.-T. Bercea, A. Bataev, L. Grinberg, and J. K. O'Brien
IBM Journal of Research and Development, 2019-12-17. DOI: https://doi.org/10.1147/JRD.2019.2960245

Abstract: The Sierra and Summit supercomputers exhibit a significant amount of intranode parallelism between the host POWER9 CPUs and their attached GPU devices. In this article, we show that exploiting device-level parallelism is key to achieving high performance, by reducing the overheads typically associated with CPU and GPU task execution. Moreover, manually exploiting this type of parallelism in large-scale applications is nontrivial and error-prone. We hide the complexity of exploiting this hybrid intranode parallelism behind the OpenMP programming model abstraction. The implementation leverages the semantics of OpenMP tasks to express asynchronous task computations and their associated dependences. Launching tasks on the CPU threads requires a careful design of work-stealing algorithms to provide efficient load balancing among CPU threads. We propose a novel algorithm that removes locks from all task-queueing operations that are on the critical path. Tasks assigned to GPU devices require additional steps, such as copying input data to GPU devices, launching the computation kernels, and copying data back to the host CPU memory. We perform key optimizations to reduce the cost of these additional steps by tightly integrating data transfers and GPU computations into streams of asynchronous GPU operations. We also map high-level dependences between GPU tasks to the same asynchronous GPU streams to avoid unnecessary synchronization. Results validate our approach.

{"title":"Quantitative modeling in disaster management: A literature review","authors":"A. E. Baxter;H. E. Wilborn Lagerman;P. Keskinocak","doi":"10.1147/JRD.2019.2960356","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960356","url":null,"abstract":"The number, magnitude, complexity, and impact of natural disasters have been steadily increasing in various parts of the world. When preparing for, responding to, and recovering from a disaster, multiple organizations make decisions and take actions considering the needs, available resources, and priorities of the affected communities, emergency supply chains, and infrastructures. Most of the prior research focuses on decision-making for independent systems (e.g., single critical infrastructure networks or distinct relief resources). An emerging research area extends the focus to interdependent systems (i.e., multiple dependent networks or resources). In this article, we survey the literature on modeling approaches for disaster management problems on independent systems, discuss some recent work on problems involving demand, resource, and/or network interdependencies, and offer future research directions to add to this growing research area.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960356","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49980046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Troubleshooting deep-learner training data problems using an evolutionary algorithm on Summit","authors":"M. Coletti;A. Fafard;D. Page","doi":"10.1147/JRD.2019.2960225","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960225","url":null,"abstract":"Architectural and hyperparameter design choices can influence deep-learner (DL) model fidelity but can also be affected by malformed training and validation data. However, practitioners may spend significant time refining layers and hyperparameters before discovering that distorted training data were impeding the training progress. We found that an evolutionary algorithm (EA) can be used to troubleshoot this kind of DL problem. An EA evaluated thousands of DL configurations on Summit that yielded no overall improvement in DL performance, which suggested problems with the training and validation data. We suspected that contrast limited adaptive histogram equalization enhancement that was applied to previously generated digital surface models, for which we were training DLs to find errors, had damaged the training data. Subsequent runs with an alternative global normalization yielded significantly improved DL performance. However, the DL intersection over unions still exhibited consistent subpar performance, which suggested further problems with the training data and DL approach. Nonetheless, we were able to diagnose this problem within a 12-hour span via Summit runs, which prevented several weeks of unproductive trial-and-error DL configuration refinement and allowed for a more timely convergence on an ultimately viable solution.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960225","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Summit and Sierra supercomputer cooling solutions","authors":"S. Tian;T. Takken;V. Mahaney;C. Marroquin;M. Schultz;M. Hoffmeyer;Y. Yao;K. O'Connell;A. Yuksel;P. Coteus","doi":"10.1147/JRD.2019.2958902","DOIUrl":"https://doi.org/10.1147/JRD.2019.2958902","url":null,"abstract":"Achieving optimal data center cooling efficiency requires effective water cooling of high-heat-density components, coupled with optimal warmer water temperatures and the correct order of water preheating from any air-cooled components. The Summit and Sierra supercomputers implemented efficient cooling by using high-performance cold plates to directly water-cool all central processing units (CPUs) and graphics processing units (GPUs) processors with warm inlet water. Cost performance was maximized by directly air-cooling the 10% to 15% of the compute drawer heat load generated by the lowest heat density components. For the Summit system, a rear-door heat exchanger allowed zero net heat load to air; the overall system efficiency was optimized by using the preheated water from the heat exchanger as an input to cool the higher power CPUs and GPUs.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2958902","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}