{"title":"Scalable, fault-tolerant job step management for high-performance systems","authors":"D. Solt;J. Hursey;A. Lauria;D. Guo;X. Guo","doi":"10.1147/JRD.2019.2958909","DOIUrl":"https://doi.org/10.1147/JRD.2019.2958909","url":null,"abstract":"Scientific applications on the CORAL systems demanded a fault-tolerant, scalable job launch infrastructure for complex workflows with multiple job steps within an allocation. The distinct design of IBM's Job Step Manager (JSM) infrastructure, working in concert with Load Sharing Facility (LSF) and Cluster System Management (CSM), achieves these goals. JSM demonstrated launching over three-quarters of a million processes in under a minute while providing efficient process management interface for exascale (PMIx)-based services to communication libraries, such as parallel active messaging interface and message passing interface, and tools over the management network. JSM relies on the parallel task support library to provide a fault-tolerant, scalable communication medium between the JSM daemons. Application workflows using job steps harness the unique resource set abstraction concept in JSM to manage CPUs, GPUs, and memory between groups of processes, possibly in discrete job steps, sharing a node. The resource set concept gives JSM the opportunity to better organize process placement to optimize, for example, CPU-to-GPU communication. Applications that need complete control over the shaping of the resource sets and the placement, binding, and ordering of processes within them can leverage JSM's co-designed Explicit Resource File mechanism. This article explores the design decisions, implementation considerations, and performance optimizations of IBM's JSM infrastructure to support scientific discovery on the CORAL systems.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2958909","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An open-source solution to performance portability for Summit and Sierra supercomputers","authors":"G. T. Bercea;A. Bataev;A. E. Eichenberger;C. Bertolli;J. K. O'Brien","doi":"10.1147/JRD.2019.2955944","DOIUrl":"https://doi.org/10.1147/JRD.2019.2955944","url":null,"abstract":"Programming models that use a higher level of abstraction to express parallelism can target both CPUs and any attached devices, alleviating the maintainability and portability concerns facing today's heterogeneous systems. This article describes the design, implementation, and delivery of a compliant OpenMP device offloading implementation for IBM-NVIDIA heterogeneous servers composing the Summit and Sierra supercomputers in the mainline open-source Clang/LLVM compiler and OpenMP runtime projects. From a performance perspective, reconciling the GPU programming model, best suited for massively parallel workloads, with the generality of the OpenMP model was a significant challenge. To achieve both high performance and full portability, we map high-level programming patterns to fine-tuned code generation schemes and customized runtimes that preserve the OpenMP semantics. In the compiler, we implement a low-overhead single-program multiple-data scheme that leverages the GPU native execution model and a fallback scheme to support the generality of OpenMP. Modular design enables the implementation to be extended with new schemes for frequently occurring patterns. Our implementation relies on key optimizations: sharing data among threads, leveraging unified memory, aggressive inlining of runtime calls, memory coalescing, and runtime simplification. We show that for commonly used patterns, performance on the Summit and Sierra GPUs matches that of hand-written native CUDA code.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2955944","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Umpire: Application-focused management and coordination of complex hierarchical memory","authors":"D. A. Beckingsale;M. J. McFadden;J. P. S. Dahm;R. Pankajakshan;R. D. Hornung","doi":"10.1147/JRD.2019.2954403","DOIUrl":"https://doi.org/10.1147/JRD.2019.2954403","url":null,"abstract":"Advanced architectures like Sierra provide a wide range of memory resources that must often be carefully controlled by the user. These resources have varying capacities, access timing rules, and visibility to different compute resources. Applications must intelligently allocate data in these spaces, and depending on the total amount of memory required, applications may also be forced to move data between different parts of the memory hierarchy. Finally, applications using multiple packages must coordinate effectively to ensure that each package can use the memory resources it needs. To address these challenges, we present Umpire, an application-oriented library for managing memory resources. Specifically, Umpire provides support for querying memory resources, provisioning and allocating memory, and memory introspection. It allows computer scientists and computational physicists to efficiently program the memory hierarchies of current and future high-performance computing architectures, without tying their application to specific hardware or software. In this article, we describe the design and implementation of Umpire and present case studies from the integration of Umpire into applications that are currently running on Sierra.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2954403","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Disaster management in the digital age","authors":"J. W. Talley","doi":"10.1147/JRD.2019.2954412","DOIUrl":"https://doi.org/10.1147/JRD.2019.2954412","url":null,"abstract":"The United States is one of the most natural disaster-prone countries in the world. Since 1980, there have been 246 weather and climate disasters exceeding $1.6 trillion in remediation. Within the last decade, the frequency of disaster events and their costs have been on the rise. Complicating the impact of natural disasters is the population shift to cities and coastal areas, which concentrate their effects. The need for governments and communities to prepare for, respond to, and recover from disasters is greater than ever before. Disaster management is a big data problem that requires a public-private partnership solution. Technology is the connection that can link end-to-end capabilities across multiple organizations for disaster management in the digital age. But how can technologies like cloud, artificial intelligence (AI), and predictive analytics be leveraged across all aspects of the disaster management life cycle? This article briefly addresses this question and more. Two case studies and technology spotlights are used to reinforce discussion around traditional and new approaches to the management of natural disasters.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2954412","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49986744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emergencies do not stop at night: Advanced analysis of displacement based on satellite-derived nighttime light observations","authors":"M. Enenkel;R. M. Shrestha;E. Stokes;M. Román;Z. Wang;M. T. M. Espinosa;I. Hajzmanova;J. Ginnetti;P. Vinck","doi":"10.1147/JRD.2019.2954404","DOIUrl":"https://doi.org/10.1147/JRD.2019.2954404","url":null,"abstract":"Around 68.5 million people are currently forcibly displaced. The implementation and monitoring of international agreements, which are linked to the 2030 agenda (e.g., the Sendai Framework), require a standard set of metrics for internal displacement. Since nationally owned, validated, and credible data are difficult to obtain, new approaches are needed. This article aims to support the monitoring of displacement via satellite-derived observations of nighttime lights (NTL) from NASA's Black Marble product suite along with a short message service (SMS)-based emergency survey after Cyclone Idai had made landfall in Beira, Mozambique, in March 2019. Under certain conditions, the spatial extent of power outages can serve as a proxy for disaster impacts and a potential driver for displacement. Hence, information about anomalies in NTL has the potential to support humanitarian decision-making via estimations of people affected or the coordination of rapid response teams. Despite initial issues related to cloud cover, we find that around 90% of Beira's power grid had been affected. In collaboration with the Internal Displacement Monitoring Center, we use these findings to establish a framework that links NTL observations with existing humanitarian decision-making workflows to complement ground-based survey data and other satellite-derived information, such as flood or damage maps.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2954404","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49953418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preface: Hardware for AI","authors":"","doi":"10.1147/JRD.2019.2945553","DOIUrl":"https://doi.org/10.1147/JRD.2019.2945553","url":null,"abstract":"","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2945553","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41516348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preface: AI Ethics","authors":"","doi":"10.1147/JRD.2019.2944775","DOIUrl":"https://doi.org/10.1147/JRD.2019.2944775","url":null,"abstract":"","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2944775","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48612109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cover","authors":"","doi":"10.1147/JRD.2019.2948185","DOIUrl":"https://doi.org/10.1147/JRD.2019.2948185","url":null,"abstract":"","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/5288520/8894910/08895604.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cover","authors":"","doi":"10.1147/JRD.2019.2948187","DOIUrl":"https://doi.org/10.1147/JRD.2019.2948187","url":null,"abstract":"","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/5288520/8894910/08894911.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Culture and cognition: Understanding public perceptions of risk and (in)action","authors":"T. Allen;E. Wells;K. Klima","doi":"10.1147/JRD.2019.2952330","DOIUrl":"https://doi.org/10.1147/JRD.2019.2952330","url":null,"abstract":"Much is known about the effects of risk on behavior and communication, yet little research has considered how these risks influence modes of cultural and cognitive processing dynamics that underlie public perceptions, communications, and social (in)action. This article presents a psychological model of risk communications that demonstrates how cognitive structure, cultural schema, and environment awareness could be combined to improve risk communication. We illustrate the model's explanatory value with two qualitative case studies: one on decision-makers facing extreme heat, and another on homeowners facing flood events. Consistent with the model predictions, we find that cognitive structure, cultural schema, and environment awareness dynamics are not only necessary determinants to strengthen risk communications, but also important for understanding perceptions of risk and people's (in)action to engage in mitigation and adoption efforts. This suggests that decision-makers hoping to reduce disaster risk or improve disaster resilience may wish to consider how these three dynamics exist and interact.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2019-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2952330","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49986749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}