Feasibility of Running Singularity Containers with Hybrid MPI on NASA High-End Computing Resources

Y. Chang, S. Heistand, R. Hood, Henry Jin
{"title":"Feasibility of Running Singularity Containers with Hybrid MPI on NASA High-End Computing Resources","authors":"Y. Chang, S. Heistand, R. Hood, Henry Jin","doi":"10.1109/CANOPIEHPC54579.2021.00007","DOIUrl":null,"url":null,"abstract":"This work investigates the feasibility of a Singularity container-based solution to support a customizable computing environment for running users' MPI applications in “hybrid” MPI mode-where the MPI on the host machine works in tandem with MPI inside the container-on NASA's High-End Computing Capability (HECC) resources. Two types of real-world applications were tested: traditional High-Performance Computing (HPC) and Artificial Intelligence/Machine Learning (AI/ML). On the traditional HPC side, two JEDI containers built with Intel MPI for Earth science modeling were tested on both HECC in-house and HECC AWS Cloud CPU resources. On the AI/ML side, a NVIDIA TensorFlow container built with OpenMPI was tested with a Neural Collaborative Filtering recommender system and the ResNet-50 computer image system on the HECC in-house V100 GPUs. For each of these applications and resource environments, multiple hurdles were overcome after lengthy debugging efforts. Among them, the most significant ones were due to the conflicts between a host MPI and a container MPI and the complexity of the communication layers underneath. Although porting containers to run with a single node using just the container MPI is quite straightforward, our exercises demonstrate that running across multiple nodes in hybrid MPI mode requires knowledge of Singularity, MPI libraries, the operating system image, and the communication infrastructure such as the transport and network layers, which are traditionally handled by support staff of HPC centers and hardware or software vendors. In conclusion, porting and running Singularity containers on HECC resources or other data centers with similar environments is feasible but most users would need help to run them in hybrid MPI mode.","PeriodicalId":237957,"journal":{"name":"2021 3rd International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 3rd International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CANOPIEHPC54579.2021.00007","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

This work investigates the feasibility of a Singularity container-based solution that provides a customizable computing environment for running users' MPI applications in "hybrid" MPI mode, where the MPI on the host machine works in tandem with the MPI inside the container, on NASA's High-End Computing Capability (HECC) resources. Two types of real-world applications were tested: traditional High-Performance Computing (HPC) and Artificial Intelligence/Machine Learning (AI/ML). On the traditional HPC side, two JEDI containers built with Intel MPI for Earth science modeling were tested on both HECC in-house and HECC AWS Cloud CPU resources. On the AI/ML side, an NVIDIA TensorFlow container built with OpenMPI was tested with a Neural Collaborative Filtering recommender system and the ResNet-50 image classification model on the HECC in-house V100 GPUs. For each of these applications and resource environments, multiple hurdles had to be overcome after lengthy debugging efforts. The most significant were conflicts between the host MPI and the container MPI, and the complexity of the communication layers underneath. Although porting a container to run on a single node using only the container MPI is straightforward, our exercises demonstrate that running across multiple nodes in hybrid MPI mode requires knowledge of Singularity, MPI libraries, the operating system image, and the communication infrastructure such as the transport and network layers, which are traditionally handled by the support staff of HPC centers and by hardware or software vendors. In conclusion, porting and running Singularity containers on HECC resources, or at other data centers with similar environments, is feasible, but most users would need help to run them in hybrid MPI mode.
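The hybrid MPI mode described above is commonly realized by having the host's MPI launcher start one process per rank, with each rank executing the application inside the Singularity container; the container's MPI library services the MPI calls while the host MPI handles process management and wire-up. The following is a minimal sketch of that launch pattern, written as a small Python wrapper; the container image, application path, bind directory, and rank count are hypothetical placeholders and are not taken from the paper.

```python
#!/usr/bin/env python3
"""Sketch of a hybrid MPI launch: host mpiexec starts the ranks, and each
rank runs the application inside a Singularity container that carries its
own MPI library. All names below are hypothetical placeholders."""
import subprocess

NUM_RANKS = 4                      # hypothetical rank count
IMAGE = "jedi-intel-mpi.sif"       # hypothetical container image
BIND = "/nobackup/data"            # hypothetical host directory to bind-mount
APP = "/opt/jedi/bin/model.x"      # hypothetical MPI application inside the image

# Host mpiexec launches the ranks; "singularity exec" runs each rank
# inside the container, where the container's MPI resolves the MPI calls.
cmd = [
    "mpiexec", "-np", str(NUM_RANKS),
    "singularity", "exec", "--bind", BIND, IMAGE, APP,
]

print("Launching:", " ".join(cmd))
subprocess.run(cmd, check=True)
```

Whether such a launch actually works across nodes depends on the compatibility of the host and container MPI libraries and on the underlying transport and network layers, which is exactly the class of issues the paper reports debugging.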