Topology-Aware Data Aggregation for High Performance Collective MPI-IO on a Multi-core Cluster System
Y. Tsujita, A. Hori, Toyohisa Kameyama, Y. Ishikawa
2016 Fourth International Symposium on Computing and Networking (CANDAR), November 2016
DOI: 10.1109/CANDAR.2016.0022
Citations: 2
Abstract
Parallel I/O through MPI-IO is one of the ways to improve the performance of parallel applications built on MPI. ROMIO is a widely used MPI-IO implementation that improves collective I/O performance through an optimization called two-phase I/O. In this scheme, file I/O is delegated to a subset of, or all, MPI processes, called aggregators. Multiple CPUs or CPU cores make it possible to increase computing power by deploying multiple MPI processes per compute node, but such a deployment leads to poor I/O performance because ROMIO's aggregator layout is topology-unaware. In our previous work, an optimized aggregator layout suited to striped accesses on a Lustre file system improved I/O performance; however, because it ignored the MPI rank layout across compute nodes, its unbalanced communication load led to ineffective data aggregation. To minimize data aggregation time and further improve I/O performance, we introduce a topology-aware data aggregation scheme that accounts for the MPI rank layout across compute nodes. The proposed scheme arranges the sequence in which aggregators collect data so as to mitigate network contention. The optimization achieved up to a 67% improvement in I/O performance over the original ROMIO in HPIO benchmark runs using 768 processes on 64 compute nodes of the TSUBAME2.5 supercomputer at the Tokyo Institute of Technology. Even when the number of aggregators was reduced to one half or one third of the total number of processes, the optimization still delivered I/O performance comparable to the maximum.
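For readers unfamiliar with collective buffering, the sketch below shows how an application typically interacts with ROMIO's aggregators through standard MPI-IO hints. It is a minimal illustration under stated assumptions: the file name, buffer size, and the choice of half the processes as aggregators are illustrative, and it does not reproduce the topology-aware aggregation scheme proposed in the paper, which is implemented inside ROMIO itself.

    /* Hedged sketch: requesting collective buffering (two-phase I/O) and an
     * aggregator count via standard ROMIO hints. Illustrative only; the
     * paper's topology-aware aggregator placement is internal to ROMIO. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* "romio_cb_write" and "cb_nodes" are standard ROMIO hints; using
         * half of the processes as aggregators mirrors one of the
         * configurations explored in the abstract (an assumption here). */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");
        char cb_nodes[16];
        snprintf(cb_nodes, sizeof(cb_nodes), "%d", nprocs > 1 ? nprocs / 2 : 1);
        MPI_Info_set(info, "cb_nodes", cb_nodes);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Each process contributes one contiguous block; during the
         * collective write, two-phase I/O gathers the data at the
         * aggregators before it is written to the file system. */
        const int count = 1024;
        int buf[1024];
        for (int i = 0; i < count; i++)
            buf[i] = rank;

        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(int);
        MPI_File_write_at_all(fh, offset, buf, count, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

How much such hints help depends on the MPI library and the file system; the paper's results suggest that where aggregators are placed relative to the Lustre stripes, and the order in which they gather data from other processes, matters as much as how many aggregators there are.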