{"title":"用MPI实现高性能Coarray Fortran的源到源转换","authors":"H. Iwashita, M. Nakao, H. Murai, M. Sato","doi":"10.1145/3149457.3155888","DOIUrl":null,"url":null,"abstract":"Coarray Fortran (CAF) is a partitioned global address space (PGAS) language that is a part of standard Fortran 2008. We have implemented it as a source-to-source translator as a part of the Omni XcalebleMP compiler. Since the output is written in Fortran standard, the translator must utilize Fortran conventions such as the assumed-shape array and generic function in order to reduce both development costs and runtime overhead. The runtime library uses either GASNet, MPI-3, or Fujitsu's low-level Remote Direct Memory Access (RDMA) interface (FJ-RDMA) for one-sided communication. The Omni CAF translator and the runtime library support three types of memory managers that allocate coarray variables and register them to the communication library. The runtime library for the PUT/GET communication detects how contiguous and periodic the source and destination data are and performs communication aggregation. We measured fundamental performance levels by using EPCC Fortran Coarray microbenchmark and found our implementation of PUT/GET communication provides bandwidth as high as MPI_Send/Recv on two supercomputers. Although the small data latency was larger than the one of MPI_Send/Recv, we found that it could be reduced by using non-blocking communication for multiple coarray variables. As a result, when using 1024 processes, we achieved 27% and 42% higher performance than the original MPI code in the Himeno Benchmark classes L and XL, respectively.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"A Source-to-Source Translation of Coarray Fortran with MPI for High Performance\",\"authors\":\"H. Iwashita, M. Nakao, H. Murai, M. Sato\",\"doi\":\"10.1145/3149457.3155888\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Coarray Fortran (CAF) is a partitioned global address space (PGAS) language that is a part of standard Fortran 2008. We have implemented it as a source-to-source translator as a part of the Omni XcalebleMP compiler. Since the output is written in Fortran standard, the translator must utilize Fortran conventions such as the assumed-shape array and generic function in order to reduce both development costs and runtime overhead. The runtime library uses either GASNet, MPI-3, or Fujitsu's low-level Remote Direct Memory Access (RDMA) interface (FJ-RDMA) for one-sided communication. The Omni CAF translator and the runtime library support three types of memory managers that allocate coarray variables and register them to the communication library. The runtime library for the PUT/GET communication detects how contiguous and periodic the source and destination data are and performs communication aggregation. We measured fundamental performance levels by using EPCC Fortran Coarray microbenchmark and found our implementation of PUT/GET communication provides bandwidth as high as MPI_Send/Recv on two supercomputers. Although the small data latency was larger than the one of MPI_Send/Recv, we found that it could be reduced by using non-blocking communication for multiple coarray variables. 
As a result, when using 1024 processes, we achieved 27% and 42% higher performance than the original MPI code in the Himeno Benchmark classes L and XL, respectively.\",\"PeriodicalId\":314778,\"journal\":{\"name\":\"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-01-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3149457.3155888\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3149457.3155888","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Coarray Fortran (CAF) is a partitioned global address space (PGAS) language that is part of the Fortran 2008 standard. We have implemented it as a source-to-source translator that is part of the Omni XcalableMP compiler. Since the output is written in standard Fortran, the translator must exploit Fortran features such as assumed-shape arrays and generic functions in order to reduce both development costs and runtime overhead. The runtime library uses either GASNet, MPI-3, or Fujitsu's low-level Remote Direct Memory Access (RDMA) interface (FJ-RDMA) for one-sided communication. The Omni CAF translator and the runtime library support three types of memory managers that allocate coarray variables and register them with the communication library. The runtime library for PUT/GET communication detects how contiguous and periodic the source and destination data are and aggregates the communication accordingly. We measured fundamental performance with the EPCC Fortran Coarray microbenchmark and found that our PUT/GET implementation provides bandwidth as high as that of MPI_Send/Recv on two supercomputers. Although the small-data latency was larger than that of MPI_Send/Recv, we found that it could be reduced by using non-blocking communication over multiple coarray variables. As a result, when using 1024 processes, we achieved 27% and 42% higher performance than the original MPI code in the Himeno Benchmark classes L and XL, respectively.
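To make the abstract's workflow concrete, the following Coarray Fortran sketch (not taken from the paper) shows the kind of coindexed assignment the translator would rewrite into a runtime PUT. The comment illustrates, purely as an assumption, how such a statement might be lowered to a runtime call; the names coarray_put and descriptor_of_a are hypothetical placeholders, not the actual Omni API.

    program caf_put_sketch
      implicit none
      real, allocatable :: a(:,:)[:]      ! allocatable coarray, symmetric on all images
      integer :: me, np, left

      allocate(a(128,128)[*])
      me   = this_image()
      np   = num_images()
      left = merge(np, me - 1, me == 1)   ! periodic left neighbor

      a = real(me)
      sync all

      ! PUT: write this image's first row into the halo row of the left image.
      ! In column-major storage this section is strided, so a runtime of the
      ! kind described in the abstract would detect the stride and aggregate
      ! the transfer, e.g. via a call such as
      !     call coarray_put(descriptor_of_a, a(1,:), left)   ! hypothetical API
      ! backed by GASNet, MPI-3 one-sided communication, or FJ-RDMA.
      a(128, :)[left] = a(1, :)

      sync all
      if (me == 1) print *, 'halo exchange finished on', np, 'images'
    end program caf_put_sketch

Passing the section to an assumed-shape dummy argument would let the runtime query its shape and strides through the array descriptor rather than through generated bookkeeping code, which is consistent with the abstract's point that standard Fortran features help keep both development cost and runtime overhead low.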