How I learned to stop worrying about user-visible endpoints and love MPI

Rohit Zambre, Aparna Chandramowlishwaran, P. Balaji
{"title":"我是如何学会不再担心用户可见的端点而爱上MPI的","authors":"Rohit Zambre, Aparna Chandramowlishwaran, P. Balaji","doi":"10.1145/3392717.3392773","DOIUrl":null,"url":null,"abstract":"MPI+threads is gaining prominence as an alternative to the traditional \"MPI everywhere\" model in order to better handle the disproportionate increase in the number of cores compared with other on-node resources. However, the communication performance of MPI+threads can be 100x slower than that of MPI everywhere. Both MPI users and developers are to blame for this slowdown. MPI users traditionally have not exposed logical communication parallelism. Consequently, MPI libraries have used conservative approaches, such as a global critical section, to maintain MPI's ordering constraints for MPI+threads, thus serializing access to the underlying parallel network resources and limiting performance. To enhance the communication performance of MPI+threads, researchers have proposed MPI Endpoints as a user-visible extension to the MPI-3.1 standard. MPI Endpoints allows a single process to create multiple MPI ranks within a communicator. This could, in theory, allow each thread to have a dedicated communication path to the network, thus avoiding resource contention between threads and improving performance. The onus of mapping threads to endpoints, however, would then be on domain scientists. In this paper we play the role of devil's advocate and question the need for such user-visible endpoints. We certainly agree that dedicated communication channels are critical. To what extent, however, can we hide these channels inside the MPI library without modifying the MPI standard and thus unburden the user? More important, what functionality would we lose through such abstraction? This paper answers these questions through a new implementation of the MPI-3.1 standard that uses multiple virtual communication interfaces (VCIs) inside the MPI library. VCIs abstract underlying network contexts. When users expose parallelism through existing MPI mechanisms, the MPI library maps that parallelism to the VCIs, relieving the domain scientists from worrying about endpoints. We identify cases where user-exposed parallelism on VCIs perform as well as user-visible endpoints, as well as cases where such abstraction hurts performance.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"88 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"How I learned to stop worrying about user-visible endpoints and love MPI\",\"authors\":\"Rohit Zambre, Aparna Chandramowlishwaran, P. Balaji\",\"doi\":\"10.1145/3392717.3392773\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"MPI+threads is gaining prominence as an alternative to the traditional \\\"MPI everywhere\\\" model in order to better handle the disproportionate increase in the number of cores compared with other on-node resources. However, the communication performance of MPI+threads can be 100x slower than that of MPI everywhere. Both MPI users and developers are to blame for this slowdown. MPI users traditionally have not exposed logical communication parallelism. 
Consequently, MPI libraries have used conservative approaches, such as a global critical section, to maintain MPI's ordering constraints for MPI+threads, thus serializing access to the underlying parallel network resources and limiting performance. To enhance the communication performance of MPI+threads, researchers have proposed MPI Endpoints as a user-visible extension to the MPI-3.1 standard. MPI Endpoints allows a single process to create multiple MPI ranks within a communicator. This could, in theory, allow each thread to have a dedicated communication path to the network, thus avoiding resource contention between threads and improving performance. The onus of mapping threads to endpoints, however, would then be on domain scientists. In this paper we play the role of devil's advocate and question the need for such user-visible endpoints. We certainly agree that dedicated communication channels are critical. To what extent, however, can we hide these channels inside the MPI library without modifying the MPI standard and thus unburden the user? More important, what functionality would we lose through such abstraction? This paper answers these questions through a new implementation of the MPI-3.1 standard that uses multiple virtual communication interfaces (VCIs) inside the MPI library. VCIs abstract underlying network contexts. When users expose parallelism through existing MPI mechanisms, the MPI library maps that parallelism to the VCIs, relieving the domain scientists from worrying about endpoints. We identify cases where user-exposed parallelism on VCIs perform as well as user-visible endpoints, as well as cases where such abstraction hurts performance.\",\"PeriodicalId\":346687,\"journal\":{\"name\":\"Proceedings of the 34th ACM International Conference on Supercomputing\",\"volume\":\"88 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 34th ACM International Conference on Supercomputing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3392717.3392773\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th ACM International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3392717.3392773","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 11

Abstract

MPI+threads is gaining prominence as an alternative to the traditional "MPI everywhere" model in order to better handle the disproportionate increase in the number of cores compared with other on-node resources. However, the communication performance of MPI+threads can be 100x slower than that of MPI everywhere. Both MPI users and developers are to blame for this slowdown. MPI users traditionally have not exposed logical communication parallelism. Consequently, MPI libraries have used conservative approaches, such as a global critical section, to maintain MPI's ordering constraints for MPI+threads, thus serializing access to the underlying parallel network resources and limiting performance. To enhance the communication performance of MPI+threads, researchers have proposed MPI Endpoints as a user-visible extension to the MPI-3.1 standard. MPI Endpoints allows a single process to create multiple MPI ranks within a communicator. This could, in theory, allow each thread to have a dedicated communication path to the network, thus avoiding resource contention between threads and improving performance. The onus of mapping threads to endpoints, however, would then be on domain scientists. In this paper we play the role of devil's advocate and question the need for such user-visible endpoints. We certainly agree that dedicated communication channels are critical. To what extent, however, can we hide these channels inside the MPI library without modifying the MPI standard and thus unburden the user? More importantly, what functionality would we lose through such abstraction? This paper answers these questions through a new implementation of the MPI-3.1 standard that uses multiple virtual communication interfaces (VCIs) inside the MPI library. VCIs abstract underlying network contexts. When users expose parallelism through existing MPI mechanisms, the MPI library maps that parallelism to the VCIs, relieving domain scientists from worrying about endpoints. We identify cases where user-exposed parallelism on VCIs performs as well as user-visible endpoints, as well as cases where such abstraction hurts performance.
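Below is a minimal sketch (ours, not code from the paper) of the kind of MPI+threads program the abstract discusses. It exposes logical communication parallelism through an existing MPI-3.1 mechanism, one duplicated communicator per thread, so that a VCI-aware library such as the implementation the paper describes could route each communicator's traffic to a distinct virtual communication interface. The program is plain MPI-3.1 plus OpenMP; whether the per-communicator traffic actually lands on separate network contexts depends entirely on the library.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* MPI+threads sketch (illustrative): each OpenMP thread communicates
 * on its own duplicated communicator, exposing logical communication
 * parallelism via an existing MPI-3.1 mechanism. A VCI-aware library
 * can map each communicator to a distinct network context; a
 * conservative one serializes all of them on a global critical
 * section. Assumes an even number of ranks (pairwise exchange). */
int main(int argc, char **argv)
{
    int provided, rank, nthreads;

    /* MPI+threads requires full multithreading support. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    nthreads = omp_get_max_threads();

    /* One communicator per thread: the user-visible handle whose
     * traffic the library may route to a dedicated channel. */
    MPI_Comm *comms = malloc(nthreads * sizeof(MPI_Comm));
    for (int t = 0; t < nthreads; t++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);

#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int peer = rank ^ 1;            /* pair ranks 0-1, 2-3, ... */
        int sendval = rank, recvval = -1;

        /* No two threads share a communicator, so MPI's message
         * ordering constraints never couple their operations. */
        MPI_Sendrecv(&sendval, 1, MPI_INT, peer, /*tag=*/tid,
                     &recvval, 1, MPI_INT, peer, /*tag=*/tid,
                     comms[tid], MPI_STATUS_IGNORE);
        printf("rank %d thread %d received %d\n", rank, tid, recvval);
    }

    for (int t = 0; t < nthreads; t++)
        MPI_Comm_free(&comms[t]);
    free(comms);
    MPI_Finalize();
    return 0;
}

Compiled with, e.g., mpicc -fopenmp and run on an even number of ranks, this is a legal MPI-3.1 program today; the question the paper poses is whether the library, rather than the user via an endpoints extension, should own the mapping from such handles to the parallel network resources.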