Author: D. Panda
Venue: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Published: 2022-05-01
DOI: https://doi.org/10.1109/ipdps53621.2022.00009
Challenges and Opportunities in Designing High-Performance and Scalable Middleware for HPC and AI: Past, Present, and Future
This talk examines the challenges and opportunities that have emerged over the years (past, present, and future) in designing middleware for HPC and AI (Deep/Machine Learning) workloads on modern high-end computing systems. It first presents the challenges in designing HPC runtime environments with MPI+X programming models, considering support for dense multi-core CPUs, high-performance interconnects, GPUs, and emerging DPUs. It then presents advanced designs and solutions (such as RDMA, in-network computing, GPUDirect RDMA, and on-the-fly compression) that exploit novel features of these technologies, along with their benefits in the context of the MVAPICH2 libraries. Next, the talk turns to MPI-driven solutions for the Deep/Machine Learning domains that extract performance and scalability for popular Deep Learning frameworks, large out-of-core models, GPUs, and DPUs. MPI-driven solutions to accelerate data science applications such as Dask are highlighted. Challenges and experiences in deploying this middleware in HPC cloud environments on Azure, AWS, and Oracle Cloud are also presented. The talk concludes with an overview of the newly established NSF-AI Institute ICICLE (https://icicle.osu.edu/), which aims to address challenges in designing future high-performance edge-to-HPC/cloud middleware for AI-driven data-intensive applications.