Production Deployment of Machine-Learned Rotorcraft Surrogate Models on HPC
W. Brewer, Daniel Martínez, Mathew Boyer, D. Jude, A. Wissink, Ben Parsons, Junqi Yin, Valentine Anantharaj
2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), November 2021
DOI: 10.1109/mlhpc54614.2021.00008
Citations: 4
Abstract
We explore how to optimally deploy, on HPC systems, several types of machine-learned surrogate models used in rotorcraft aerodynamics. We first developed three rotorcraft models spanning three orders of magnitude in size (2M, 44M, and 212M trainable parameters) to use as test models. We then developed a benchmark, which we call "smiBench", that uses synthetic data to exercise a wide range of alternative configurations and study optimal deployment scenarios. We identified several distinct optimal deployment scenarios depending on model size and inference frequency. In most cases, it makes sense to use multiple inference servers, each bound to a GPU, with a load balancer distributing requests across the GPUs. We tested three types of inference server deployments: (1) a custom Flask-based HTTP inference server, (2) TensorFlow Serving with the gRPC protocol, and (3) a RedisAI server with SmartRedis clients using the RESP protocol. We also tested three load-balancing techniques for multi-GPU inferencing: (1) a Python concurrent.futures thread pool, (2) HAProxy, and (3) mpi4py. We investigated deployments on both DoD HPCMP's SCOUT and DOE OLCF's Summit POWER9 supercomputers, demonstrated the ability to perform inference on a million samples per second using 192 GPUs, and studied multiple scenarios on both Nvidia T4 and V100 GPUs. Moreover, we studied a range of concurrency levels, on both the client side and the server side, and provide optimal configuration advice based on the type of deployment. Finally, we provide a simple Python-based framework for benchmarking machine-learned surrogate models using the various inference servers.
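The client-side load-balancing approach named in the abstract, a Python concurrent.futures thread pool fanning requests out to several per-GPU inference servers, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's smiBench code: the endpoint addresses, the /predict route, and the JSON payload format are hypothetical stand-ins for a generic Flask-style HTTP inference server.

```python
# Minimal sketch of client-side load balancing across several per-GPU
# HTTP inference servers using a concurrent.futures thread pool.
# Endpoints, the /predict route, and the payload format are assumptions
# for illustration; they are not the paper's smiBench implementation.
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import cycle

import requests

# One Flask-style inference server per GPU (hypothetical addresses).
ENDPOINTS = [
    "http://node01:5000/predict",
    "http://node01:5001/predict",
    "http://node02:5000/predict",
    "http://node02:5001/predict",
]

def infer(endpoint, batch):
    """Send one batch of samples to a single inference server."""
    resp = requests.post(endpoint, json={"inputs": batch}, timeout=30)
    resp.raise_for_status()
    return resp.json()

def run_batches(batches, max_workers=16):
    """Round-robin batches over the endpoints, issuing requests concurrently."""
    targets = cycle(ENDPOINTS)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(infer, next(targets), batch) for batch in batches]
        for future in as_completed(futures):
            results.append(future.result())
    return results

if __name__ == "__main__":
    # Synthetic payloads standing in for surrogate-model input vectors.
    batches = [[[0.0] * 32] * 128 for _ in range(64)]
    outputs = run_batches(batches)
    print(f"received {len(outputs)} responses")
```

HAProxy and mpi4py are the alternative request-distribution mechanisms studied in the paper; the thread pool above is the simplest, purely client-side variant of the same fan-out idea.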