HPCFAIR: Enabling FAIR AI for HPC Applications
Gaurav Verma, M. Emani, C. Liao, Pei-Hung Lin, T. Vanderbruggen, Xipeng Shen, Barbara M. Chapman
2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). DOI: 10.1109/mlhpc54614.2021.00011

Abstract: Artificial Intelligence (AI) is being adopted in different domains at an unprecedented scale. There is also significant interest in the scientific community in leveraging machine learning (ML) to run high-performance computing applications effectively at scale. Across the many efforts in this arena, work is often duplicated where existing rich datasets and ML models could be leveraged instead. The primary challenge is the lack of an ecosystem for reusing and reproducing models and datasets. In this work, we propose HPCFAIR, a modular, extensible framework that enables AI models to be Findable, Accessible, Interoperable, and Reproducible (FAIR). It gives users a structured approach to search, load, save, and reuse models in their codes. We present the design and implementation of our framework and highlight how it can be seamlessly integrated into ML-driven high-performance computing applications and scientific machine learning workloads.
Production Deployment of Machine-Learned Rotorcraft Surrogate Models on HPC
W. Brewer, Daniel Martínez, Mathew Boyer, D. Jude, A. Wissink, Ben Parsons, Junqi Yin, Valentine Anantharaj
2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). DOI: 10.1109/mlhpc54614.2021.00008

Abstract: We explore how to optimally deploy several different types of machine-learned surrogate models used in rotorcraft aerodynamics on HPC. We first developed three rotorcraft models at three different orders of magnitude (2M, 44M, and 212M trainable parameters) to use as test models. We then developed a benchmark, "smiBench", that uses synthetic data to test a wide range of alternative configurations and study optimal deployment scenarios. We discovered several types of optimal deployment scenarios depending on model size and inference frequency. For most cases, it makes sense to use multiple inference servers, each bound to a GPU, with a load balancer distributing requests across the GPUs. We tested three types of inference server deployments: (1) a custom Flask-based HTTP inference server, (2) TensorFlow Serving with the gRPC protocol, and (3) a RedisAI server with SmartRedis clients using the RESP protocol. We also tested three load-balancing techniques for multi-GPU inferencing: (1) a Python concurrent.futures thread pool, (2) HAProxy, and (3) mpi4py. We investigated deployments on both DoD HPCMP's SCOUT and DOE OLCF's Summit POWER9 supercomputers, demonstrated inference on a million samples per second using 192 GPUs, and studied multiple scenarios on both NVIDIA T4 and V100 GPUs. Moreover, we studied a range of concurrency levels on both the client and server sides, and provide optimal configuration advice based on the type of deployment. Finally, we provide a simple Python-based framework for benchmarking machine-learned surrogate models using the various inference servers.
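Of the three load-balancing techniques the paper tests, the Python concurrent.futures thread pool is the easiest to sketch. The toy below fans inference requests out round-robin over stand-in server functions, one per GPU; a real deployment would replace those functions with HTTP, gRPC, or RESP calls to the actual inference servers. Everything here is illustrative, not the paper's smiBench code.

```python
# Client-side load balancing with a thread pool: requests are assigned
# round-robin to per-GPU "servers" and submitted concurrently.
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

def make_server(gpu_id):
    # Stand-in for an inference server bound to one GPU; a real client
    # would issue a network call here instead of computing locally.
    def infer(batch):
        return {"gpu": gpu_id, "preds": [x * 2.0 for x in batch]}
    return infer

servers = [make_server(i) for i in range(4)]   # e.g. 4 GPUs on one node
assignment = cycle(servers)                    # round-robin dispatcher

def run_requests(batches):
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        # next(assignment) is called in submission order, so batches
        # are spread evenly across the servers.
        futures = [pool.submit(next(assignment), batch) for batch in batches]
        return [f.result() for f in futures]
```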
HPC Ontology: Towards a Unified Ontology for Managing Training Datasets and AI Models for High-Performance Computing
C. Liao, Pei-Hung Lin, Gaurav Verma, T. Vanderbruggen, M. Emani, Zifan Nan, Xipeng Shen
2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). DOI: 10.1109/mlhpc54614.2021.00012

Abstract: Machine learning (ML) techniques have been widely studied to address various challenges of productively and efficiently running large-scale scientific applications on heterogeneous supercomputers. However, it is extremely difficult to generate, access, and maintain the training datasets and AI models needed to accelerate ML-based research. The Future of Research Communications and e-Scholarship has proposed the FAIR data principles, describing Findability, Accessibility, Interoperability, and Reusability. In this paper, we present our ongoing work on designing an ontology for high-performance computing (named the HPC ontology) to make training datasets and AI models FAIR. The ontology provides controlled vocabularies, explicit semantics, and formal knowledge representations. Our design uses an extensible two-level pattern, capturing both high-level meta information and low-level data content for software, hardware, experiments, workflows, training datasets, AI models, and so on. Preliminary evaluation shows that the HPC ontology is effective for annotating selected data and supporting a set of SPARQL queries.
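SPARQL queries over an ontology ultimately reduce to matching patterns against subject-predicate-object triples. As a stdlib-only stand-in, the sketch below annotates a model and its training data with triples and answers a basic graph-pattern query, with None acting as a wildcard. The vocabulary terms are invented examples, not the HPC ontology's actual IRIs, and real deployments would use an RDF store with genuine SPARQL.

```python
# Triple-based annotation plus pattern matching, the core idea behind
# SPARQL basic graph patterns. All terms below are made-up examples.
triples = {
    ("model:resnet-surrogate", "hpc:trainedOn", "dataset:hpl-runs"),
    ("model:resnet-surrogate", "hpc:targetHardware", "hw:v100"),
    ("dataset:hpl-runs", "hpc:generatedBy", "sw:hpl-2.3"),
}

def match(pattern, store=triples):
    # A pattern is a (subject, predicate, object) tuple; None matches anything,
    # like an unbound variable in a SPARQL query.
    s, p, o = pattern
    return sorted(t for t in store
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))
```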
Is Disaggregation possible for HPC Cognitive Simulation?
Michael R. Wyatt, Valen Yamamoto, Zoë Tosi, I. Karlin, B. V. Essen
2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). DOI: 10.1109/mlhpc54614.2021.00014

Abstract: Cognitive simulation (CogSim) is an important and emerging workflow for HPC scientific exploration and scientific machine learning (SciML). One challenging CogSim workload is the replacement of one component of a complex physical simulation with a fast, learned surrogate model that sits "inside" the computational loop. Executing this in-the-loop inference is particularly challenging because it requires frequent inference across multiple possible target models, can be on the simulation's critical path (latency bound), is subject to requests from multiple MPI ranks, and typically involves a small number of samples per request. In this paper we explore the use of large, dedicated deep learning/AI accelerators that are disaggregated from compute nodes for this CogSim workload, and compare the trade-offs of using these accelerators versus node-local GPU accelerators on leadership-class HPC systems.
Semantic-Aware Lossless Data Compression for Deep Learning Recommendation Model (DLRM)
S. Pumma, Abhinav Vishnu
2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). DOI: 10.1109/mlhpc54614.2021.00006

Abstract: As the architectures and capabilities of deep neural networks evolve, they become more demanding to train and use. The Deep Learning Recommendation Model (DLRM), a new neural network for recommendation systems, introduces challenging requirements for deep neural network training and inference. A DLRM model is typically too large to fit in a single GPU's memory. Unlike other deep neural networks, DLRM requires both model parallelism (for the bottom part of the model) and data parallelism (for the top part) when running on multiple GPUs. Because of this hybrid-parallel model, all-to-all communication is used to stitch the top and bottom parts together. We have observed that this all-to-all communication is costly and is a bottleneck in DLRM training and inference. In this paper, we propose a novel approach that reduces the communication volume by using DLRM's properties to compress the transferred data without information loss. We demonstrate the benefits of our method by training DLRM MLPerf on eight AMD Instinct MI100 accelerators. The experimental results show 59% and 38% improvements in the time-to-solution of DLRM MLPerf training for FP32 and mixed precision, respectively.
Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing
Logan T. Ward, G. Sivaraman, J. G. Pauloski, Y. Babuji, Ryan Chard, Naveen K. Dandu, P. Redfern, R. Assary, K. Chard, L. Curtiss, R. Thakur, Ian T. Foster
2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). DOI: 10.1109/MLHPC54614.2021.00007

Abstract: Scientific applications that involve simulation ensembles can be accelerated greatly by using experiment design methods to select the best simulations to perform. Methods that use machine learning (ML) to create proxy models of simulations show particular promise for guiding ensembles but are challenging to deploy because of the need to coordinate dynamic mixes of simulation and learning tasks. We present Colmena, an open-source Python framework that allows users to steer campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. Colmena handles task dispatch, results collation, ML model invocation, and ML model (re)training, using Parsl to execute tasks on HPC systems. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65,536 CPUs and accelerates the discovery rate for high-performance molecules by a factor of 100 over unguided searches.
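The steering pattern that Colmena implements at scale can be caricatured in a few lines: evaluate a handful of expensive simulations, use a cheap surrogate to score the unevaluated candidates, and always run the most promising one next. The nearest-neighbour surrogate and toy objective below are invented for illustration only; Colmena itself dispatches real simulation and learning tasks via Parsl and handles the dispatch/collation/retraining machinery the abstract lists.

```python
# ML-guided steering of a simulation ensemble, in miniature.
def steer(candidates, simulate, seeds, budget):
    observed = {x: simulate(x) for x in seeds}          # initial evaluations
    remaining = [c for c in candidates if c not in observed]
    while len(observed) < budget and remaining:
        def predict(x):
            # Cheap surrogate: score of the nearest evaluated candidate.
            nearest = min(observed, key=lambda o: abs(o - x))
            return observed[nearest]
        best = max(remaining, key=predict)              # most promising next run
        remaining.remove(best)
        observed[best] = simulate(best)                 # "expensive" simulation
    return max(observed, key=observed.get)              # best candidate found
```

Even this crude surrogate concentrates the evaluation budget near good regions instead of sweeping every candidate, which is the mechanism behind the discovery-rate speedups the paper reports.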
HYPPO: A Surrogate-Based Multi-Level Parallelism Tool for Hyperparameter Optimization
Vincent Dumont, Casey Garner, Anuradha Trivedi, Chelsea Jones, V. Ganapati, Juliane Mueller, T. Perciano, M. Kiran, Marcus Day
2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). DOI: 10.1109/MLHPC54614.2021.00013

Abstract: We present new software, HYPPO, that enables automatic tuning of the hyperparameters of various deep learning (DL) models. Unlike other hyperparameter optimization (HPO) methods, HYPPO uses adaptive surrogate models and directly accounts for uncertainty in model predictions to find accurate and reliable models that make robust predictions. Using asynchronous nested parallelism, we are able to significantly alleviate the computational burden of training complex architectures and quantifying the uncertainty. HYPPO is implemented in Python and can be used with both the TensorFlow and PyTorch libraries. We demonstrate various software features on time-series prediction and image classification problems, as well as a scientific application in computed tomography image reconstruction. Finally, we show that (1) the number of evaluations needed to find the optimal region in the hyperparameter space can be reduced by an order of magnitude, and (2) the time for such an HPO process to complete can be reduced by two orders of magnitude.
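One way to picture "directly accounting for uncertainty in model predictions" is to score each hyperparameter configuration by repeated noisy evaluations and penalise the variance, so that configurations which are good only occasionally lose to ones that are good reliably. The objective and noise model below are entirely invented, and HYPPO's actual surrogate modelling and asynchronous nested parallelism are not reproduced here.

```python
# Uncertainty-aware hyperparameter selection, in toy form.
import random
import statistics

def evaluate(config, seed):
    # Stand-in for training a DL model with these hyperparameters:
    # validation loss = config-dependent base + stochastic term.
    rng = random.Random(config["lr_exp"] * 1000 + seed)
    base = (config["lr_exp"] + 3) ** 2            # best at lr_exp = -3
    noise = rng.uniform(0.0, config["noise"])     # unstable configs vary more
    return base + noise

def robust_score(config, repeats=5):
    losses = [evaluate(config, s) for s in range(repeats)]
    # Penalise spread: a config must be both low-loss and consistent.
    return statistics.mean(losses) + statistics.stdev(losses)

def select(configs):
    return min(configs, key=robust_score)
```

In a real HPO run, the repeated trainings behind each `robust_score` are exactly the work that HYPPO parallelises asynchronously across HPC resources.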