{"title":"BatOpt:使用动态批处理优化基于 GPU 的深度学习推理","authors":"Deyu Zhang;Yunzhen Luo;Yaobo Wang;Xiaoyan Kui;Ju Ren","doi":"10.1109/TCC.2024.3350561","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) has been applied in billions of mobile devices due to its astonishing performance in image, text, and audio processing. However, limited by the computing capability of mobile devices, a large amount of DL inference tasks need to be offloaded to edge or cloud servers, which makes powerful GPU servers are struggling to ensure the quality of service(QoS). To better utilize the highly parallel computing architecture of GPU to improve the QoS, we propose BatOpt, a framework that uses dynamic batch processing to strike a good balance between service latency and GPU memory usage in DL inference services. Specifically, BatOpt innovatively models the DL inference service as a \n<inline-formula><tex-math>$M/G(a,b)/1/N$</tex-math></inline-formula>\n queue, with the consideration of stochastic task arrivals, which enables it to predict the service latency accurately in different system states. Furthermore, we propose an optimization algorithm to trade off the service latency and GPU memory usage in different system states by analyzing the queueing model. We have implemented BatOpt on Pytorch and evaluated it on an RTX 2080 GPU using real DL models. BatOpt brings up to 31x and 4.3x times performance boost in terms of service latency, compared to single-input and fixed-batch-size strategies, respectively. And BatOpt's maximum GPU memory usage is only 0.3x that of greedy-dynamic-batch-size strategy on the premise of the same service latency.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":"12 1","pages":"174-185"},"PeriodicalIF":5.3000,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"BatOpt: Optimizing GPU-Based Deep Learning Inference Using Dynamic Batch Processing\",\"authors\":\"Deyu Zhang;Yunzhen Luo;Yaobo Wang;Xiaoyan Kui;Ju Ren\",\"doi\":\"10.1109/TCC.2024.3350561\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning (DL) has been applied in billions of mobile devices due to its astonishing performance in image, text, and audio processing. However, limited by the computing capability of mobile devices, a large amount of DL inference tasks need to be offloaded to edge or cloud servers, which makes powerful GPU servers are struggling to ensure the quality of service(QoS). To better utilize the highly parallel computing architecture of GPU to improve the QoS, we propose BatOpt, a framework that uses dynamic batch processing to strike a good balance between service latency and GPU memory usage in DL inference services. Specifically, BatOpt innovatively models the DL inference service as a \\n<inline-formula><tex-math>$M/G(a,b)/1/N$</tex-math></inline-formula>\\n queue, with the consideration of stochastic task arrivals, which enables it to predict the service latency accurately in different system states. Furthermore, we propose an optimization algorithm to trade off the service latency and GPU memory usage in different system states by analyzing the queueing model. We have implemented BatOpt on Pytorch and evaluated it on an RTX 2080 GPU using real DL models. BatOpt brings up to 31x and 4.3x times performance boost in terms of service latency, compared to single-input and fixed-batch-size strategies, respectively. 
And BatOpt's maximum GPU memory usage is only 0.3x that of greedy-dynamic-batch-size strategy on the premise of the same service latency.\",\"PeriodicalId\":13202,\"journal\":{\"name\":\"IEEE Transactions on Cloud Computing\",\"volume\":\"12 1\",\"pages\":\"174-185\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2024-01-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Cloud Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10382642/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cloud Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10382642/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
BatOpt: Optimizing GPU-Based Deep Learning Inference Using Dynamic Batch Processing
Deep learning (DL) has been deployed on billions of mobile devices due to its astonishing performance in image, text, and audio processing. However, constrained by the computing capability of mobile devices, a large number of DL inference tasks must be offloaded to edge or cloud servers, leaving even powerful GPU servers struggling to ensure quality of service (QoS). To better utilize the highly parallel computing architecture of GPUs and improve QoS, we propose BatOpt, a framework that uses dynamic batch processing to strike a balance between service latency and GPU memory usage in DL inference services. Specifically, BatOpt innovatively models the DL inference service as an $M/G(a,b)/1/N$ queue, taking stochastic task arrivals into account, which enables it to accurately predict service latency in different system states. Furthermore, by analyzing the queueing model, we propose an optimization algorithm that trades off service latency against GPU memory usage in different system states. We have implemented BatOpt on PyTorch and evaluated it on an RTX 2080 GPU using real DL models. BatOpt delivers up to 31x and 4.3x improvements in service latency compared to the single-input and fixed-batch-size strategies, respectively. Moreover, BatOpt's peak GPU memory usage is only 0.3x that of the greedy-dynamic-batch-size strategy under the same service latency.
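To make the dynamic-batching idea concrete, the sketch below shows a minimal batching loop in PyTorch: pending requests are drained from a queue up to a maximum batch size or until a short wait deadline expires, then executed as a single forward pass on the GPU. This is not the authors' BatOpt algorithm; the `max_batch_size`/`max_wait_s` parameters, the request queue, and the ResNet-18 model are illustrative assumptions standing in for BatOpt's optimized batch-size policy.

```python
# Minimal dynamic-batching sketch (illustrative; not the BatOpt algorithm).
import queue
import time

import torch
import torchvision.models as models

# Incoming inference requests (each a single input tensor).
request_queue: "queue.Queue[torch.Tensor]" = queue.Queue()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(weights=None).eval().to(device)


def serve_one_batch(max_batch_size: int = 32, max_wait_s: float = 0.01) -> torch.Tensor:
    """Collect up to max_batch_size pending inputs, then run one forward pass."""
    inputs = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(inputs) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            inputs.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    batch = torch.stack(inputs).to(device)  # shape: (B, 3, 224, 224)
    with torch.no_grad():
        # One GPU kernel launch amortized over B requests.
        return model(batch).cpu()


# Usage: enqueue a few synthetic requests and serve them as one batch.
for _ in range(4):
    request_queue.put(torch.randn(3, 224, 224))
print(serve_one_batch().shape)  # torch.Size([4, 1000])
```

A fixed-batch-size server would always wait for `max_batch_size` inputs, while this loop, like any dynamic-batching policy, trades a bounded extra wait for higher GPU utilization; BatOpt's contribution is choosing that batching policy analytically via the queueing model rather than with hand-tuned constants.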
Journal Introduction:
The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.