嵌入式 GPU 多租户推理的自适应内核合并与融合

IF 1.7 4区 计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Jaebeom Jeon;Gunjae Koo;Myung Kuk Yoon;Yunho Oh
{"title":"嵌入式 GPU 多租户推理的自适应内核合并与融合","authors":"Jaebeom Jeon;Gunjae Koo;Myung Kuk Yoon;Yunho Oh","doi":"10.1109/LES.2024.3351753","DOIUrl":null,"url":null,"abstract":"This letter proposes a new scheme that improves throughput and reduces queuing delay while running multiple inferences in embedded graphics processing unit (GPU)-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or scheduling techniques, this letter proposes a new software technique that merges and fuses kernels by monitoring the requests in a queue. The proposed technique first monitors a fixed number of requests and groups the requests running the same model. Then, it creates the kernels that iteratively process the grouped requests. We call such a technique kernel merging. After that, the proposed technique performs kernel fusion with merged kernels. Eventually, our idea minimizes the number of concurrent kernels, thus mitigating stalls caused by frequent context switching in a GPU. In our evaluation, the proposed kernel merge and fusion achieve \n<inline-formula> <tex-math>$2.7\\times $ </tex-math></inline-formula>\n better throughput, 47% shorter average kernel execution time, and 63% shorter tail latency than prior work.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"421-424"},"PeriodicalIF":1.7000,"publicationDate":"2024-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Adaptive Kernel Merge and Fusion for Multi-Tenant Inference in Embedded GPUs\",\"authors\":\"Jaebeom Jeon;Gunjae Koo;Myung Kuk Yoon;Yunho Oh\",\"doi\":\"10.1109/LES.2024.3351753\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This letter proposes a new scheme that improves throughput and reduces queuing delay while running multiple inferences in embedded graphics processing unit (GPU)-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or scheduling techniques, this letter proposes a new software technique that merges and fuses kernels by monitoring the requests in a queue. The proposed technique first monitors a fixed number of requests and groups the requests running the same model. Then, it creates the kernels that iteratively process the grouped requests. We call such a technique kernel merging. After that, the proposed technique performs kernel fusion with merged kernels. Eventually, our idea minimizes the number of concurrent kernels, thus mitigating stalls caused by frequent context switching in a GPU. In our evaluation, the proposed kernel merge and fusion achieve \\n<inline-formula> <tex-math>$2.7\\\\times $ </tex-math></inline-formula>\\n better throughput, 47% shorter average kernel execution time, and 63% shorter tail latency than prior work.\",\"PeriodicalId\":56143,\"journal\":{\"name\":\"IEEE Embedded Systems Letters\",\"volume\":\"16 4\",\"pages\":\"421-424\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-01-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Embedded Systems Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10384636/\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Embedded Systems Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10384636/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

摘要

本文提出了一种在基于嵌入式图形处理单元(GPU)的系统中运行多个推理时提高吞吐量和减少排队延迟的新方案。我们观察到,嵌入式系统使用固定数量的深度学习模型运行推理,并且推理请求通常使用相同的模型。与先前提出内核融合或调度技术的工作不同,这封信提出了一种新的软件技术,通过监视队列中的请求来合并和融合内核。建议的技术首先监视固定数量的请求,并将运行相同模型的请求分组。然后,它创建迭代处理分组请求的内核。我们称这种技术为内核合并。然后,利用合并的核进行核融合。最终,我们的想法最小化了并发内核的数量,从而减轻了GPU中频繁上下文切换造成的延迟。在我们的评估中,所提出的内核合并和融合实现了2.7倍的吞吐量,平均内核执行时间缩短了47%,尾部延迟缩短了63%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Adaptive Kernel Merge and Fusion for Multi-Tenant Inference in Embedded GPUs
This letter proposes a new scheme that improves throughput and reduces queuing delay while running multiple inferences in embedded graphics processing unit (GPU)-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or scheduling techniques, this letter proposes a new software technique that merges and fuses kernels by monitoring the requests in a queue. The proposed technique first monitors a fixed number of requests and groups the requests running the same model. Then, it creates the kernels that iteratively process the grouped requests. We call such a technique kernel merging. After that, the proposed technique performs kernel fusion with merged kernels. Eventually, our idea minimizes the number of concurrent kernels, thus mitigating stalls caused by frequent context switching in a GPU. In our evaluation, the proposed kernel merge and fusion achieve $2.7\times $ better throughput, 47% shorter average kernel execution time, and 63% shorter tail latency than prior work.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
IEEE Embedded Systems Letters
IEEE Embedded Systems Letters Engineering-Control and Systems Engineering
CiteScore
3.30
自引率
0.00%
发文量
65
期刊介绍: The IEEE Embedded Systems Letters (ESL), provides a forum for rapid dissemination of latest technical advances in embedded systems and related areas in embedded software. The emphasis is on models, methods, and tools that ensure secure, correct, efficient and robust design of embedded systems and their applications.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信