{"title":"嵌入式 GPU 多租户推理的自适应内核合并与融合","authors":"Jaebeom Jeon;Gunjae Koo;Myung Kuk Yoon;Yunho Oh","doi":"10.1109/LES.2024.3351753","DOIUrl":null,"url":null,"abstract":"This letter proposes a new scheme that improves throughput and reduces queuing delay while running multiple inferences in embedded graphics processing unit (GPU)-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or scheduling techniques, this letter proposes a new software technique that merges and fuses kernels by monitoring the requests in a queue. The proposed technique first monitors a fixed number of requests and groups the requests running the same model. Then, it creates the kernels that iteratively process the grouped requests. We call such a technique kernel merging. After that, the proposed technique performs kernel fusion with merged kernels. Eventually, our idea minimizes the number of concurrent kernels, thus mitigating stalls caused by frequent context switching in a GPU. In our evaluation, the proposed kernel merge and fusion achieve \n<inline-formula> <tex-math>$2.7\\times $ </tex-math></inline-formula>\n better throughput, 47% shorter average kernel execution time, and 63% shorter tail latency than prior work.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"421-424"},"PeriodicalIF":1.7000,"publicationDate":"2024-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Adaptive Kernel Merge and Fusion for Multi-Tenant Inference in Embedded GPUs\",\"authors\":\"Jaebeom Jeon;Gunjae Koo;Myung Kuk Yoon;Yunho Oh\",\"doi\":\"10.1109/LES.2024.3351753\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This letter proposes a new scheme that improves throughput and reduces queuing delay while running multiple inferences in embedded graphics processing unit (GPU)-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or scheduling techniques, this letter proposes a new software technique that merges and fuses kernels by monitoring the requests in a queue. The proposed technique first monitors a fixed number of requests and groups the requests running the same model. Then, it creates the kernels that iteratively process the grouped requests. We call such a technique kernel merging. After that, the proposed technique performs kernel fusion with merged kernels. Eventually, our idea minimizes the number of concurrent kernels, thus mitigating stalls caused by frequent context switching in a GPU. 
In our evaluation, the proposed kernel merge and fusion achieve \\n<inline-formula> <tex-math>$2.7\\\\times $ </tex-math></inline-formula>\\n better throughput, 47% shorter average kernel execution time, and 63% shorter tail latency than prior work.\",\"PeriodicalId\":56143,\"journal\":{\"name\":\"IEEE Embedded Systems Letters\",\"volume\":\"16 4\",\"pages\":\"421-424\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-01-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Embedded Systems Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10384636/\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Embedded Systems Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10384636/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Adaptive Kernel Merge and Fusion for Multi-Tenant Inference in Embedded GPUs
Abstract: This letter proposes a new scheme that improves throughput and reduces queuing delay while running multiple inferences in embedded graphics processing unit (GPU)-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or scheduling techniques, this letter proposes a new software technique that merges and fuses kernels by monitoring the requests in a queue. The proposed technique first monitors a fixed number of requests and groups the requests that run the same model. Then, it creates kernels that iteratively process the grouped requests. We call this technique kernel merging. After that, the proposed technique performs kernel fusion on the merged kernels. Eventually, our idea minimizes the number of concurrent kernels, thus mitigating stalls caused by frequent context switching in a GPU. In our evaluation, the proposed kernel merge and fusion achieve $2.7\times$ higher throughput, 47% shorter average kernel execution time, and 63% shorter tail latency than prior work.
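To make the mechanism concrete, the CUDA sketch below illustrates the idea under stated assumptions: requests that accumulate in a queue are grouped by model ID, each group is served by a single merged kernel launch whose thread blocks iterate over the grouped requests, and a bias add plus ReLU are fused into that same kernel. This is a minimal illustration of the concepts the abstract names, not the letter's implementation; every identifier (Request, merged_fused_fc_kernel, drain_queue, model_params) is hypothetical.

// Illustrative sketch (not the authors' code): queued inference requests
// for the same model are merged into one kernel launch, and the bias add
// plus ReLU activation are fused into that merged kernel.
#include <cuda_runtime.h>
#include <map>
#include <utility>
#include <vector>

struct Request {        // one queued inference request (hypothetical layout)
    int    model_id;    // which deep learning model the request targets
    float *d_input;     // device pointer to this request's input vector
    float *d_output;    // device pointer to this request's output vector
};

// Merged + fused kernel: one launch serves every queued request of a model
// (merging: blockIdx.x indexes the grouped requests), and the matrix-vector
// product, bias add, and ReLU all run in one kernel (fusion), so the GPU
// holds one resident kernel instead of many concurrent ones.
__global__ void merged_fused_fc_kernel(const float *weights, const float *bias,
                                       int in_dim, int out_dim,
                                       float *const *inputs, float *const *outputs,
                                       int num_requests) {
    int req = blockIdx.x;              // one block per grouped request
    if (req >= num_requests) return;
    const float *x = inputs[req];
    float *y = outputs[req];
    for (int o = threadIdx.x; o < out_dim; o += blockDim.x) {
        float acc = bias[o];
        for (int i = 0; i < in_dim; ++i)
            acc += weights[o * in_dim + i] * x[i];
        y[o] = acc > 0.f ? acc : 0.f;  // fused ReLU
    }
}

// Host side: after observing a fixed window of requests, group them by model
// and issue one merged launch per group.
void drain_queue(std::vector<Request> &window,
                 const std::map<int, std::pair<float*, float*>> &model_params,
                 int in_dim, int out_dim) {
    std::map<int, std::vector<Request>> groups;
    for (const Request &r : window) groups[r.model_id].push_back(r);

    for (auto &g : groups) {
        int n = static_cast<int>(g.second.size());
        std::vector<float*> h_in(n), h_out(n);
        for (int i = 0; i < n; ++i) {
            h_in[i]  = g.second[i].d_input;
            h_out[i] = g.second[i].d_output;
        }
        // Device-visible arrays of per-request input/output pointers.
        float **d_in = nullptr, **d_out = nullptr;
        cudaMalloc((void**)&d_in,  n * sizeof(float*));
        cudaMalloc((void**)&d_out, n * sizeof(float*));
        cudaMemcpy(d_in,  h_in.data(),  n * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, h_out.data(), n * sizeof(float*), cudaMemcpyHostToDevice);

        const auto &p = model_params.at(g.first);   // {weights, bias} for this model
        merged_fused_fc_kernel<<<n, 256>>>(p.first, p.second, in_dim, out_dim,
                                           d_in, d_out, n);
        cudaFree(d_in);   // cudaFree synchronizes with prior work on these buffers
        cudaFree(d_out);
    }
    cudaDeviceSynchronize();
    window.clear();
}

Because each group produces exactly one launch, the GPU scheduler keeps at most one resident kernel per model in the observed window, which is the source of the reduced context switching the letter targets.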
About the journal:
The IEEE Embedded Systems Letters (ESL) provides a forum for rapid dissemination of the latest technical advances in embedded systems and related areas in embedded software. The emphasis is on models, methods, and tools that ensure secure, correct, efficient, and robust design of embedded systems and their applications.