STVAI：探索可扩展和高效智能视频推理的时空相似性

IF 3.4 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing Pub Date : 2025-04-03 DOI:10.1016/j.jpdc.2025.105079

Chuang Li , Heshi Wang , Yanhua Wen , Qingyu Shi , Qinyu Wang , Chunhua Hu , Dongchen Wu

{"title":"STVAI：探索可扩展和高效智能视频推理的时空相似性","authors":"Chuang Li , Heshi Wang , Yanhua Wen , Qingyu Shi , Qinyu Wang , Chunhua Hu , Dongchen Wu","doi":"10.1016/j.jpdc.2025.105079","DOIUrl":null,"url":null,"abstract":"<div><div>The integration of video data computation and inference is a cornerstone for the evolution of multimodal artificial intelligence (MAI). The extensive adoption and optimization of CNN-based frameworks have significantly improved the accuracy of video inference, yet they present substantial challenges for real-time and large-scale computational demands. Existing researches primarily utilize the temporal similarity between video frames to reduce redundant computations, but most of them overlooked the spatial similarity within the frames themselves. Hence, we propose STVAI, a scalable and efficient method that leverages both spatial and temporal similarities to accelerate video inference. This approach uses a parallel region merging strategy, which maintains inference accuracy and enhances the sparsity of the computation matrix. Moreover, we have optimized the computation of sparse convolutions by utilizing Tensor Cores, which accelerate dense convolution computations based on the sparsity of the tiles. Experimental results demonstrate that STVAI achieves a stable acceleration of 1.25 times faster than cuDNN implementations, with only a 5% decrease in prediction accuracy. STVAI can achieve accelerations up to 1.53x, surpassing that of existing methods. Our method can be directly applied to various CNN architectures for video inference tasks without the need for retraining the model.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"201 ","pages":"Article 105079"},"PeriodicalIF":3.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"STVAI: Exploring spatio-temporal similarity for scalable and efficient intelligent video inference\",\"authors\":\"Chuang Li , Heshi Wang , Yanhua Wen , Qingyu Shi , Qinyu Wang , Chunhua Hu , Dongchen Wu\",\"doi\":\"10.1016/j.jpdc.2025.105079\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The integration of video data computation and inference is a cornerstone for the evolution of multimodal artificial intelligence (MAI). The extensive adoption and optimization of CNN-based frameworks have significantly improved the accuracy of video inference, yet they present substantial challenges for real-time and large-scale computational demands. Existing researches primarily utilize the temporal similarity between video frames to reduce redundant computations, but most of them overlooked the spatial similarity within the frames themselves. Hence, we propose STVAI, a scalable and efficient method that leverages both spatial and temporal similarities to accelerate video inference. This approach uses a parallel region merging strategy, which maintains inference accuracy and enhances the sparsity of the computation matrix. Moreover, we have optimized the computation of sparse convolutions by utilizing Tensor Cores, which accelerate dense convolution computations based on the sparsity of the tiles. Experimental results demonstrate that STVAI achieves a stable acceleration of 1.25 times faster than cuDNN implementations, with only a 5% decrease in prediction accuracy. STVAI can achieve accelerations up to 1.53x, surpassing that of existing methods. Our method can be directly applied to various CNN architectures for video inference tasks without the need for retraining the model.</div></div>\",\"PeriodicalId\":54775,\"journal\":{\"name\":\"Journal of Parallel and Distributed Computing\",\"volume\":\"201 \",\"pages\":\"Article 105079\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Parallel and Distributed Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0743731525000462\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Parallel and Distributed Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0743731525000462","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

摘要

视频数据计算和推理的集成是多模态人工智能（MAI）发展的基石。基于cnn的框架的广泛采用和优化大大提高了视频推理的准确性，但它们对实时和大规模计算需求提出了实质性挑战。现有研究主要利用视频帧之间的时间相似性来减少冗余计算，但大多忽略了帧本身的空间相似性。因此，我们提出了STVAI，这是一种可扩展且高效的方法，它利用空间和时间相似性来加速视频推理。该方法采用并行区域合并策略，既保持了推理精度，又提高了计算矩阵的稀疏性。此外，我们还利用Tensor Cores优化了稀疏卷积的计算，该算法基于贴图的稀疏性加速了密集卷积的计算。实验结果表明，STVAI实现的稳定加速速度比cuDNN实现快1.25倍，预测精度仅下降5%。STVAI可以实现高达1.53倍的加速度，超过了现有的方法。我们的方法可以直接应用于各种CNN架构的视频推理任务，而不需要对模型进行重新训练。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

STVAI: Exploring spatio-temporal similarity for scalable and efficient intelligent video inference

The integration of video data computation and inference is a cornerstone for the evolution of multimodal artificial intelligence (MAI). The extensive adoption and optimization of CNN-based frameworks have significantly improved the accuracy of video inference, yet they present substantial challenges for real-time and large-scale computational demands. Existing researches primarily utilize the temporal similarity between video frames to reduce redundant computations, but most of them overlooked the spatial similarity within the frames themselves. Hence, we propose STVAI, a scalable and efficient method that leverages both spatial and temporal similarities to accelerate video inference. This approach uses a parallel region merging strategy, which maintains inference accuracy and enhances the sparsity of the computation matrix. Moreover, we have optimized the computation of sparse convolutions by utilizing Tensor Cores, which accelerate dense convolution computations based on the sparsity of the tiles. Experimental results demonstrate that STVAI achieves a stable acceleration of 1.25 times faster than cuDNN implementations, with only a 5% decrease in prediction accuracy. STVAI can achieve accelerations up to 1.53x, surpassing that of existing methods. Our method can be directly applied to various CNN architectures for video inference tasks without the need for retraining the model.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Parallel and Distributed Computing 工程技术-计算机：理论方法

CiteScore

10.30

自引率

2.60%

发文量

172

审稿时长

12 months

期刊介绍： This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing. The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics; again covering the full range from the design to the use of our targeted systems.