VaVLM: Toward Efficient Edge-Cloud Video Analytics With Vision-Language Models

IF 4.8 · CAS Tier 1 (Computer Science) · JCR Q2 (Engineering, Electrical & Electronic)
Yang Zhang;Hanling Wang;Qing Bai;Haifeng Liang;Peican Zhu;Gabriel-Miro Muntean;Qing Li
{"title":"VaVLM: Toward Efficient Edge-Cloud Video Analytics With Vision-Language Models","authors":"Yang Zhang;Hanling Wang;Qing Bai;Haifeng Liang;Peican Zhu;Gabriel-Miro Muntean;Qing Li","doi":"10.1109/TBC.2025.3549983","DOIUrl":null,"url":null,"abstract":"The advancement of Large Language Models (LLMs) with vision capabilities in recent years has elevated video analytics applications to new heights. To address the limited computing and bandwidth resources on edge devices, edge-cloud collaborative video analytics has emerged as a promising paradigm. However, most existing edge-cloud video analytics systems are designed for traditional deep learning models (e.g., image classification and object detection), where each model handles a specific task. In this paper, we introduce VaVLM, a novel edge-cloud collaborative video analytics system tailored for Vision-Language Models (VLMs), which can support multiple tasks using a single model. VaVLM aims to enhance the performance of VLM-powered video analytics systems in three key aspects. First, to reduce bandwidth consumption during video transmission, we propose a novel Region-of-Interest (RoI) generation mechanism based on the VLM’s understanding of the task and scene. Second, to lower inference costs, we design a task-oriented inference trigger that processes only a subset of video frames using an optimized inference logic. Third, to improve inference accuracy, the model is augmented with additional information from both the environment and auxiliary analytics models during the inference stage. 
Extensive experiments on real-world datasets demonstrate that VaVLM achieves an 80.3% reduction in bandwidth consumption and an 89.5% reduction in computational cost compared to baseline methods.","PeriodicalId":13159,"journal":{"name":"IEEE Transactions on Broadcasting","volume":"71 2","pages":"529-541"},"PeriodicalIF":4.8000,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10947590","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Broadcasting","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10947590/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

The advancement of Large Language Models (LLMs) with vision capabilities in recent years has elevated video analytics applications to new heights. To address the limited computing and bandwidth resources on edge devices, edge-cloud collaborative video analytics has emerged as a promising paradigm. However, most existing edge-cloud video analytics systems are designed for traditional deep learning models (e.g., image classification and object detection), where each model handles a specific task. In this paper, we introduce VaVLM, a novel edge-cloud collaborative video analytics system tailored for Vision-Language Models (VLMs), which can support multiple tasks using a single model. VaVLM aims to enhance the performance of VLM-powered video analytics systems in three key aspects. First, to reduce bandwidth consumption during video transmission, we propose a novel Region-of-Interest (RoI) generation mechanism based on the VLM’s understanding of the task and scene. Second, to lower inference costs, we design a task-oriented inference trigger that processes only a subset of video frames using an optimized inference logic. Third, to improve inference accuracy, the model is augmented with additional information from both the environment and auxiliary analytics models during the inference stage. Extensive experiments on real-world datasets demonstrate that VaVLM achieves an 80.3% reduction in bandwidth consumption and an 89.5% reduction in computational cost compared to baseline methods.
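The abstract's second contribution, a task-oriented inference trigger that invokes the VLM on only a subset of frames, can be illustrated with a minimal sketch. The paper's actual trigger logic is not described here; the example below assumes a simple inter-frame change heuristic (mean normalized pixel difference against a threshold), which is one common way to decide when a new inference is worthwhile.

```python
import numpy as np

def should_trigger_inference(prev_frame, curr_frame, threshold=0.05):
    """Decide whether the scene changed enough to warrant a new VLM call.

    prev_frame, curr_frame: HxWx3 uint8 arrays.
    threshold: mean absolute pixel difference, normalized to [0, 1],
    above which inference is triggered. Both the heuristic and the
    default threshold are illustrative assumptions, not the paper's method.
    """
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    change = float(diff.mean()) / 255.0
    return change > threshold

# A static scene skips inference; a substantially changed one triggers it.
static = np.zeros((4, 4, 3), dtype=np.uint8)
moved = np.full((4, 4, 3), 200, dtype=np.uint8)
print(should_trigger_inference(static, static))  # False
print(should_trigger_inference(static, moved))   # True
```

Skipping redundant frames this way is what drives the kind of computational-cost reduction the abstract reports, since each avoided trigger saves a full VLM forward pass.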
Source journal: IEEE Transactions on Broadcasting (Engineering & Technology – Telecommunications)
CiteScore: 9.40
Self-citation rate: 31.10%
Articles per year: 79
Review time: 6-12 weeks
Journal scope: The Society’s Field of Interest is “Devices, equipment, techniques and systems related to broadcast technology, including the production, distribution, transmission, and propagation aspects.” In addition to this formal FOI statement, which is used to provide guidance to the Publications Committee in the selection of content, the AdCom has further resolved that “broadcast systems includes all aspects of transmission, propagation, and reception.”