VideoQA-SC: Adaptive Semantic Communication for Video Question Answering

IEEE Journal on Selected Areas in Communications (a publication of the IEEE Communications Society), vol. 43, no. 7, pp. 2462-2477. Published 2025-04-09. DOI: 10.1109/JSAC.2025.3559160

Jiangyuan Guo, Wei Chen, Yuxuan Sun, Jialong Xu, Bo Ai


Citation count: 0

Abstract


Although semantic communication (SC) has shown its potential in efficiently transmitting multimodal data such as text, speech, and images, SC for videos has focused primarily on pixel-level reconstruction. However, these SC systems may be suboptimal for downstream intelligent tasks. Moreover, SC systems without pixel-level video reconstruction offer advantages: higher bandwidth efficiency and real-time performance on various intelligent tasks. The difficulty in designing such systems lies in extracting compact, task-related semantic representations and delivering them accurately over noisy channels. In this paper, we propose an end-to-end SC system, named VideoQA-SC, for video question answering (VideoQA) tasks. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels, bypassing the need for video reconstruction at the receiver. To this end, we develop a spatiotemporal semantic encoder for effective video semantic extraction, and a learning-based bandwidth-adaptive deep joint source-channel coding (DJSCC) scheme for efficient and robust video semantic transmission. Experiments demonstrate that, under a wide range of channel conditions and bandwidth constraints, VideoQA-SC outperforms traditional and advanced DJSCC-based SC systems that rely on video reconstruction at the receiver. In particular, when the signal-to-noise ratio is low, VideoQA-SC improves answer accuracy by 5.17% while saving almost 99.5% of the bandwidth, compared with the advanced DJSCC-based SC system. Our results show the great potential of SC system design for video applications.
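The abstract describes sending compact video semantics over noisy or fading channels under varying bandwidth budgets. As a rough illustration of the channel side of a generic DJSCC pipeline, the sketch below uses common assumptions from the DJSCC literature rather than VideoQA-SC's published design: average-power normalization of complex channel symbols, an AWGN channel with optional Rayleigh block fading whose gain is known at the receiver, and a hypothetical `select_bandwidth` helper that adapts bandwidth by keeping only a leading fraction of the symbols. All function names are illustrative, and the random "features" merely stand in for encoder output; the paper's learned bandwidth-adaptation scheme is not reproduced here.

```python
import math
import random


def normalize_power(symbols):
    """Scale complex channel symbols so the average power E[|x|^2] equals 1."""
    avg_power = sum(abs(s) ** 2 for s in symbols) / len(symbols)
    scale = 1.0 / math.sqrt(avg_power)
    return [s * scale for s in symbols]


def select_bandwidth(symbols, ratio):
    """Hypothetical bandwidth control: keep only the leading `ratio` fraction of symbols."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    k = max(1, round(len(symbols) * ratio))
    return symbols[:k]


def transmit(symbols, snr_db, fading=False, rng=None):
    """Send unit-power complex symbols over AWGN, optionally with Rayleigh block fading.

    Returns (received, h), where h is the single fading gain shared by the whole
    block (h = 1 for a pure AWGN channel).
    """
    rng = rng or random.Random()
    noise_var = 10 ** (-snr_db / 10)   # signal power is 1, so SNR = 1 / noise_var
    sigma = math.sqrt(noise_var / 2)   # noise std per real/imaginary component
    h = complex(1.0, 0.0)
    if fading:
        # h ~ CN(0, 1): real and imaginary parts each have variance 1/2
        h = complex(rng.gauss(0, math.sqrt(0.5)), rng.gauss(0, math.sqrt(0.5)))
    received = [h * s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for s in symbols]
    return received, h


def equalize(received, h):
    """Zero-forcing equalization, assuming the receiver knows h perfectly."""
    return [y / h for y in received]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for semantic encoder output (random numbers, not real video semantics)
    features = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
    tx = normalize_power(select_bandwidth(features, 0.25))  # keep 250 channel symbols
    rx, h = transmit(tx, snr_db=0, fading=True, rng=rng)
    estimate = equalize(rx, h)  # would be passed to a semantic decoder / answer head
    print(len(tx), len(estimate))
```

Because every step is simple arithmetic on the symbols, a deep-learning framework version of this layer could sit between an encoder and decoder during end-to-end training; this pure-Python version only illustrates the signal model.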

