{"title":"Variational learned talking-head semantic coded transmission system","authors":"Weijie Yue, Zhongwei Si","doi":"10.23919/JCC.fa.2024-0036.202407","DOIUrl":null,"url":null,"abstract":"Video transmission requires considerable bandwidth, and widely employed schemes prove inadequate for scenes in which talking heads feature prominently. Motivated by recent strides in talking-head generative technology, this paper introduces a semantic transmission system tailored for talking-head videos. The system captures semantic information from the talking-head video and faithfully reconstructs the source video at the receiver; only a one-shot reference frame and compact semantic features are required for the entire transmission. Specifically, we analyze video semantics frame by frame in the pixel domain and jointly process multi-frame semantic information to seamlessly incorporate spatial and temporal cues. Variational modeling is used to evaluate how importance varies across semantic groups, thereby guiding the allocation of bandwidth resources to semantics and enhancing system efficiency. The whole end-to-end system is formulated as an optimization problem equivalent to achieving optimal rate-distortion performance. We evaluate the system on both reference-frame and video transmission; experimental results demonstrate that it improves the efficiency and robustness of communication. Compared with classical approaches, the system saves over 90% of bandwidth at comparable user-perceived quality.","PeriodicalId":504777,"journal":{"name":"China Communications","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"China Communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/JCC.fa.2024-0036.202407","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Video transmission requires considerable bandwidth, and widely employed schemes prove inadequate for scenes in which talking heads feature prominently. Motivated by recent strides in talking-head generative technology, this paper introduces a semantic transmission system tailored for talking-head videos. The system captures semantic information from the talking-head video and faithfully reconstructs the source video at the receiver; only a one-shot reference frame and compact semantic features are required for the entire transmission. Specifically, we analyze video semantics frame by frame in the pixel domain and jointly process multi-frame semantic information to seamlessly incorporate spatial and temporal cues. Variational modeling is used to evaluate how importance varies across semantic groups, thereby guiding the allocation of bandwidth resources to semantics and enhancing system efficiency. The whole end-to-end system is formulated as an optimization problem equivalent to achieving optimal rate-distortion performance. We evaluate the system on both reference-frame and video transmission; experimental results demonstrate that it improves the efficiency and robustness of communication. Compared with classical approaches, the system saves over 90% of bandwidth at comparable user-perceived quality.
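The abstract's core idea of using variational importance estimates to steer bandwidth allocation under a rate-distortion objective can be illustrated with a minimal sketch. Note this is not the authors' implementation: the feature groups, the use of empirical spread as an entropy proxy, the symbol budget, and the toy distortion model below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame semantic feature groups (8 groups, 16 dims each).
groups = rng.normal(size=(8, 16))

# Variational-style importance proxy: treat each group's empirical spread
# as a stand-in for the scale of its learned posterior -- a group with
# higher spread carries more information and merits more bandwidth.
importance = groups.std(axis=1)
weights = importance / importance.sum()

# Allocate a fixed channel-symbol budget across groups in proportion to
# importance, guaranteeing every group at least one symbol.
total_symbols = 256
allocation = np.maximum(1, np.round(weights * total_symbols)).astype(int)

# Toy end-to-end objective: distortion shrinks as a group gets more
# symbols, and lam trades distortion off against total rate,
# mirroring a rate-distortion formulation L = D + lam * R.
distortion = (importance**2 / allocation).sum()
rate = allocation.sum()
lam = 0.01
loss = distortion + lam * rate
```

In the actual system the importance estimates and the allocation policy would be learned end to end rather than computed from empirical statistics; the sketch only shows how unequal group importance translates into unequal bandwidth shares inside a single scalar objective.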