{"title":"STDF: Spatio-Temporal Deformable Fusion for Video Quality Enhancement on Embedded Platforms","authors":"Jianing Deng, Shunjie Dong, Lvcheng Chen, Jingtong Hu, Cheng Zhuo","doi":"10.1145/3645113","DOIUrl":null,"url":null,"abstract":"<p>With the development of embedded systems and deep learning, it is feasible to combine them for offering various and convenient human-centered services, which is based on high-quality (HQ) videos. However, due to the limit of video traffic load and unavoidable noise, the visual quality of an image from an edge camera may degrade significantly, influencing the overall video and service quality. To maintain video stability, video quality enhancement (QE), aiming at recovering high-quality (HQ) videos from their distorted low-quality (LQ) sources, has aroused increasing attention in recent years. The key challenge for video quality enhancement lies in how to effectively aggregate complementary information from multiple frames (i.e., temporal fusion). To handle diverse motion in videos, existing methods commonly apply motion compensation before the temporal fusion. However, the motion field estimated from the distorted LQ video tends to be inaccurate and unreliable, thereby resulting in ineffective fusion and restoration. In addition, motion estimation for consecutive frames is generally conducted in a pairwise manner, which leads to expensive and inefficient computation. In this paper, we propose a fast yet effective temporal fusion scheme for video QE by incorporating a novel Spatio-Temporal Deformable Convolution (STDC) to simultaneously compensate motion and aggregate temporal information. Specifically, the proposed temporal fusion scheme takes a target frame along with its adjacent reference frames as input to jointly estimate an offset field to deform the spatio-temporal sampling positions of convolution. As a result, complementary information from multiple frames can be fused within the STDC operation in one forward pass. Extensive experimental results on three benchmark datasets show that our method performs favorably to the state-of-the-arts in terms of accuracy and efficiency.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"11 1","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Embedded Computing Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3645113","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Abstract
With the development of embedded systems and deep learning, it has become feasible to combine the two to offer a variety of convenient human-centered services, many of which depend on high-quality (HQ) video. However, owing to constraints on video traffic load and unavoidable noise, the visual quality of frames captured by an edge camera may degrade significantly, harming the overall video and service quality. To maintain video quality, video quality enhancement (QE), which aims to recover HQ videos from their distorted low-quality (LQ) sources, has attracted increasing attention in recent years. The key challenge in video QE lies in effectively aggregating complementary information from multiple frames (i.e., temporal fusion). To handle diverse motion in videos, existing methods commonly apply motion compensation before temporal fusion. However, the motion field estimated from a distorted LQ video tends to be inaccurate and unreliable, resulting in ineffective fusion and restoration. In addition, motion estimation for consecutive frames is generally conducted in a pairwise manner, which makes the computation expensive and inefficient. In this paper, we propose a fast yet effective temporal fusion scheme for video QE that incorporates a novel Spatio-Temporal Deformable Convolution (STDC) to compensate motion and aggregate temporal information simultaneously. Specifically, the proposed scheme takes a target frame along with its adjacent reference frames as input and jointly estimates an offset field that deforms the spatio-temporal sampling positions of the convolution. As a result, complementary information from multiple frames is fused within a single STDC operation in one forward pass. Extensive experimental results on three benchmark datasets show that our method compares favorably with the state of the art in terms of both accuracy and efficiency.
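To make the fusion idea concrete, below is a minimal sketch (not the authors' released implementation) of how a single deformable convolution can fuse a stack of frames in one forward pass, using PyTorch and torchvision's deform_conv2d. Frames are stacked along the channel axis, each frame gets its own offset group so its sampling positions deform independently, and the offset field is predicted jointly from all input frames, as described in the abstract. The module and parameter names (STDCFusion, num_frames, channels) are illustrative assumptions, and this 2D approximation stands in for the paper's full spatio-temporal operator.

```python
# Minimal sketch of spatio-temporal deformable fusion (assumption: grayscale
# frames stacked as channels, one offset group per frame). Not the authors'
# code; torchvision.ops.deform_conv2d supplies the deformable sampling.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class STDCFusion(nn.Module):
    """Fuse T consecutive LQ frames into one feature map in a single pass."""

    def __init__(self, num_frames=7, channels=64, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # Offset field predicted jointly from all frames:
        # 2 offsets (dy, dx) per kernel position, per frame (offset group).
        self.offset_net = nn.Sequential(
            nn.Conv2d(num_frames, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * num_frames * kernel_size * kernel_size,
                      3, padding=1),
        )
        # Fusion convolution weights: all frames are aggregated into
        # `channels` output features by the deformable convolution itself.
        self.weight = nn.Parameter(
            torch.randn(channels, num_frames, kernel_size, kernel_size) * 1e-2)

    def forward(self, frames):
        # frames: (B, T, H, W), target frame in the middle of the window.
        offsets = self.offset_net(frames)                # (B, 2*T*k*k, H, W)
        fused = deform_conv2d(frames, offsets, self.weight,
                              padding=self.k // 2)       # (B, channels, H, W)
        return fused


# Usage: fuse a 7-frame window of a 64x64 clip; the fused feature map would
# then feed a restoration network that reconstructs the enhanced target frame.
x = torch.randn(2, 7, 64, 64)
feat = STDCFusion(num_frames=7, channels=64)(x)
print(feat.shape)  # torch.Size([2, 64, 64, 64])
```

Because motion compensation and temporal aggregation happen inside one deformable convolution, no explicit pairwise motion estimation between frames is needed, which is the source of the efficiency gain the abstract describes.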
About the Journal
The design of embedded computing systems, both the software and hardware, increasingly relies on sophisticated algorithms, analytical models, and methodologies. ACM Transactions on Embedded Computing Systems (TECS) aims to present the leading work relating to the analysis, design, behavior, and experience with embedded computing systems.