Multiscale Spatio-Temporal Fusion Network for video dehazing

Impact Factor 3.5 · CAS Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)

Computer Vision and Image Understanding · Pub Date: 2025-08-07 · DOI: 10.1016/j.cviu.2025.104462

Qingru Zhang, Guorong Chen, Yixuan Zhang, Jinmei Zhang, Shaofeng Liu, Jian Wang

Volume 260, Article 104462. Full text: https://www.sciencedirect.com/science/article/pii/S1077314225001857

Citations: 0

Abstract

Video dehazing aims to restore high-resolution, high-contrast haze-free frames, which is crucial in engineering applications such as intelligent traffic monitoring systems. These systems rely heavily on clear visual information to ensure accurate decision-making and reliable operation. However, despite the significant advances achieved by deep learning methods, existing approaches still face challenges when dealing with diverse real-world scenarios. To address these issues, we propose the Multi-Scale Spatio-Temporal Fusion Network (MSTF-Net), a novel framework designed to enhance video dehazing performance in complex engineering environments. Specifically, the MainAux Encoder integrates multi-source information through a progressively enhanced feature fusion mechanism, improving the representation of both global dynamics and local details. Furthermore, the Spatio-Temporal Adaptive Fusion (STAF) module ensures robust temporal consistency and spatial clarity by leveraging multi-level spatio-temporal information fusion. To evaluate our framework, we constructed a challenging dataset named “DarkRoad”, which includes low-light, unevenly lit, and dynamic outdoor scenarios, addressing key limitations of existing datasets for video dehazing. Extensive experiments demonstrate that MSTF-Net achieves state-of-the-art performance, excelling particularly in applications that require high clarity, strong contrast, and detail preservation, and providing a reliable solution to video dehazing problems in practical engineering scenarios.


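Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days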

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems