Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo
{"title":"特邀社论:野生图像的高级修复和增强","authors":"Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo","doi":"10.1049/cvi2.12283","DOIUrl":null,"url":null,"abstract":"<p>Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.</p><p>In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.</p><p>The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each of the paper in this special issue is as follows.</p><p>Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. Experiments on both synthetic and real-world datasets show that their method produces state-of-the-art accuracy with high efficiency.</p><p>Gu et al. develop a temporal shift reconstruction network for compressive video sensing. To exploit the temporal cues between adjacent frames during the reconstruction of videos, most previous approaches commonly preform alignment between initial reconstructions. 
However, the estimated motions are usually too coarse to provide accurate temporal information. To remedy this, the proposed network employs stacked temporal shift reconstruction blocks to enhance the initial reconstruction progressively. Within each block, an efficient temporal shift operation is used to capture temporal structures in addition to computational overheads. Then, a bidirectional alignment module is adopted to capture the temporal dependencies in a video sequence. Different from previous methods that only extract supplementary information from the key frames, the proposed alignment module can receive temporal information from the whole video sequence via bidirectional propagations. Experiments demonstrate the superior performance of the proposed method.</p><p>Qu et al. propose a lightweight video frame interpolation network with a three-scale encoding-decoding structure. Specifically, multi-scale motion information is first extracted from the input video. Then, recurrent convolutional layers are adopted to refine the resultant features. Afterwards, the resultant features are aggregated to generate high-quality interpolated frames. Experimental results on the CelebA and Helen datasets show that the proposed method outperforms state-of-the-art methods while using fewer parameters.</p><p>Dou et al. introduce a decoder structure-guided CNN-Transformer network for face super-resolution. Most previous approaches follow a multi-task learning paradigm to perform landmark detection while super-resolving the low-resolution images. However, these methods require additional annotation cost, and the extracted facial prior structures are usually of low quality. To address these issues, the proposed network employs a global-local feature extraction unit to extract the global structure while capturing local texture details. In addition, a multi-state fusion module is incorporated to aggregate embeddings from different stages. Experiments show that the proposed method surpasses previous approaches by notable margins.</p><p>Yang et al. study the problem of blind super-resolution and propose a method to exploit degradation information through degradation representation learning. Specifically, a generative adversarial network is employed to model the degradation process from HR images to LR images and constrain the data distribution of the synthetic LR images. Then, the learnt representation is adopted to super-resolve the input low-resolution images using a transformer-based SR network. Experiments on both synthetic and real-world datasets demonstrate the effectiveness and superiority of the proposed method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"435-438"},"PeriodicalIF":1.5000,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12283","citationCount":"0","resultStr":"{\"title\":\"Guest Editorial: Advanced image restoration and enhancement in the wild\",\"authors\":\"Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo\",\"doi\":\"10.1049/cvi2.12283\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. 
Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.</p><p>In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.</p><p>The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each of the paper in this special issue is as follows.</p><p>Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. Experiments on both synthetic and real-world datasets show that their method produces state-of-the-art accuracy with high efficiency.</p><p>Gu et al. develop a temporal shift reconstruction network for compressive video sensing. To exploit the temporal cues between adjacent frames during the reconstruction of videos, most previous approaches commonly preform alignment between initial reconstructions. However, the estimated motions are usually too coarse to provide accurate temporal information. To remedy this, the proposed network employs stacked temporal shift reconstruction blocks to enhance the initial reconstruction progressively. Within each block, an efficient temporal shift operation is used to capture temporal structures in addition to computational overheads. Then, a bidirectional alignment module is adopted to capture the temporal dependencies in a video sequence. 
Different from previous methods that only extract supplementary information from the key frames, the proposed alignment module can receive temporal information from the whole video sequence via bidirectional propagations. Experiments demonstrate the superior performance of the proposed method.</p><p>Qu et al. propose a lightweight video frame interpolation network with a three-scale encoding-decoding structure. Specifically, multi-scale motion information is first extracted from the input video. Then, recurrent convolutional layers are adopted to refine the resultant features. Afterwards, the resultant features are aggregated to generate high-quality interpolated frames. Experimental results on the CelebA and Helen datasets show that the proposed method outperforms state-of-the-art methods while using fewer parameters.</p><p>Dou et al. introduce a decoder structure-guided CNN-Transformer network for face super-resolution. Most previous approaches follow a multi-task learning paradigm to perform landmark detection while super-resolving the low-resolution images. However, these methods require additional annotation cost, and the extracted facial prior structures are usually of low quality. To address these issues, the proposed network employs a global-local feature extraction unit to extract the global structure while capturing local texture details. In addition, a multi-state fusion module is incorporated to aggregate embeddings from different stages. Experiments show that the proposed method surpasses previous approaches by notable margins.</p><p>Yang et al. study the problem of blind super-resolution and propose a method to exploit degradation information through degradation representation learning. Specifically, a generative adversarial network is employed to model the degradation process from HR images to LR images and constrain the data distribution of the synthetic LR images. Then, the learnt representation is adopted to super-resolve the input low-resolution images using a transformer-based SR network. Experiments on both synthetic and real-world datasets demonstrate the effectiveness and superiority of the proposed method.</p>\",\"PeriodicalId\":56304,\"journal\":{\"name\":\"IET Computer Vision\",\"volume\":\"18 4\",\"pages\":\"435-438\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2024-04-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12283\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IET Computer Vision\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1049/cvi2.12283\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/cvi2.12283","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Guest Editorial: Advanced image restoration and enhancement in the wild
Image restoration and enhancement have long been fundamental tasks in computer vision and are widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, deep learning techniques have brought remarkable progress. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown; (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data is only available in unpaired form; (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D cameras) into image restoration; (iv) achieving real-time inference on edge devices remains difficult yet important for image restoration and enhancement methods; and (v) it is difficult to provide confidence estimates or performance bounds for a learning-based method across different images/regions. This Special Issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.
For this Special Issue, we received 17 papers, of which 8 underwent the peer-review process while the rest were desk-rejected. Among the reviewed papers, 5 were accepted and 3 were rejected because they did not meet the criteria of IET Computer Vision. Overall, the accepted submissions are of high quality, which marks the success of this Special Issue.
The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category aims at reconstructing high-quality videos and comprises the papers by Zhang et al., Gu et al., and Qu et al. The second category studies the task of image super-resolution and comprises the papers by Dou et al. and Yang et al. A brief presentation of each paper in this Special Issue follows.
Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task, as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data into structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level, without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. Experiments on both synthetic and real-world datasets show that the method achieves state-of-the-art accuracy with high efficiency.
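To make the contrast with voxelisation concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' code) of point-level event processing: each raw event (x, y, t, polarity) is embedded by a shared MLP, in the spirit of PointNet, and the per-event features are pooled into a temporal descriptor without ever being rasterised onto a voxel grid. All module and variable names are illustrative.

```python
# Hypothetical sketch of point-level event processing (not the authors' code).
import torch
import torch.nn as nn

class PointEventEncoder(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Shared point-wise MLP applied independently to every event
        self.mlp = nn.Sequential(
            nn.Linear(4, 32), nn.ReLU(inplace=True),
            nn.Linear(32, feat_dim), nn.ReLU(inplace=True),
        )

    def forward(self, events):
        # events: (N, 4) tensor of raw (x, y, t, polarity) values
        feats = self.mlp(events)          # (N, feat_dim) per-event features
        return feats.max(dim=0).values    # (feat_dim,) pooled temporal descriptor

encoder = PointEventEncoder()
dummy_events = torch.rand(1000, 4)        # 1000 synthetic events
print(encoder(dummy_events).shape)        # torch.Size([64])
```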
Gu et al. develop a temporal shift reconstruction network for compressive video sensing. To exploit the temporal cues between adjacent frames during video reconstruction, most previous approaches perform alignment between initial reconstructions. However, the estimated motions are usually too coarse to provide accurate temporal information. To remedy this, the proposed network employs stacked temporal shift reconstruction blocks to enhance the initial reconstruction progressively. Within each block, an efficient temporal shift operation is used to capture temporal structures with little additional computational overhead. Then, a bidirectional alignment module is adopted to capture the temporal dependencies in a video sequence. Different from previous methods that only extract supplementary information from the key frames, the proposed alignment module can receive temporal information from the whole video sequence via bidirectional propagation. Experiments demonstrate the superior performance of the proposed method.
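The temporal shift idea can be illustrated with a short, hypothetical PyTorch sketch (assuming a TSM-style shift; the authors' exact operation may differ): a fraction of feature channels is shifted forward and backward along the time axis, so temporal context is exchanged between neighbouring frames with no learnable parameters and negligible computation.

```python
import torch

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels forward/backward along the time axis.

    x: (B, T, C, H, W) feature tensor. Assumed TSM-style sketch; adds no
    learnable parameters and negligible computation.
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift one channel slice forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift another slice backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out

feats = torch.rand(2, 5, 64, 32, 32)   # (batch, frames, channels, H, W)
print(temporal_shift(feats).shape)     # torch.Size([2, 5, 64, 32, 32])
```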
Qu et al. propose a lightweight video frame interpolation network with a three-scale encoding-decoding structure. Specifically, multi-scale motion information is first extracted from the input video. Then, recurrent convolutional layers are adopted to refine the extracted features. Afterwards, the refined features are aggregated to generate high-quality interpolated frames. Experimental results on the CelebA and Helen datasets show that the proposed method outperforms state-of-the-art methods while using fewer parameters.
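As a rough illustration of a three-scale encoding-decoding design with recurrent refinement (a toy sketch under assumed layer choices, not the proposed architecture), the snippet below extracts features from two input frames at three spatial scales, reuses a shared convolution recurrently to refine each scale, and fuses the upsampled features into an interpolated frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeScaleInterpolator(nn.Module):
    """Toy three-scale encoder with recurrent refinement and fusion (illustrative only)."""
    def __init__(self, ch=32, refine_steps=3):
        super().__init__()
        self.enc = nn.Conv2d(6, ch, 3, padding=1)       # two RGB frames stacked on channels
        self.refine = nn.Conv2d(ch, ch, 3, padding=1)   # shared conv applied recurrently
        self.out = nn.Conv2d(3 * ch, 3, 3, padding=1)   # fuse three scales -> RGB frame
        self.steps = refine_steps

    def forward(self, f0, f1):
        x = torch.cat([f0, f1], dim=1)
        feats = []
        for s in (1.0, 0.5, 0.25):                      # three spatial scales
            xs = x if s == 1.0 else F.interpolate(x, scale_factor=s, mode='bilinear', align_corners=False)
            h = self.enc(xs)
            for _ in range(self.steps):                 # recurrent refinement with a shared conv
                h = torch.relu(self.refine(h)) + h
            feats.append(F.interpolate(h, size=x.shape[-2:], mode='bilinear', align_corners=False))
        return torch.sigmoid(self.out(torch.cat(feats, dim=1)))

net = ThreeScaleInterpolator()
frame = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```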
Dou et al. introduce a decoder structure-guided CNN-Transformer network for face super-resolution. Most previous approaches follow a multi-task learning paradigm to perform landmark detection while super-resolving the low-resolution images. However, these methods incur additional annotation costs, and the extracted facial prior structures are usually of low quality. To address these issues, the proposed network employs a global-local feature extraction unit to extract the global structure while capturing local texture details. In addition, a multi-state fusion module is incorporated to aggregate embeddings from different stages. Experiments show that the proposed method surpasses previous approaches by notable margins.
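A hypothetical sketch of a global-local feature extraction unit is given below (an assumed design, not the authors' implementation): a convolutional branch captures local texture details, a transformer encoder layer applied to flattened spatial tokens captures the global facial structure, and a 1x1 convolution fuses the two branches with a residual connection.

```python
import torch
import torch.nn as nn

class GlobalLocalBlock(nn.Module):
    """Toy global-local unit: conv branch for local texture, transformer branch
    for global structure, fused by a 1x1 conv (illustrative only)."""
    def __init__(self, ch=32):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.global_attn = nn.TransformerEncoderLayer(
            d_model=ch, nhead=4, dim_feedforward=2 * ch, batch_first=True)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.local(x)                               # local texture details
        tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C) spatial tokens
        glob = self.global_attn(tokens)                     # global self-attention
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1)) + x  # residual fusion

block = GlobalLocalBlock()
print(block(torch.rand(1, 32, 16, 16)).shape)  # torch.Size([1, 32, 16, 16])
```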
Yang et al. study the problem of blind super-resolution and propose a method that exploits degradation information through degradation representation learning. Specifically, a generative adversarial network is employed to model the degradation process from high-resolution (HR) images to low-resolution (LR) images and to constrain the data distribution of the synthetic LR images. Then, the learnt representation is used to super-resolve the input low-resolution images with a transformer-based SR network. Experiments on both synthetic and real-world datasets demonstrate the effectiveness and superiority of the proposed method.
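The conditioning part of such a pipeline can be sketched as follows (a toy example under assumed architectural choices; the GAN-based degradation modelling is omitted): an encoder maps the low-resolution input to a degradation embedding, which then modulates the features of an SR head so that reconstruction adapts to the estimated degradation.

```python
import torch
import torch.nn as nn

class DegradationEncoder(nn.Module):
    """Toy encoder mapping an LR image to a degradation embedding (illustrative only)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, lr):
        return self.net(lr).flatten(1)          # (B, dim) degradation representation

class ConditionedSR(nn.Module):
    """Toy x2 SR head whose features are scaled by the degradation embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Conv2d(3, dim, 3, padding=1)
        self.to_scale = nn.Linear(dim, dim)     # embedding -> per-channel modulation
        self.up = nn.Sequential(nn.Conv2d(dim, 12, 3, padding=1), nn.PixelShuffle(2))

    def forward(self, lr, deg):
        feat = torch.relu(self.body(lr))
        scale = torch.sigmoid(self.to_scale(deg)).unsqueeze(-1).unsqueeze(-1)
        return self.up(feat * scale)            # degradation-aware reconstruction

lr = torch.rand(1, 3, 32, 32)
deg = DegradationEncoder()(lr)
print(ConditionedSR()(lr, deg).shape)           # torch.Size([1, 3, 64, 64])
```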
Journal description:
IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The vision of the journal is to publish the highest quality research work that is relevant and topical to the field, but not forgetting those works that aim to introduce new horizons and set the agenda for future avenues of research in computer vision.
IET Computer Vision welcomes submissions on the following topics:
Biologically and perceptually motivated approaches to low level vision (feature detection, etc.)
Perceptual grouping and organisation
Representation, analysis and matching of 2D and 3D shape
Shape-from-X
Object recognition
Image understanding
Learning with visual inputs
Motion analysis and object tracking
Multiview scene analysis
Cognitive approaches in low, mid and high level vision
Control in visual systems
Colour, reflectance and light
Statistical and probabilistic models
Face and gesture
Surveillance
Biometrics and security
Robotics
Vehicle guidance
Automatic model acquisition
Medical image analysis and understanding
Aerial scene analysis and remote sensing
Deep learning models in computer vision
Both methodological and applications orientated papers are welcome.
Manuscripts submitted are expected to include a detailed and analytical review of the literature and state-of-the-art exposition of the original proposed research and its methodology, its thorough experimental evaluation, and last but not least, comparative evaluation against relevant and state-of-the-art methods. Submissions not abiding by these minimum requirements may be returned to authors without being sent to review.
Special Issues Current Call for Papers:
Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf
Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf