DFF-VIO: A General Dynamic Feature Fused Monocular Visual-Inertial Odometry
Nan Luo; Zhexuan Hu; Yuan Ding; Jiaxu Li; Hui Zhao; Gang Liu; Quan Wang
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1758-1773, published 2024-10-17. DOI: 10.1109/TCSVT.2024.3482573.
Abstract: Integrating dynamic effects has proven significant for enhancing the accuracy and robustness of Visual-Inertial Odometry (VIO) systems in dynamic scenarios. Existing methods either prune dynamic features or rely heavily on prior semantic knowledge or kinetic models, approaches that fare poorly in scenes with many dynamic elements. This work proposes a novel dynamic feature fusion method for monocular VIO, named DFF-VIO, which requires no prior models or scene preference. By combining IMU-predicted poses with visual cues, it first identifies dynamic features during the tracking stage using consistency and degree-of-motion constraints. It then applies a novel Dynamic Transformation Operation (DTO) to separate the effect of dynamic features on multiple frames into pairwise effects, and constructs a Dynamic Feature Cell (DFC) to preserve the eligible information. Next, it reformulates the VIO nonlinear optimization problem and constructs dynamic feature residuals with the transformed DFC as a unit. Based on the proposed inter-frame model of moving features, a motion-compensation scheme resolves the reprojection of dynamic features, allowing their effects to be incorporated into the VIO's tightly coupled optimization and thereby enabling robust positioning in dynamic scenarios. We conduct accuracy evaluations on ADVIO and VIODE, degradation tests on the EuRoC dataset, and ablation studies that highlight the joint optimization of dynamic residuals. Results show that DFF-VIO outperforms state-of-the-art methods in pose accuracy and robustness across various dynamic environments.

Neuromorphic Imaging With Super-Resolution
Pei Zhang; Shuo Zhu; Chutian Wang; Yaping Zhao; Edmund Y. Lam
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1715-1727, published 2024-10-17. DOI: 10.1109/TCSVT.2024.3482436.
Abstract: Neuromorphic imaging is an emerging technique that imitates the human retina to sense variations in dynamic scenes. It responds to pixel-level brightness changes with asynchronous streaming events and boasts microsecond temporal precision over a high dynamic range, yielding blur-free recordings under extreme illumination. Nevertheless, the modality falls short in spatial resolution, leading to low visual richness and clarity. Hardware upgrades are expensive and may compromise performance by adding computational burden. Another option is to harness offline, plug-and-play super-resolution solutions. However, existing ones demand substantial sample volumes for lengthy training on massive computing resources and are largely restricted by real data availability, owing to the current imperfect high-resolution devices as well as the randomness and variability of motion. To tackle these challenges, we introduce the first self-supervised neuromorphic super-resolution prototype. It adapts to each input source from any low-resolution camera and estimates an optimal high-resolution counterpart at any scale, without side knowledge or prior training. Evaluated on downstream tasks, this simple yet effective method obtains competitive results against the state of the art, significantly improving flexibility without sacrificing accuracy. It also enhances inferior natural images and optical micrographs acquired under non-ideal imaging conditions, breaking through limitations that are hard to overcome with frame-based techniques. In the current landscape, where the use of high-resolution cameras for event-based sensing remains an open debate, our solution is a cost-efficient and practical alternative, paving the way for more intelligent imaging systems.

Reversible Data Hiding-Based Local Contrast Enhancement With Nonuniform Superpixel Blocks for Medical Images
Guangyong Gao; Sitian Yang; Xiangyang Hu; Zhihua Xia; Yun-Qing Shi
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1745-1757, published 2024-10-17. DOI: 10.1109/TCSVT.2024.3482556.
Abstract: Reversible data hiding-based contrast enhancement can be applied to medical images: it not only stores patient information through reversible embedding but also enhances image contrast, thereby assisting doctors in accurately diagnosing diseases. To address the problems of mainstream methods, a novel reversible data hiding-based local contrast enhancement method for medical images is proposed. The method uses superpixel segmentation to partition medical images into multiple pixel blocks and performs reversible data embedding and contrast enhancement on the blocks within the region of interest (ROI). Additionally, a new embedding strategy is proposed: according to the contrast and texture features of each pixel block, histogram expansion of varying degree is applied, effectively enhancing low-contrast blocks while avoiding over-enhancement of high-contrast ones. Experimental results demonstrate that, compared with state-of-the-art mainstream methods, the proposed method improves contrast in the ROI while preserving high visual quality in the medical images.

ZeroPose: CAD-Prompted Zero-Shot Object 6D Pose Estimation in Cluttered Scenes
Jianqiu Chen; Zikun Zhou; Mingshan Sun; Rui Zhao; Liwei Wu; Tianpeng Bao; Zhenyu He
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1251-1264, published 2024-10-17. DOI: 10.1109/TCSVT.2024.3482439.
Abstract: Many robotics and industrial applications demand the capability to estimate the 6D pose of novel objects in cluttered scenes. However, existing classic pose estimation methods are object-specific and can only handle the objects seen during training. Applied to a novel object, they require a cumbersome onboarding process involving extensive dataset preparation and model retraining; the duration and resource consumption of onboarding limit their practicality in real-world applications. In this paper, we introduce ZeroPose, a novel zero-shot framework that performs pose estimation through a Discovery-Orientation-Registration (DOR) inference pipeline and generalizes to novel objects without model retraining. Given the CAD model of a novel object, ZeroPose needs only seconds of onboarding to extract visual and geometric embeddings from the CAD model as a prompt. Prompted with these embeddings, DOR discovers all related instances and estimates their 6D poses without additional human interaction or presupposed scene conditions. Compared with existing zero-shot methods based on the render-and-compare paradigm, the DOR pipeline formulates object pose estimation as a feature-matching problem, which avoids time-consuming online rendering and improves efficiency. Experimental results on seven datasets show that ZeroPose, as a zero-shot method, achieves performance comparable to object-specific training methods and outperforms the state-of-the-art zero-shot method with a 50x inference-speed improvement.

FDCE-Net: Underwater Image Enhancement With Embedding Frequency and Dual Color Encoder
Zheng Cheng; Guodong Fan; Jingchun Zhou; Min Gan; C. L. Philip Chen
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1728-1744, published 2024-10-17. DOI: 10.1109/TCSVT.2024.3482548.
Abstract: Underwater images often suffer from low brightness, color shift, blurred details, and noise due to light absorption and scattering by water and suspended particles. Previous underwater image enhancement (UIE) methods have primarily focused on spatial-domain enhancement, neglecting the frequency-domain information inherent in the images, even though the degradation factors of underwater images are closely intertwined in the spatial domain. Methods that do enhance images in the frequency domain overlook the relationship between the degradation factors and the information present in the frequency domain; as a result, they frequently improve certain attributes of the image while inadequately addressing, or even exacerbating, others. Moreover, many existing methods rely heavily on prior knowledge to address color shift, limiting their flexibility and robustness. To overcome these limitations, we propose the Embedding Frequency and Dual Color Encoder Network (FDCE-Net), which consists of two main structures: 1) the Frequency Spatial Network (FS-Net) achieves initial enhancement by using our Frequency Spatial Residual Block (FSRB) to decouple image degradation factors in the frequency domain and enhance different attributes separately; 2) to tackle color shift, the Dual-Color Encoder (DCE) establishes correlations between color and semantic representations through cross-attention and leverages multi-scale image features to guide the optimization of the adaptive color query. The final enhanced images are generated by fusing the outputs of FS-Net and DCE through a fusion network, and exhibit rich details, clear textures, low noise, and natural colors. Extensive experiments demonstrate that FDCE-Net outperforms state-of-the-art (SOTA) methods in both visual quality and quantitative metrics. The code is publicly available at: https://github.com/Alexande-rChan/FDCE-Net.

{"title":"Inter-Clip Feature Similarity Based Weakly Supervised Video Anomaly Detection via Multi-Scale Temporal MLP","authors":"Yuanhong Zhong;Ruyue Zhu;Ge Yan;Ping Gan;Xuerui Shen;Dong Zhu","doi":"10.1109/TCSVT.2024.3482414","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3482414","url":null,"abstract":"The major paradigm of weakly supervised video anomaly detection (WSVAD) is treating it as a multiple instance learning (MIL) problem, with only video-level labels available for training. Due to the rarity and ambiguity of anomaly, the selection of potential abnormal training sample is the prime challenge for WSVAD. Considering the temporal relevance and length variation of anomaly events, how to integrate the temporal information is also a controversial topic in WSVAD area. To address forementioned problems, we propose a novel method named Inter-clip Feature Similarity based Video Anomaly Detection (IFS-VAD). In the proposed IFS-VAD, to make use of both the global and local temporal relation, a Multi-scale Temporal MLP (MT-MLP) is leveraged. To better capture the ambiguous abnormal instances in positive bags, we introduce a novel anomaly criterion based on the Inter-clip Feature Similarity (IFS). The proposed IFS criterion can assist in discerning anomaly, as an additional anomaly score in the prediction process of anomaly classifier. Extensive experiments show that IFS-VAD demonstrates state-of-the-art performance on ShanghaiTech with an AUC of 97.95%, UCF-Crime with an AUC of 86.57% and XD-Violence with an AP of 83.14%. Our code implementation is accessible at <uri>https://github.com/Ria5331/IFS-VAD</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 2","pages":"1961-1970"},"PeriodicalIF":8.3,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143404017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple Pedestrian Tracking Under Occlusion: A Survey and Outlook","authors":"Zhihong Sun;Guoheng Wei;Wei Fu;Mang Ye;Kui Jiang;Chao Liang;Tingting Zhu;Tao He;Mithun Mukherjee","doi":"10.1109/TCSVT.2024.3481425","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3481425","url":null,"abstract":"As an intermediate task in computer vision, multiple pedestrian tracking (MPT) aiming at tracking the pedestrians from a given video, has attracted attention due to its potential academic and commercial value. However, pedestrians commonly suffer from occlusion due to diverse and complex scenarios, which increases the challenge of this task. This survey provides comprehensive review in terms of occlusion scenarios encountered during MPT, and investigates the model robustness of the existing methods in this scenarios. Firstly, this survey introduces the various and states of occlusion. Secondly, the related occlusion datasets are introduced. Subsequently, we categorize existing occlusion handling methods according to the tracking process and detail their pros and cons. In addition, occlusion handling precision (OHP) metric is proposed to evaluate the ability of a tracker in handling occlusion in this survey. Moreover, comprehensive analyzes and discussions in several public datasets are provided to verify the effectiveness of these methods. Finally, the existing issues and future directions for occlusion handling methods are discussed. In doing so, this work serves as a foundation for future research by providing researchers with information about the occlusion handling method of MPT.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 2","pages":"1009-1027"},"PeriodicalIF":8.3,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10720185","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143403815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Selectively Hard Negative Mining for Alleviating Gradient Vanishing in Image-Text Matching
Zheng Li; Caili Guo; Xin Wang; Zerun Feng; Zhongtian Du
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1921-1935, published 2024-10-15. DOI: 10.1109/TCSVT.2024.3480949.
Abstract: Most Image-Text Matching (ITM) models adopt triplet loss with Hard Negative mining (T-HN) as the optimization objective. T-HN mines the hardest negative samples in each batch and achieves impressive performance. However, we observe that these ITM models behave badly in the early phases of training: training converges with difficulty, and matching performance improves slowly. In this paper, we find that the cause of this bad training behavior is gradient vanishing; optimizing an ITM model using only the hardest negative samples can easily make the gradient vanish. Through gradient analysis, we first derive the condition under which the gradient vanishes during training and explain why it tends to zero under that condition. To alleviate gradient vanishing, we propose Triplet loss with Selectively Hard Negative mining (T-SelHN), which decides whether to mine the hardest negative samples according to the gradient-vanishing condition. T-SelHN can be applied to ITM models in a plug-and-play manner to improve their training behavior. To further ensure the back-propagation of gradients, we construct a Residual Visual Semantic Embedding model with T-SelHN, denoted RVSE++, which has a simple network structure and fast training and inference. Extensive experiments on two ITM benchmarks demonstrate the strength of RVSE++, which achieves state-of-the-art performance. The code is available at https://github.com/AAA-Zheng/RVSEPP.

{"title":"Adaptive Ensemble Learning With Category-Aware Attention and Local Contrastive Loss","authors":"Hongrui Guo;Tianqi Sun;Hongzhi Liu;Zhonghai Wu","doi":"10.1109/TCSVT.2024.3479313","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3479313","url":null,"abstract":"Machine learning techniques can help us deal with many difficult problems in the real world. Proper ensemble of multiple learners can improve the predictive performance. Each base learner usually has different predictive ability on different instances or in different instance regions. However, existing ensemble methods often assume that base learners have the same predictive ability for all instances without consideration of the specificity of different instances or categories. To address these issues, we propose an adaptive ensemble learning framework with category-aware attention and local contrastive loss, which can adaptively adjust the ensemble weight of each base classifier according to the characteristics of each instance. Specifically, we design a category-aware attention mechanism to learn the predictive ability of each classifier on different categories. Furthermore, we design a local contrastive loss to capture local similarities between instances and further enhance the model’s ability to discern fine-grained patterns in the data. Extensive experiments on 20 public datasets demonstrate the effectiveness of the proposed model.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 2","pages":"1224-1236"},"PeriodicalIF":8.3,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143404023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frequency Decoupled Domain-Irrelevant Feature Learning for Pan-Sharpening","authors":"Jie Zhang;Ke Cao;Keyu Yan;Yunlong Lin;Xuanhua He;Yingying Wang;Rui Li;Chengjun Xie;Jun Zhang;Man Zhou","doi":"10.1109/TCSVT.2024.3480950","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3480950","url":null,"abstract":"Pan-sharpening aims to generate high-detail multi-spectral images (HRMS) through the fusion of panchromatic (PAN) and multi-spectral (MS) images. However, existing pan-sharpening methods often suffer from significant performance degradation when dealing with out-of-distribution data, as they assume the training and test datasets are independent and identically distributed. To overcome this challenge, we propose a novel frequency domain-irrelevant feature learning framework that exhibits exceptional generalization capabilities. Our approach involves parallel extraction and processing of domain-irrelevant information from the amplitude and phase components of the input images. Specifically, we design a frequency information separation module to extract the amplitude and phase components of the paired images. The learnable high-pass filter is then employed to eliminate domain-specific information from the amplitude spectrums. After that, we devised two specialized sub-networks (AFL-Net and PFL-Net) to perform targeted learning of the frequency domain-irrelevant information. This allows our method to effectively capture the complementary domain-irrelevant information contained in the amplitude and phase spectra of the images. Finally, the information fusion and restoration module dynamically adjusts the feature channel weights, enabling the network to output high-quality HRMS images. Through this frequency domain-irrelevant feature learning framework, our method balances generalization capability and network performance on the distribution of training dataset. Extensive experiments conducted on various satellite datasets demonstrate the effectiveness of our method for generalized pan-sharpening. Our proposed network outperforms state-of-the-art methods in terms of both quantitative metrics and visual quality, showcasing its superior ability to handle diverse, out-of-distribution data.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 2","pages":"1237-1250"},"PeriodicalIF":8.3,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143403820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}