{"title":"EventHDR: From Event to High-Speed HDR Videos and Beyond.","authors":"Yunhao Zou, Ying Fu, Tsuyoshi Takatani, Yinqiang Zheng","doi":"10.1109/TPAMI.2024.3469571","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3469571","url":null,"abstract":"<p><p>Event cameras are innovative neuromorphic sensors that asynchronously capture the scene dynamics. Due to the event-triggering mechanism, such cameras record event streams with much shorter response latency and higher intensity sensitivity compared to conventional cameras. On the basis of these features, previous works have attempted to reconstruct high dynamic range (HDR) videos from events, but have either suffered from unrealistic artifacts or failed to provide sufficiently high frame rates. In this paper, we present a recurrent convolutional neural network that reconstruct high-speed HDR videos from event sequences, with a key frame guidance to prevent potential error accumulation caused by the sparse event data. Additionally, to address the problem of severely limited real dataset, we develop a new optical system to collect a real-world dataset with paired high-speed HDR videos and event streams, facilitating future research in this field. Our dataset provides the first real paired dataset for event-to-HDR reconstruction, avoiding potential inaccuracies from simulation strategies. Experimental results demonstrate that our method can generate high-quality, high-speed HDR videos. We further explore the potential of our work in cross-camera reconstruction and downstream computer vision tasks, including object detection, panoramic segmentation, optical flow estimation, and monocular depth estimation under HDR scenarios.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142396306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pixel is All You Need: Adversarial Spatio-Temporal Ensemble Active Learning for Salient Object Detection.","authors":"Zhenyu Wu, Wei Wang, Lin Wang, Yacong Li, Fengmao Lv, Qing Xia, Chenglizhao Chen, Aimin Hao, Shuo Li","doi":"10.1109/TPAMI.2024.3476683","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3476683","url":null,"abstract":"<p><p>Although weakly-supervised techniques can reduce the labeling effort, it is unclear whether a saliency model trained with weakly-supervised data (e.g., point annotation) can achieve the equivalent performance of its fully-supervised version. This paper attempts to answer this unexplored question by proving a hypothesis: there is a point-labeled dataset where saliency models trained on it can achieve equivalent performance when trained on the densely annotated dataset. To prove this conjecture, we proposed a novel yet effective adversarial spatio-temporal ensemble active learning. Our contributions are four- fold: 1) Our proposed adversarial attack triggering uncertainty can conquer the overconfidence of existing active learning methods and accurately locate these uncertain pixels. 2) Our proposed spatio-temporal ensemble strategy not only achieves outstanding performance but significantly reduces the model's computational cost. 3) Our proposed relationship-aware diversity sampling can conquer oversampling while boosting model performance. 4) We provide theoretical proof for the existence of such a point-labeled dataset. Experimental results show that our approach can find such a point-labeled dataset, where a saliency model trained on it obtained 98%-99% performance of its fully-supervised version with only ten annotated points per image. The code is available at https://github.com/wuzhenyubuaa/ASTE-AL.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142396308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent Diffusion Enhanced Rectangle Transformer for Hyperspectral Image Restoration.","authors":"Miaoyu Li, Ying Fu, Tao Zhang, Ji Liu, Dejing Dou, Chenggang Yan, Yulun Zhang","doi":"10.1109/TPAMI.2024.3475249","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3475249","url":null,"abstract":"<p><p>The restoration of hyperspectral image (HSI) plays a pivotal role in subsequent hyperspectral image applications. Despite the remarkable capabilities of deep learning, current HSI restoration methods face challenges in effectively exploring the spatial non-local self-similarity and spectral low-rank property inherently embedded with HSIs. This paper addresses these challenges by introducing a latent diffusion enhanced rectangle Transformer for HSI restoration, tackling the non-local spatial similarity and HSI-specific latent diffusion low-rank property. In order to effectively capture non-local spatial similarity, we propose the multi-shape spatial rectangle self-attention module in both horizontal and vertical directions, enabling the model to utilize informative spatial regions for HSI restoration. Meanwhile, we propose a spectral latent diffusion enhancement module that generates the image-specific latent dictionary based on the content of HSI for low-rank vector extraction and representation. This module utilizes a diffusion model to generatively obtain representations of global low-rank vectors, thereby aligning more closely with the desired HSI. A series of comprehensive experiments were carried out on four common hyperspectral image restoration tasks, including HSI denoising, HSI super-resolution, HSI reconstruction, and HSI inpainting. The results of these experiments highlight the effectiveness of our proposed method, as demonstrated by improvements in both objective metrics and subjective visual quality.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142396307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NCMNet: Neighbor Consistency Mining Network for Two-View Correspondence Pruning","authors":"Xin Liu;Rong Qin;Junchi Yan;Jufeng Yang","doi":"10.1109/TPAMI.2024.3462453","DOIUrl":"10.1109/TPAMI.2024.3462453","url":null,"abstract":"Correspondence pruning plays a crucial role in a variety of feature matching based tasks, which aims at identifying correct correspondences (inliers) from initial ones. Seeking consistent \u0000<inline-formula><tex-math>$k$</tex-math></inline-formula>\u0000-nearest neighbors in both coordinate and feature spaces is a prevalent strategy employed in previous approaches. However, the vicinity of an inlier contains numerous irregular false correspondences (outliers), which leads them to mistakenly become neighbors according to the similarity constraint of nearest neighbors. To tackle this issue, we propose a global-graph space to seek consistent neighbors with similar graph structures. This is achieved by using a global connected graph to explicitly render the affinity relationship between correspondences based on the spatial and feature consistency. Furthermore, to enhance the robustness of method for various matching scenes, we develop a neighbor consistency block to adequately leverage the potential of three types of neighbors. The consistency can be progressively mined by sequentially extracting intra-neighbor context and exploring inter-neighbor interactions. Ultimately, we present a Neighbor Consistency Mining Network (NCMNet) to estimate the parametric models and remove outliers. Extensive experimental results demonstrate that the proposed method outperforms other state-of-the-art methods on various benchmarks for two-view geometry estimation. Meanwhile, four extended tasks, including remote sensing image registration, point cloud registration, 3D reconstruction, and visual localization, are conducted to test the generalization ability.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11254-11272"},"PeriodicalIF":0.0,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142376399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Competing for Pixels: A Self-play Algorithm for Weakly-supervised Semantic Segmentation.","authors":"Shaheer U Saeed, Shiqi Huang, Joao Ramalhinho, Iani J M B Gayo, Nina Montana-Brown, Ester Bonmati, Stephen P Pereira, Brian Davidson, Dean C Barratt, Matthew J Clarkson, Yipeng Hu","doi":"10.1109/TPAMI.2024.3474094","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3474094","url":null,"abstract":"<p><p>Weakly-supervised semantic segmentation (WSSS) methods, reliant on image-level labels indicating object presence, lack explicit correspondence between labels and regions of interest (ROIs), posing a significant challenge. Despite this, WSSS methods have attracted attention due to their much lower annotation costs compared to fully-supervised segmentation. Leveraging reinforcement learning (RL) self-play, we propose a novel WSSS method that gamifies image segmentation of a ROI. We formulate segmentation as a competition between two agents that compete to select ROI-containing patches until exhaustion of all such patches. The score at each time-step, used to compute the reward for agent training, represents likelihood of object presence within the selection, determined by an object presence detector pre-trained using only image-level binary classification labels of object presence. Additionally, we propose a game termination condition that can be called by either side upon exhaustion of all ROI-containing patches, followed by the selection of a final patch from each. Upon termination, the agent is incentivised if ROI-containing patches are exhausted or disincentivised if a ROI-containing patch is found by the competitor. This competitive setup ensures minimisation of over- or under-segmentation, a common problem with WSSS methods. Extensive experimentation across four datasets demonstrates significant performance improvements over recent state-of-the-art methods. Code: https://github.com/s-sd/spurl/tree/main/wss.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142373925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optical Flow as Spatial-Temporal Attention Learners","authors":"Yawen Lu;Cheng Han;Qifan Wang;Heng Fan;Zhaodan Kong;Dongfang Liu;Yingjie Chen","doi":"10.1109/TPAMI.2024.3463648","DOIUrl":"10.1109/TPAMI.2024.3463648","url":null,"abstract":"Optical flow is an indispensable building block for various important computer vision tasks, including motion estimation, object tracking, and disparity measurement. To date, the dominant methods are CNN-based, leaving plenty of room for improvement. In this work, we propose TransFlow, a transformer architecture for optical flow estimation. Compared to dominant CNN-based methods, TransFlow demonstrates three advantages. First, it provides more accurate correlation and trustworthy matching in flow estimation by utilizing spatial self-attention and cross-attention mechanisms between adjacent frames to effectively capture global dependencies; Second, it recovers more compromised information (e.g., occlusion and motion blur) in flow estimation through long-range temporal association in dynamic scenes; Third, it introduces a concise self-learning paradigm, eliminating the need for complex and laborious multi-stage pre-training procedures. The versatility and superiority of TransFlow extend seamlessly to 3D scene motion, yielding competitive outcomes in 3D scene flow estimation. Our approach attains state-of-the-art results on benchmark datasets such as Sintel and KITTI-15, while also exhibiting exceptional performance on downstream tasks, including video object detection using the ImageNet VID dataset, video frame interpolation using the GoPro dataset, and video stabilization using the DeepStab dataset. We believe that the effectiveness of TransFlow positions it as a flexible baseline for both optical flow and scene flow estimation, offering promising avenues for future research and development.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11491-11506"},"PeriodicalIF":0.0,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142373926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Non-Local CRF With Applications.","authors":"Olga Veksler, Yuri Boykov","doi":"10.1109/TPAMI.2024.3474468","DOIUrl":"10.1109/TPAMI.2024.3474468","url":null,"abstract":"<p><p>CRFs model spatial coherence in classical and deep learning computer vision. The most common CRF is called pairwise, as it connects pixel pairs. There are two types of pairwise CRF: sparse and dense. A sparse CRF connects the nearby pixels, leading to a linear number of connections in the image size. A dense CRF connects all pixel pairs, leading to a quadratic number of connections. While dense CRF is a more general model, it is much less efficient than sparse CRF. In fact, only Gaussian edge dense CRF is used in practice, and even then with approximations. We propose a new pairwise CRF, which we call sparse non-local CRF. Like dense CRF, it has non-local connections, and, therefore, it is more general than sparse CRF. Like sparse CRF, the number of connections is linear, and, therefore, our model is efficient. Besides efficiency, another advantage is that our edge weights are unrestricted. We show that our sparse non-local CRF models properties similar to that of Gaussian dense CRF. We also discuss connections to other CRF models. We demonstrate the usefulness of our model on classical and deep learning applications, for two and multiple labels.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142373927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EuroCity Persons 2.0: A Large and Diverse Dataset of Persons in Traffic","authors":"Sebastian Krebs;Markus Braun;Dariu M. Gavrila","doi":"10.1109/TPAMI.2024.3471170","DOIUrl":"10.1109/TPAMI.2024.3471170","url":null,"abstract":"We present the EuroCity Persons (ECP) 2.0 dataset, a novel image dataset for person detection, tracking and prediction in traffic. The dataset was collected on-board a vehicle driving through 29 cities in 11 European countries. It contains more than 250K unique person trajectories, in more than 2.0M images and comes with a size of 11 TB. ECP2.0 is about one order of magnitude larger than previous state-of-the-art person datasets in automotive context. It offers remarkable diversity in terms of geographical coverage, time of day, weather and seasons. We discuss the novel semi-supervised approach that was used to generate the temporally dense pseudo ground-truth (i.e., 2D bounding boxes, 3D person locations) from sparse, manual annotations at keyframes. Our approach leverages auxiliary LiDAR data for 3D uplifting and vehicle inertial sensing for ego-motion compensation. It incorporates keyframe information in a three-stage approach (tracklet generation, tracklet merging into tracks, track smoothing) for obtaining accurate person trajectories. We validate our pseudo ground-truth generation approach in ablation studies, and show that it significantly outperforms existing methods. Furthermore, we demonstrate its benefits for training and testing of state-of-the-art tracking methods. Our approach provides a speed-up factor of about 34 compared to frame-wise manual annotation. The ECP2.0 dataset is made freely available for non-commercial research use.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10929-10943"},"PeriodicalIF":0.0,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142367960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NeMF: Neural Microphysics Fields.","authors":"Inbal Kom Betzer, Roi Ronen, Vadim Holodovsky, Yoav Y Schechner, Ilan Koren","doi":"10.1109/TPAMI.2024.3467913","DOIUrl":"10.1109/TPAMI.2024.3467913","url":null,"abstract":"<p><p>Inverse problems in scientific imaging often seek physical characterization of heterogeneous scene materials. The scene is thus represented by physical quantities, such as the density and sizes of particles (microphysics) across a domain. Moreover, the forward image formation model is physical. An important case is that of clouds, where microphysics in three dimensions (3D) dictate the cloud dynamics, lifetime and albedo, with implications to Earth's energy balance, sustainable energy and rainfall. Current methods, however, recover very degenerate representations of microphysics. To enable 3D volumetric recovery of all the required microphysical parameters, we introduce the neural microphysics field (NeMF). It is based on a deep neural network, whose input is multi-view polarization images. NeMF is pre-trained through supervised learning. Training relies on polarized radiative transfer, and noise modeling in polarization-sensitive sensors. The results offer unprecedented recovery, including droplet effective variance. We test NeMF in rigorous simulations and demonstrate it using real-world polarization-image data.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142335180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tensor Coupled Learning of Incomplete Longitudinal Features and Labels for Clinical Score Regression.","authors":"Qing Xiao, Guiying Liu, Qianjin Feng, Yu Zhang, Zhenyuan Ning","doi":"10.1109/TPAMI.2024.3471800","DOIUrl":"10.1109/TPAMI.2024.3471800","url":null,"abstract":"<p><p>Longitudinal data with incomplete entries pose a significant challenge for clinical score regression over multiple time points. Although many methods primarily estimate longitudinal scores with complete baseline features (i.e., features collected at the initial time point), such snapshot features may overlook beneficial latent longitudinal traits for generalization. Alternatively, certain completion approaches (e.g., tensor decomposition technology) have been proposed to impute incomplete longitudinal data before score estimation, most of which, however, are transductive and cannot utilize label semantics. This work presents a tensor coupled learning (TCL) paradigm of incomplete longitudinal features and labels for clinical score regression. The TCL enjoys three advantages: 1) It drives semantic-aware factor matrices and collaboratively deals with incomplete longitudinal entries (of features and labels), during which a dynamic regularizer is designed for adaptive attribute selection. 2) It establishes a closed loop connecting baseline features and the coupled factor matrices, which enables inductive inference of longitudinal scores relying on only baseline features. 3) It reinforces the information encoding of baseline data by preserving the local manifold of longitudinal feature space and detecting the temporal alteration across multiple time points. Extensive experiments demonstrate the remarkable performance improvement of our method on clinical score regression with incomplete longitudinal data.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142335181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}