{"title":"Efficient Neural Collaborative Search for Pickup and Delivery Problems","authors":"Detian Kong;Yining Ma;Zhiguang Cao;Tianshu Yu;Jianhua Xiao","doi":"10.1109/TPAMI.2024.3450850","DOIUrl":"10.1109/TPAMI.2024.3450850","url":null,"abstract":"In this paper, we introduce Neural Collaborative Search (NCS), a novel learning-based framework for efficiently solving pickup and delivery problems (PDPs). NCS pioneers the collaboration between the latest prevalent neural construction and neural improvement models, establishing a collaborative framework where an improvement model iteratively refines solutions initiated by a construction model. Our NCS collaboratively trains the two models via reinforcement learning with an effective shared-critic mechanism. In addition, the construction model enhances the improvement model with high-quality initial solutions via curriculum learning, while the improvement model accelerates the convergence of the construction model through imitation learning. Besides the new framework design, we also propose the efficient Neural Neighborhood Search (N2S), an efficient improvement model employed within the NCS framework. N2S exploits a tailored Markov decision process formulation and two customized decoders for removing and then reinserting a pair of pickup-delivery nodes, thereby learning a ruin-repair search process for addressing the precedence constraints in PDPs efficiently. To balance the computation cost between encoders and decoders, N2S streamlines the existing encoder design through a light Synthesis Attention mechanism that allows the vanilla self-attention to synthesize various features regarding a route solution. Moreover, a diversity enhancement scheme is further leveraged to ameliorate the performance during the inference of N2S. Our NCS and N2S are both generic, and extensive experiments on two canonical PDP variants show that they can produce state-of-the-art results among existing neural methods. Remarkably, our NCS and N2S could surpass the well-known LKH3 solver especially on the more constrained PDP variant.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11019-11034"},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142127749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panoptic-PartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation
Xiangtai Li; Shilin Xu; Yibo Yang; Haobo Yuan; Guangliang Cheng; Yunhai Tong; Zhouchen Lin; Ming-Hsuan Yang; Dacheng Tao
DOI: 10.1109/TPAMI.2024.3453916 | IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11087-11103. Published online 2024-09-03.

Abstract: Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works use separate approaches to handle thing, stuff, and part predictions, with no shared computation or task association. We aim to unify these tasks at the architectural level by designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we first design a meta-architecture that decouples part features from thing/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem; we term this model Panoptic-PartFormer. Second, we propose a new metric, Part-Whole Quality (PWQ), which better measures the task from both pixel-region and part-whole perspectives and decouples the errors of part segmentation and panoptic segmentation. Third, inspired by Mask2Former and building on our meta-architecture, we propose Panoptic-PartFormer++, which introduces a new part-whole cross-attention scheme, implemented via masked cross-attention, to further boost part segmentation quality. Finally, extensive ablation studies and analyses demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with Panoptic-PartFormer, Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and a 5% PartPQ improvement on the Pascal Context PPS dataset, setting new state-of-the-art results on both. Our models can serve as strong baselines and aid future research in PPS.
Deep Interactive Segmentation of Medical Images: A Systematic Review and Taxonomy
Zdravko Marinov; Paul F. Jäger; Jan Egger; Jens Kleesiek; Rainer Stiefelhagen
DOI: 10.1109/TPAMI.2024.3452629 | IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10998-11018. Published online 2024-08-30.
Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10660300

Abstract: Interactive segmentation is a crucial research area in medical image analysis that aims to boost the efficiency of costly annotations by incorporating human feedback. This feedback takes the form of clicks, scribbles, or masks and allows for iterative refinement of the model output, efficiently guiding the system toward the desired behavior. In recent years, deep learning-based approaches have propelled results to a new level, causing rapid growth in the field, with 121 methods proposed in the medical imaging domain alone. In this review, we provide a structured overview of this emerging field, featuring a comprehensive taxonomy, a systematic review of existing methods, and an in-depth analysis of current practices. Based on these contributions, we discuss the challenges and opportunities in the field. For instance, we find a severe lack of comparison across methods, which needs to be tackled with standardized baselines and benchmarks.
Unsupervised Active Visual Search With Monte Carlo Planning Under Uncertain Detections
Francesco Taioli; Francesco Giuliari; Yiming Wang; Riccardo Berra; Alberto Castellini; Alessio Del Bue; Alessandro Farinelli; Marco Cristani; Francesco Setti
DOI: 10.1109/TPAMI.2024.3451994 | IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11047-11058. Published online 2024-08-29.
Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10659171

Abstract: We propose a solution for Active Visual Search of objects in an environment whose 2D floor map is the only known information. Our solution has three key features that make it more plausible and more robust to detector failures than state-of-the-art methods: i) it is unsupervised, as it does not need any training sessions; ii) during exploration, a probability distribution on the 2D floor map is updated according to an intuitive mechanism, while an improved belief update increases the effectiveness of the agent's exploration; iii) we incorporate the awareness that an object detector may fail into this probability model by exploiting the success statistics of the specific detector. Our solution is dubbed POMP-BE-PD (POMCP-based Online Motion Planning with Belief by Exploration and Probabilistic Detection). It uses the agent's current pose and an RGB-D observation to learn an optimal search policy, exploiting a POMDP solved by a Monte-Carlo planning approach. On the Active Vision Dataset Benchmark, we increase the average success rate over all environments by a significant 35% while decreasing the average path length by 4% with respect to competing methods. Our results are thus state-of-the-art, even without any training procedure.
{"title":"Uni-to-Multi Modal Knowledge Distillation for Bidirectional LiDAR-Camera Semantic Segmentation","authors":"Tianfang Sun;Zhizhong Zhang;Xin Tan;Yong Peng;Yanyun Qu;Yuan Xie","doi":"10.1109/TPAMI.2024.3451658","DOIUrl":"10.1109/TPAMI.2024.3451658","url":null,"abstract":"Combining LiDAR points and images for robust semantic segmentation has shown great potential. However, the heterogeneity between the two modalities (e.g. the density, the field of view) poses challenges in establishing a bijective mapping between each point and pixel. This modality alignment problem introduces new challenges in network design and data processing for cross-modal methods. Specifically, 1) points that are projected outside the image planes; 2) the complexity of maintaining geometric consistency limits the deployment of many data augmentation techniques. To address these challenges, we propose a cross-modal knowledge imputation and transition approach. First, we introduce a bidirectional feature fusion strategy that imputes missing image features and performs cross-modal fusion simultaneously. This allows us to generate reliable predictions even when images are missing. Second, we propose a Uni-to-Multi modal Knowledge Distillation (U2MKD) framework, leveraging the transfer of informative features from a single-modality teacher to a cross-modality student. This overcomes the issues of augmentation misalignment and enables us to train the student effectively. Extensive experiments on the nuScenes, Waymo, and SemanticKITTI datasets demonstrate the effectiveness of our approach. Notably, our method achieves an 8.3 mIoU gain over the LiDAR-only baseline on the nuScenes validation set and achieves state-of-the-art performance on the three datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11059-11072"},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142101431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topo-Geometric Analysis of Variability in Point Clouds Using Persistence Landscapes","authors":"James Matuk;Sebastian Kurtek;Karthik Bharath","doi":"10.1109/TPAMI.2024.3451328","DOIUrl":"10.1109/TPAMI.2024.3451328","url":null,"abstract":"Topological data analysis provides a set of tools to uncover low-dimensional structure in noisy point clouds. Prominent amongst the tools is persistence homology, which summarizes birth-death times of homological features using data objects known as persistence diagrams. To better aid statistical analysis, a functional representation of the diagrams, known as persistence landscapes, enable use of functional data analysis and machine learning tools. Topological and geometric variabilities inherent in point clouds are confounded in both persistence diagrams and landscapes, and it is important to distinguish topological signal from noise to draw reliable conclusions on the structure of the point clouds when using persistence homology. We develop a framework for decomposing variability in persistence diagrams into topological signal and topological noise through alignment of persistence landscapes using an elastic Riemannian metric. Aligned landscapes (amplitude) isolate the topological signal. Reparameterizations used for landscape alignment (phase) are linked to a resolution parameter used to generate persistence diagrams, and capture topological noise in the form of geometric, global scaling and sampling variabilities. We illustrate the importance of decoupling topological signal and topological noise in persistence diagrams (landscapes) using several simulated examples. We also demonstrate that our approach provides novel insights in two real data studies.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11035-11046"},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142086429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Playing for 3D Human Recovery","authors":"Zhongang Cai;Mingyuan Zhang;Jiawei Ren;Chen Wei;Daxuan Ren;Zhengyu Lin;Haiyu Zhao;Lei Yang;Chen Change Loy;Ziwei Liu","doi":"10.1109/TPAMI.2024.3450537","DOIUrl":"10.1109/TPAMI.2024.3450537","url":null,"abstract":"Image- and video-based 3D human recovery (i.e., pose and shape estimation) have achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths. Specifically, we contribute \u0000<bold>GTA-Human</b>\u0000, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. \u0000<bold>First</b>\u0000, game-playing data is surprisingly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. \u0000<bold>Second</b>\u0000, we discover that synthetic data provides critical complements to the real data that is typically collected indoor. We highlight that our investigation into domain gap provides explanations for our data mixture strategies that are simple yet useful, which offers new insights to the research community. \u0000<bold>Third</b>\u0000, the scale of the dataset matters. The performance boost is closely related to the additional data available. A systematic study on multiple key factors (such as camera angle and body pose) reveals that the model performance is sensitive to data density. \u0000<bold>Fourth</b>\u0000, the effectiveness of GTA-Human is also attributed to the rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. \u0000<bold>Fifth</b>\u0000, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work could pave the way for scaling up 3D human recovery to the real world.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10533-10545"},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142082926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis
Long Zhuo; Guangcong Wang; Shikai Li; Wayne Wu; Ziwei Liu
DOI: 10.1109/TPAMI.2024.3450630 | IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10732-10747. Published online 2024-08-27.

Abstract: Video-to-video synthesis (Vid2Vid) achieves remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentations, sketches, and poses. However, this pipeline suffers from high computational cost and long inference latency, mainly attributable to two essential factors: 1) network architecture parameters and 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods, however, mainly focus on slimming network architectures and ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for compressing video tasks. In this paper, we present Fast-Vid2Vid++, a spatial-temporal hybrid distillation compression framework that applies knowledge distillation to both the teacher network and the data stream of generative models, in both space and time. Fast-Vid2Vid++ makes the first attempt along the time dimension to transfer hierarchical features and temporal-coherence knowledge, reducing computational resources and accelerating inference. Specifically, we compress the data stream spatially and reduce its temporal redundancy. We distill knowledge of the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains, and we transfer the long-term dependencies of features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model synthesizes high-resolution key frames from a low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI). On standard benchmarks, Fast-Vid2Vid++ achieves real-time performance of 30-59 FPS and saves 28-35x computational cost on a single V100 GPU.
DeepTensor: Low-Rank Tensor Decomposition With Deep Network Priors
Vishwanath Saragadam; Randall Balestriero; Ashok Veeraraghavan; Richard G. Baraniuk
DOI: 10.1109/TPAMI.2024.3450575 | IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10337-10348. Published online 2024-08-27.

Abstract: DeepTensor is a computationally efficient framework for low-rank decomposition of matrices and tensors using deep generative networks. We decompose a tensor as the product of low-rank tensor factors (e.g., a matrix as the outer product of two vectors), where each low-rank factor is generated by a deep network (DN) trained in a self-supervised manner to minimize the mean-squared approximation error. Our key observation is that the implicit regularization inherent in DNs enables them to capture nonlinear signal structures (e.g., manifolds) that are out of reach of classical linear methods such as the singular value decomposition (SVD) and principal component analysis (PCA). Furthermore, in contrast to the SVD and PCA, whose performance deteriorates when the tensor's entries deviate from additive white Gaussian noise, we demonstrate that the performance of DeepTensor is robust to a wide range of distributions. We validate that DeepTensor is a robust and computationally efficient drop-in replacement for the SVD, PCA, nonnegative matrix factorization (NMF), and similar decompositions across a range of real-world applications, including hyperspectral image denoising, 3D MRI tomography, and image classification. In particular, DeepTensor offers a 6 dB signal-to-noise-ratio improvement over standard denoising methods for signals corrupted by Poisson noise and learns to decompose 3D tensors 60 times faster than a single DN equipped with 3D convolutions.
{"title":"Frequency-Aware Feature Fusion for Dense Image Prediction","authors":"Linwei Chen;Ying Fu;Lin Gu;Chenggang Yan;Tatsuya Harada;Gao Huang","doi":"10.1109/TPAMI.2024.3449959","DOIUrl":"10.1109/TPAMI.2024.3449959","url":null,"abstract":"Dense image prediction tasks demand features with strong category information and precise spatial boundary details at high resolution. To achieve this, modern hierarchical models often utilize feature fusion, directly adding upsampled coarse features from deep layers and high-resolution features from lower levels. In this paper, we observe rapid variations in fused feature values within objects, resulting in intra-category inconsistency due to disturbed high-frequency features. Additionally, blurred boundaries in fused features lack accurate high frequency, leading to boundary displacement. Building upon these observations, we propose Frequency-Aware Feature Fusion (FreqFusion), integrating an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator. The ALPF generator predicts spatially-variant low-pass filters to attenuate high-frequency components within objects, reducing intra-class inconsistency during upsampling. The offset generator refines large inconsistent features and thin boundaries by replacing inconsistent features with more consistent ones through resampling, while the AHPF generator enhances high-frequency detailed boundary information lost during downsampling. Comprehensive visualization and quantitative analysis demonstrate that FreqFusion effectively improves feature consistency and sharpens object boundaries. Extensive experiments across various dense prediction tasks confirm its effectiveness.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10763-10780"},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142074851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}