{"title":"DeepInteraction++: Multi-Modality Interaction for Autonomous Driving","authors":"Zeyu Yang;Nan Song;Wei Li;Xiatian Zhu;Li Zhang;Philip H.S. Torr","doi":"10.1109/TPAMI.2025.3565194","DOIUrl":"10.1109/TPAMI.2025.3565194","url":null,"abstract":"Existing top-performance autonomous driving systems typically rely on the <italic>multi-modal fusion</i> strategy for reliable scene understanding. This design is however fundamentally restricted due to overlooking the modality-specific strengths and finally hampering the model performance. To address this limitation, in this work, we introduce a novel <italic>modality interaction</i> strategy that allows individual per-modality representations to be learned and maintained throughout, enabling their unique characteristics to be exploited during the whole perception pipeline. To demonstrate the effectiveness of the proposed strategy, we design <italic>DeepInteraction++</i>, a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Specifically, the encoder is implemented as a dual-stream Transformer with specialized attention operation for information exchange and integration between separate modality-specific representations. Our multi-modal representational learning incorporates both object-centric, precise sampling-based feature alignment and global dense information spreading, essential for the more challenging planning task. The decoder is designed to iteratively refine the predictions by alternately aggregating information from separate representations in a unified modality-agnostic manner, realizing multi-modal predictive interaction. Extensive experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6749-6763"},"PeriodicalIF":0.0,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143889993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Positive-Unlabeled Classification From Corrupted Data in GANs","authors":"Yunke Wang;Chang Xu;Tianyu Guo;Bo Du;Dacheng Tao","doi":"10.1109/TPAMI.2025.3565394","DOIUrl":"10.1109/TPAMI.2025.3565394","url":null,"abstract":"This paper defines a positive and unlabeled classification problem for standard GANs, which then leads to a novel technique to stabilize the training of the discriminator in GANs and deal with corrupted data. Traditionally, real data are taken as positive while generated data are negative. This positive-negative classification criterion was kept fixed all through the learning process of the discriminator without considering the gradually improved quality of generated data, even if they could be more realistic than real data at times. In contrast, it is more reasonable to treat the generated data as unlabeled, which could be positive or negative according to their quality. The discriminator is thus a classifier for this positive and unlabeled classification problem, and we derive a new Positive-Unlabeled GAN (PUGAN). We theoretically discuss the global optimality the proposed model will achieve and the equivalent optimization goal. Empirically, we find that PUGAN can achieve comparable or even better performance than those sophisticated discriminator stabilization methods. Considering the potential corrupted data problem in real-world scenarios, we further extend our approach to PUGAN-C, which treats real data as unlabeled that accounts for both clean and corrupted instances, and generated data as positive. The samples from generator could be closer to those corrupted data within unlabeled data at first, but within the framework of adversarial training, the generator will be optimized to cheat the discriminator and produce samples that are similar to those clean data. Experimental results on image generation from several corrupted datasets demonstrate the effectiveness and generalization of PUGAN-C.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6859-6875"},"PeriodicalIF":0.0,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143889994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Every Object From Events","authors":"Haitian Zhang;Chang Xu;Xinya Wang;Bingde Liu;Guang Hua;Lei Yu;Wen Yang","doi":"10.1109/TPAMI.2025.3565102","DOIUrl":"10.1109/TPAMI.2025.3565102","url":null,"abstract":"Object detection is critical in autonomous driving, and it is more practical yet challenging to localize objects of unknown categories: an endeavour known as Class-Agnostic Object Detection (CAOD). Existing studies on CAOD predominantly rely on RGB cameras, but these frame-based sensors usually have high latency and limited dynamic range, leading to safety risks under extreme conditions like fast-moving objects, overexposure, and darkness. In this study, we turn to the event-based vision, featured by its sub-millisecond latency and high dynamic range, for robust CAOD. We propose Detecting Every Object in Events (DEOE), an approach aimed at achieving high-speed, class-agnostic object detection in event-based vision. Built upon the fast event-based backbone: recurrent vision transformer, we jointly consider the spatial and temporal consistencies to identify potential objects. The discovered potential objects are assimilated as soft positive samples to avoid being suppressed as backgrounds. Moreover, we introduce a disentangled objectness head to separate the foreground-background classification and novel object discovery tasks, enhancing the model's generalization in localizing novel objects while maintaining a strong ability to filter out the background. Extensive experiments confirm the superiority of our proposed DEOE in both open-set and closed-set settings, outperforming strong baseline methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"7171-7178"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weakly Supervised Micro- and Macro-Expression Spotting Based on Multi-Level Consistency","authors":"Wang-Wang Yu;Kai-Fu Yang;Hong-Mei Yan;Yong-Jie Li","doi":"10.1109/TPAMI.2025.3564951","DOIUrl":"10.1109/TPAMI.2025.3564951","url":null,"abstract":"Most micro- and macro-expression spotting methods in untrimmed videos suffer from the burden of video-wise collection and frame-wise annotation. Weakly supervised expression spotting (WES) based on video-level labels can potentially mitigate the complexity of frame-level annotation while achieving fine-grained frame-level spotting. However, we argue that existing weakly supervised methods are based on multiple instance learning (MIL) involving inter-modality, inter-sample, and inter-task gaps. The inter-sample gap is primarily from the sample distribution and duration. Therefore, we propose a novel and simple WES framework, MC-WES, using multi-consistency collaborative mechanisms that include modal-level saliency, video-level distribution, label-level duration and segment-level feature consistency strategies to implement fine frame-level spotting with only video-level labels to alleviate the above gaps and merge prior knowledge. The modal-level saliency consistency strategy focuses on capturing key correlations between raw images and optical flow. The video-level distribution consistency strategy utilizes the difference of sparsity in temporal distribution. The label-level duration consistency strategy exploits the difference in the duration of facial muscles. The segment-level feature consistency strategy emphasizes that features under the same labels maintain similarity. Experimental results on three challenging datasets–CAS(ME)<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>, CAS(ME)<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>, and SAMM-LV–demonstrate that MC-WES is comparable to state-of-the-art fully supervised methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6912-6928"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Normalized-Full-Palmar-Hand: Toward More Accurate Hand-Based Multimodal Biometrics","authors":"Yitao Qiao;Wenxiong Kang;Dacan Luo;Junduan Huang","doi":"10.1109/TPAMI.2025.3564514","DOIUrl":"10.1109/TPAMI.2025.3564514","url":null,"abstract":"Hand-based multimodal biometrics have attracted significant attention due to their high security and performance. However, existing methods fail to adequately decouple various hand biometric traits, limiting the extraction of unique features. Moreover, effective feature extraction for multiple hand traits remains a challenge. To address these issues, we propose a novel method for the precise decoupling of hand multimodal features called ‘Normalized-Full-Palmar-Hand’ and construct an authentication system based on this method. First, we propose HSANet, which accurately segments various hand regions with diverse backgrounds based on low-level details and high-level semantic information. Next, we establish two hand multimodal biometric databases with HSANet: SCUT Normalized-Full-Palmar-Hand Database Version 1 (SCUT_NFPH_v1) and Version 2 (SCUT_NFPH_v2). These databases include full hand images, semantic masks, and images of various hand biometric traits obtained from the same individual at the same scale, totaling 157,500 images. Third, we propose the Full Palmar Hand Authentication Network framework (FPHandNet) to extract unique features of multiple hand biometric traits. Finally, extensive experimental results, performed via the publicly available CASIA, IITD, COEP databases, and our proposed databases, validate the effectiveness of our methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6715-6730"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive Conversational Head Generation","authors":"Mohan Zhou;Yalong Bai;Wei Zhang;Ting Yao;Tiejun Zhao","doi":"10.1109/TPAMI.2025.3562651","DOIUrl":"10.1109/TPAMI.2025.3562651","url":null,"abstract":"We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation. The capability to automatically synthesize interlocutors which can participate in long and multi-turn conversations is vital and offer benefits for various applications, including digital humans, virtual agents, and social robots. While existing research primarily focuses on talking head generation (one-way interaction), hindering the ability to create a digital human for conversation (two-way) interaction due to the absence of listening and interaction parts. In this work, we construct two datasets to address this issue, “ViCo” for independent talking and listening head generation tasks at the sentence level, and “ViCo-X”, for synthesizing interlocutors in multi-turn conversational scenarios. Based on ViCo and ViCo-X, we define three novel tasks targeting the interaction modeling during the face-to-face conversation: 1) responsive listening head generation making listeners respond actively to the speaker with non-verbal signals, 2) expressive talking head generation guiding speakers to be aware of listeners’ behaviors, and 3) conversational head generation to integrate the talking/listening ability in one interlocutor. Along with the datasets, we also propose corresponding baseline solutions to the three aforementioned tasks. Experimental results show that our baseline method could generate responsive and vivid agents that can collaborate with real person to fulfil the whole conversation.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6673-6686"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MC-ANN: A Mixture Clustering-Based Attention Neural Network for Time Series Forecasting","authors":"Yanhong Li;David C. Anastasiu","doi":"10.1109/TPAMI.2025.3565224","DOIUrl":"10.1109/TPAMI.2025.3565224","url":null,"abstract":"Time Series Forecasting (TSF) has been researched extensively, yet predicting time series with big variances and extreme events remains a challenging problem. Extreme events in reservoirs occur rarely but tend to cause huge problems, e.g., flooding entire towns or neighborhoods, which makes accurate reservoir water level prediction exceedingly important. In this work, we develop a novel extreme-adaptive forecasting approach to accommodate the big variance in hydrologic datasets. We model the time series data distribution as a mixture of both point-wise and segment-wise Gaussian distributions. In particular, we develop a novel End-To-End Mixture Clustering Attention Neural Network (MC-ANN) model for univariate time series forecasting, which we show is able to predict future reservoir water levels effectively. MC-ANN consists of two modules: 1) a grouped Auto-Encoder-based Forecaster (AEF) and 2) a mixture clustering-based learnable Weights Attention Network (WAN) with an attention mechanism. The WAN component is crucial, skillfully adjusting weights to distinguish data with varying distributions, enabling each AEF to concentrate on clusters of data with similar characteristics. Through extensive experiments on real-world datasets, we show MC-ANN’s effectiveness (10–45% root mean square error reductions over state-of-the-art methods), underlining its notable potential for practical applications in univariate, skewed, long-term time series prediction tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6888-6899"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10979493","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Collaborative Autonomous Driving: Simulation Platform and End-to-End System","authors":"Genjia Liu;Yue Hu;Chenxin Xu;Weibo Mao;Junhao Ge;Zhengxiang Huang;Yifan Lu;Yinda Xu;Junkai Xia;Yafei Wang;Siheng Chen","doi":"10.1109/TPAMI.2025.3560327","DOIUrl":"10.1109/TPAMI.2025.3560327","url":null,"abstract":"Vehicle-to-everything-aided autonomous driving (V2X-AD) has a huge potential to provide a safer driving solution. Despite extensive research in transportation and communication to support V2X-AD, the actual utilization of these infrastructures and communication resources in enhancing driving performances remains largely unexplored. This highlights the necessity of collaborative autonomous driving; that is, a machine learning approach that optimizes the information sharing strategy to improve the driving performance of each vehicle. This effort necessitates two key foundations: a platform capable of generating data to facilitate the training and testing of V2X-AD, and a comprehensive system that integrates full driving-related functionalities with mechanisms for information sharing. From the platform perspective, we present <italic>V2Xverse</i>, a comprehensive simulation platform for collaborative autonomous driving. This platform provides a complete pipeline for collaborative driving: multi-agent driving dataset generation scheme, codebase for deploying full-stack collaborative driving systems, closed-loop driving performance evaluation with scenario customization. From the system perspective, we introduce <italic>CoDriving</i>, a novel end-to-end collaborative driving system that properly integrates V2X communication over the entire autonomous pipeline, promoting driving with shared perceptual information. The core idea is a novel driving-oriented communication strategy, that is, selectively complementing the driving-critical regions in single-view using sparse yet informative perceptual cues. Leveraging this strategy, CoDriving improves driving performance while optimizing communication efficiency. We make comprehensive benchmarks with V2Xverse, analyzing both modular performance and closed-loop driving performance. Experimental results show that CoDriving: i) significantly improves the driving score by 62.49% and drastically reduces the pedestrian collision rate by 53.50% compared to the SOTA end-to-end driving method, and ii) achieves sustaining driving performance superiority over dynamic constraint communication conditions.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6566-6584"},"PeriodicalIF":0.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AnyDoor: Zero-Shot Image Customization With Region-to-Region Reference","authors":"Xi Chen;Lianghua Huang;Yu Liu;Yujun Shen;Deli Zhao;Hengshuang Zhao","doi":"10.1109/TPAMI.2025.3562237","DOIUrl":"10.1109/TPAMI.2025.3562237","url":null,"abstract":"This work presents <bold>AnyDoor</b>, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations with desired shapes. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we leverage the powerful self-supervised image encoder (i.e., DINOv2) to extract the discriminative dentity feature of the target object. Besides, we complement the identity feature with detail features, which are carefully designed to maintain appearance details yet allow versatile local variations (e.g., lighting, orientation, posture, <italic>etc.</i>), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Starting from the task of object insertion, we further extend the framework of AnyDoor to a general solution with region-to-region image reference. With the different definitions of the source region and target region, the tasks of object insertion, object removal, and image variation could be integrated into one model without introducing extra parameters. In addition, we investigate incorporating other conditions like the mask, pose skeleton, and depth map as additional guidance to achieve more controllable generation.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6480-6495"},"PeriodicalIF":0.0,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HOT: An Efficient Halpern Accelerating Algorithm for Optimal Transport Problems","authors":"Guojun Zhang;Zhexuan Gu;Yancheng Yuan;Defeng Sun","doi":"10.1109/TPAMI.2025.3564353","DOIUrl":"10.1109/TPAMI.2025.3564353","url":null,"abstract":"This paper proposes an efficient HOT algorithm for solving the optimal transport (OT) problems with finite supports. We particularly focus on an efficient implementation of the HOT algorithm for the case where the supports are in <inline-formula><tex-math>$mathbb {R}^{2}$</tex-math></inline-formula> with ground distances calculated by <inline-formula><tex-math>$L_{2}^{2}$</tex-math></inline-formula>-norm. Specifically, we design a Halpern accelerating algorithm to solve the equivalent reduced model of the discrete OT problem. Moreover, we derive a novel procedure to solve the involved linear systems in the HOT algorithm in linear time complexity. Consequently, we can obtain an <inline-formula><tex-math>$varepsilon$</tex-math></inline-formula>-approximate solution to the optimal transport problem with <inline-formula><tex-math>$M$</tex-math></inline-formula> supports in <inline-formula><tex-math>$O(M^{1.5}/varepsilon )$</tex-math></inline-formula> flops, which significantly improves the best-known computational complexity. We further propose an efficient procedure to recover an optimal transport plan for the original OT problem based on a solution to the reduced model, thereby overcoming the limitations of the reduced OT model in applications that require the transport plan. We implement the HOT algorithm in PyTorch and extensive numerical results show the superior performance of the HOT algorithm compared to existing state-of-the-art algorithms for solving the OT problems.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6703-6714"},"PeriodicalIF":0.0,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}