{"title":"Towards OOD Object Detection with Unknown-Concept Guided Feature Diffusion.","authors":"Aming Wu,Cheng Deng","doi":"10.1109/tpami.2025.3590735","DOIUrl":"https://doi.org/10.1109/tpami.2025.3590735","url":null,"abstract":"In general, learning plentiful knowledge corresponding to known objects is an important ability for humans. The unknown objects could be assumed to depart from the familiar knowledge. Inspired by this idea, we explore leveraging the extracted knowledge to reason a set of unknown concepts. And they could be used to address unsupervised out-of-distribution object detection (OOD-OD) that aims to detect unseen OOD objects without accessing any auxiliary OOD data during training. To this end, we propose a new approach, i.e., Unknown-Concept Guided Feature Diffusion (UCFD), including an object-related knowledge extractor and an unknown-concept guided diffusor for synthesizing virtual OOD features. Specifically, we define multiple learnable codewords to capture object-relevant visual knowledge from all object categories. To avoid the detection performance degradation of the in-distribution (ID) objects, these codewords are utilized to enhance object features. Next, an unknown-concept pool is constructed by mixing up these extracted codewords. Finally, to reduce the impact of lacking OOD data for supervision, we design an unknown-concept guided diffusor, which leverages the sampled unknown concepts from the pool to guide the reverse process to generate expected OOD features that deviate from the familiar knowledge. The significant performance gains on three different tasks demonstrate the superiorities of our method. Meanwhile, extensive visualization results show that our method could synthesize effective virtual OOD features.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"9 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144661861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Zero-Shot NAS With Feature Map-Based Proxy and Operation Scoring Function.","authors":"Tangyu Jiang,Haodi Wang,Rongfang Bie,Chun Yuan","doi":"10.1109/tpami.2025.3590342","DOIUrl":"https://doi.org/10.1109/tpami.2025.3590342","url":null,"abstract":"Neural Architecture Search (NAS) has been extensively studied due to its ability in automatic architecture engineering. Existing NAS methods rely heavily on the gradients and data labels, which either incur immense computational costs or suffer from discretization discrepancy due to the supernet structure. Moreover, the majority of them are limited in generating diverse architectures. To alleviate these issues, in this paper, we propose a novel zero-cost proxy called $mathsf {MeCo}$ based on the Pearson correlation matrix of the feature maps. Unlike the previous work, the computation of $mathsf {MeCo}$ as well as its variant $mathsf {MeCo_{opt}}$ requires only one random data for a single forward pass. Based on the proposed zero-cost proxy, we further craft a new zero-shot NAS scheme called $mathsf {FLASH}$, which harnesses a new proxy-based operation scoring function and a greedy heuristic. Compared to the existing methods, $mathsf {FLASH}$ is highly efficient and can construct diverse model architectures instead of repeated cells. We design comprehensive experiments and extensively evaluate our designs on multiple benchmarks and datasets. The experimental results show that our method is one to six orders of magnitudes more efficient than the state-of-the-art baselines with the highest model accuracy.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"10 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144661859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aligning Text-to-Image Diffusion Models with Constrained Reinforcement Learning.","authors":"Ziyi Zhang,Sen Zhang,Li Shen,Yibing Zhan,Yong Luo,Han Hu,Bo Du,Yonggang Wen,Dacheng Tao","doi":"10.1109/tpami.2025.3590730","DOIUrl":"https://doi.org/10.1109/tpami.2025.3590730","url":null,"abstract":"Reward finetuning has emerged as a powerful technique for aligning diffusion models with specific downstream objectives or user preferences. However, current approaches suffer from a persistent challenge of reward overoptimization, where models exploit imperfect reward feedback at the expense of overall performance. In this work, we identify three key contributors to overoptimization: (1) a granularity mismatch between the multi-step diffusion process and sparse rewards; (2) a loss of plasticity that limits the model's ability to adapt and generalize; and (3) an overly narrow focus on a single reward objective that neglects complementary performance criteria. Accordingly, we introduce Constrained Diffusion Policy Optimization (CDPO), a novel reinforcement learning framework that addresses reward overoptimization from multiple angles. Firstly, CDPO tackles the granularity mismatch through a temporal policy optimization strategy that delivers step-specific rewards throughout the entire diffusion trajectory, thereby reducing the risk of overfitting to sparse final-step rewards. Then we incorporate a neuron reset strategy that selectively resets overactive neurons in the model, preventing overoptimization induced by plasticity loss. Finally, to avoid overfitting to a narrow reward objective, we integrate constrained reinforcement learning with auxiliary reward objectives serving as explicit constraints, ensuring a balanced optimization across diverse performance metrics.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"73 4 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144661862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration.","authors":"Divya Saxena,Jiannong Cao,Jiahao Xu,Tarun Kulshrestha","doi":"10.1109/tpami.2025.3590650","DOIUrl":"https://doi.org/10.1109/tpami.2025.3590650","url":null,"abstract":"The training of Generative Adversarial Networks (GANs) for high-fidelity images has predominantly relied on large-scale datasets. Emerging research, particularly on GANs 'lottery tickets', suggests that dense GANs models have sparse sub-networks capable of superior performance with limited data. However, the conventional process to uncover these 'lottery tickets' involves a resource-intensive train-prune-retrain cycle. Addressing this, our paper introduces Re-GAN, a novel, dataefficient approach for GANs training that dynamically reconfigures the GANs architecture during training. This method focuses on iterative pruning of non-important connections and regrowing them, thereby preventing premature loss of important features and maintaining the model's representational strength. Re-GAN provides a more stable and efficient solution for GANs models with limited data, offering an alternative to existing progressive growing methods and GANs tickets. While Re-GAN has already demonstrated its potential in image generation across diverse datasets, domains, and resolutions, in this paper, we significantly expand our study. We incorporate new applications, notably Image-to-Image translation, include additional datasets, provide in-depth analyses, and explore compatibility with data augmentation techniques. This expansion not only broadens the scope of Re-GAN but also establishes it as a generic training methodology, demonstrating its effectiveness and adaptability in different GANs scenarios. Code is available at https://github.com/IntellicentAI-lab/Re-GAN.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"672 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144661860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Dual-Pattern Matching with Vision-Language Representation for out-of-Distribution Detection.","authors":"Xiang Xiang,Zhuo Xu,Zihan Zhang,Zhigang Zeng,Xilin Chen","doi":"10.1109/tpami.2025.3590717","DOIUrl":"https://doi.org/10.1109/tpami.2025.3590717","url":null,"abstract":"Out-of-distribution (OOD) detection presents a significant challenge in deploying pattern recognition and machine learning models, as they frequently fail to generalize to data from unseen distributions. Recent advancements in vision-language models (VLMs), particularly CLIP, have demonstrated promising results in OOD detection through their rich multimodal representations. However, current CLIP-based OOD detection methods predominantly rely on single-modality in-distribution (ID) data (e.g., textual cues), overlooking the valuable information contained in ID visual cues. In this work, we demonstrate that incorporating ID visual information is crucial for unlocking CLIP's full potential in OOD detection. We propose a novel approach, Dual-Pattern Matching (DPM), which effectively adapts CLIP for OOD detection by jointly exploiting both textual and visual ID patterns. Specifically, DPM refines visual and textual features through the proposed Domain-Specific Feature Aggregation (DSFA) and Prompt Enhancement (PE) modules. Subsequently, DPM stores class-wise textual features as textual patterns and aggregates ID visual features as visual patterns. During inference, DPM calculates similarity scores relative to both patterns to identify OOD data. Furthermore, we enhance DPM with lightweight adaptation mechanisms to further boost OOD detection performance. Comprehensive experiments demonstrate that DPM surpasses state-of-the-art methods on multiple benchmarks, highlighting the effectiveness of leveraging multimodal information for OOD detection. The proposed dual-pattern approach provides a simple yet robust framework for leveraging vision-language representations in OOD detection tasks.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"109 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144661876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Parameter Adapter.","authors":"Peng Xing,Ning Wang,Jianbo Ouyang,Zechao Li","doi":"10.1109/tpami.2025.3590321","DOIUrl":"https://doi.org/10.1109/tpami.2025.3590321","url":null,"abstract":"The remarkable advancement in text-to-image generation models significantly boosts the research in ID customization generation. However, existing personalization methods cannot simultaneously satisfy high-fidelity and low-costs requirements. Their main bottleneck lies in the additional prompt image encoder (i.e., CLIP vision encoder), which produces weak alignment signals with the text-to-image model that may lose face information and is not well 'absorbed' by the text-to-image model. Towards this end, we propose Inv-Adapter, which first introduces a more reasonable and efficient token representation of ID image features and introduces a lightweight parameter adaptor to inject ID features. Specifically, our Inv-Adapter extracts diffusion-domain representations of ID images utilizing a pre-trained text-to-image model via DDIM image inversion, without an additional image encoder. Benefiting from the high alignment of the extracted ID prompt features and the intermediate features of the text-to-image model, we then introduce a lightweight attention adapter to embed them efficiently into the base text-to-image model. We conduct extensive experiments on different text-to-image models to assess ID fidelity, generation loyalty, speed, training costs, model scale and generalization ability in scenarios of general object, all of which show that the proposed Inv-Adapter is highly competitive in ID customization generation and model scale.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"24 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144652574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilingual-Prompt-Guided Directional Feature Learning for Weakly Supervised Video Anomaly Detection.","authors":"Chizhuo Xiao,Yang Xiao,Joey Tianyi Zhou,Zhiwen Fang","doi":"10.1109/tpami.2025.3590242","DOIUrl":"https://doi.org/10.1109/tpami.2025.3590242","url":null,"abstract":"Weakly supervised video anomaly detection has gained attention for its effective performance and cost-efficient annotation, using video-level labels to distinguish between normal and abnormal patterns. However, challenges arise from the diversity and incompleteness of anomalous events, complicating feature learning. Vision-language models offer promising approaches, but designing precise prompts remains difficult. This is because accommodating the diverse range of normal and anomalous scenarios in real-world settings is challenging, and the workload is significant. To tackle these issues, we propose integrating multilingualism and multiple prompts to improve feature learning. By utilizing prompts in various languages to define \"anomaly\" and \"normalcy,\" we tackle these concepts across different linguistic domains. In each domain, multiple prompts are employed for adaptive top-K prompt selection of snippets. To enhance visual feature learning, a multi-granularity attention module combining Transformer and Mamba is designed. Mamba's long-range adaptation selection builds fine-grained temporal correlations among coarse-grained snippets, while Transformer enhances fine-grained information guided by coarse-grained information. Alongside a multilingual prompt guidance loss, we introduce a gradual directional loss to jointly optimize visual feature distribution and the top-K prompt selection. Our method demonstrates effectiveness on four video datasets and provides generalizability analyses on two medical datasets, including EMG and ECG temporal data.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"18 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144652910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FLAG3D++: A Benchmark for 3D Fitness Activity Comprehension With Language Instruction.","authors":"Yansong Tang,Aoyang Liu,Jinpeng Liu,Shiyi Zhang,Wenxun Dai,Jie Zhou,Xiu Li,Jiwen Lu","doi":"10.1109/tpami.2025.3590012","DOIUrl":"https://doi.org/10.1109/tpami.2025.3590012","url":null,"abstract":"Recent years have witnessed the rapid development of general human action understanding. However, when applied to real-world applications such as sports analysis, most existing datasets are still unsatisfactory, because of the limitations in rich labels on multiple tasks, language instructions, high-quality 3D data, and diverse environments. In this paper, we present FLAG3D++, a large-scale benchmark for 3D fitness activity comprehension, which contains 180 K sequences of 60 activity categories with language instruction. FLAG3D++ features the following four aspects: 1) fine-grained annotations of the temporal intervals of actions in the untrimmed long sequences and how well these actions are performed, 2) detailed and professional language instruction to describe how to perform a specific activity, 3) accurate and dense 3D human pose captured from advanced MoCap system to handle the complex activity and large movement, 4) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. In light of the specified features, we present two new practical applications as language-guided repetition action counting (L-RAC) and language-guided action quality assessment (L-AQA), which aim to take the language descriptions as references to count the repetitive times of an action and assess the quality of action respectively. Furthermore, we propose a Hierarchical Language-Guided Graph Convolutional Network (HL-GCN) model to better fuse the language information and skeleton sequences for L-RAC and L-AQA. To be specific, the HL-GCN performs cross-modal alignments by the early fusion of the linguistic feature and the hierarchical node features of the skeleton-based sequences encoded by the multiple intermediate graph convolutional layers. Extensive experiments show the superiority of our HL-GCN on both L-RAC and L-AQA, as well as the great research value of FLAG3D++ for various challenges, such as dynamic human mesh recovery and cross-domain human action recognition. Our dataset, source code, and trained models are made publicly available at FLAG3D++.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"14 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144652577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pro-NeXt: An All-in-One Unified Model for General Fine-Grained Visual Recognition.","authors":"Junde Wu,Jiayuan Zhu,Min Xu,Yueming Jin","doi":"10.1109/tpami.2025.3584902","DOIUrl":"https://doi.org/10.1109/tpami.2025.3584902","url":null,"abstract":"Unlike general visual classification (CLS) tasks, certain CLS problems are significantly more challenging as they involve recognizing professionally categorized or highly specialized images. Fine-Grained Visual Classification (FGVC) has emerged as a broad solution to address this complexity. However, most existing methods have been predominantly evaluated on a limited set of homogeneous benchmarks, such as bird species or vehicle brands. Moreover, these approaches often train separate models for each specific task, which restricts their generalizability. This paper proposes a scalable and explainable foundational model designed to tackle a wide range of FGVC tasks from a unified and generalizable perspective. We introduce a novel architecture named Pro-NeXt and reveal that Pro-NeXt exhibits substantial generalizability across diverse professional fields such as fashion, medicine, and art areas, previously considered disparate. Our basic-sized Pro-NeXt-B surpasses all preceding task-specific models across 12 distinct datasets within 5 diverse domains. Furthermore, we find its good scaling property that scaling up Pro-NeXt in depth and width with increasing GFlops can consistently enhance its accuracy. Beyond scalability and adaptability, the intermediate features of Pro-NeXt achieve reliable object detection and segmentation performance without extra training, highlighting its solid explainability. We will release the code to promote further research in this area.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"13 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144645756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}