{"title":"Say No to Freeloader: Protecting Intellectual Property of Your Deep Model","authors":"Lianyu Wang;Meng Wang;Huazhu Fu;Daoqiang Zhang","doi":"10.1109/TPAMI.2024.3450282","DOIUrl":"10.1109/TPAMI.2024.3450282","url":null,"abstract":"Model intellectual property (IP) protection has gained attention due to the significance of safeguarding intellectual labor and computational resources. Ensuring IP safety for trainers and owners is critical, especially when ownership verification and applicability authorization are required. A notable approach involves preventing the transfer of well-trained models from authorized to unauthorized domains. We introduce a novel Compact Un-transferable Pyramid Isolation Domain (CUPI-Domain) which serves as a barrier against illegal transfers from authorized to unauthorized domains. Inspired by human transitive inference, the CUPI-Domain emphasizes distinctive style features of the authorized domain, leading to failure in recognizing irrelevant private style features on unauthorized domains. To this end, we propose CUPI-Domain generators, which select features from both authorized and CUPI-Domain as anchors. These generators fuse the style features and semantic features to create labeled, style-rich CUPI-Domain. Additionally, we design external Domain-Information Memory Banks (DIMB) for storing and updating labeled pyramid features to obtain stable domain class features and domain class-wise style features. Based on the proposed whole method, the novel style and discriminative loss functions are designed to effectively enhance the distinction in style and discriminative features between authorized and unauthorized domains. We offer two solutions for utilizing CUPI-Domain based on whether the unauthorized domain is known: target-specified CUPI-Domain and target-free CUPI-Domain. Comprehensive experiments on various public datasets demonstrate the effectiveness of our CUPI-Domain approach with different backbone models, providing an efficient solution for model intellectual property protection.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11073-11086"},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142074880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gaseous Object Detection","authors":"Kailai Zhou;Yibo Wang;Tao Lv;Qiu Shen;Xun Cao","doi":"10.1109/TPAMI.2024.3449994","DOIUrl":"10.1109/TPAMI.2024.3449994","url":null,"abstract":"Object detection, a fundamental and challenging problem in computer vision, has experienced rapid development due to the effectiveness of deep learning. The current objects to be detected are mostly rigid solid substances with apparent and distinct visual characteristics. In this paper, we endeavor on a scarcely explored task named Gaseous Object Detection (GOD), which is undertaken to explore whether the object detection techniques can be extended from solid substances to gaseous substances. Nevertheless, the gas exhibits significantly different visual characteristics: 1) saliency deficiency, 2) arbitrary and ever-changing shapes, 3) lack of distinct boundaries. To facilitate the study on this challenging task, we construct a GOD-Video dataset comprising 600 videos (141,017 frames) that cover various attributes with multiple types of gases. A comprehensive benchmark is established based on this dataset, allowing for a rigorous evaluation of frame-level and video-level detectors. Deduced from the Gaussian dispersion model, the physics-inspired Voxel Shift Field (VSF) is designed to model geometric irregularities and ever-changing shapes in potential 3D space. By integrating VSF into Faster RCNN, the VSF RCNN serves as a simple but strong baseline for gaseous object detection. Our work aims to attract further research into this valuable albeit challenging area.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10715-10731"},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142074852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LTM-NeRF: Embedding 3D Local Tone Mapping in HDR Neural Radiance Field","authors":"Xin Huang;Qi Zhang;Ying Feng;Hongdong Li;Qing Wang","doi":"10.1109/TPAMI.2024.3448620","DOIUrl":"10.1109/TPAMI.2024.3448620","url":null,"abstract":"Recent advances in Neural Radiance Fields (NeRF) have provided a new geometric primitive for novel view synthesis. High Dynamic Range NeRF (HDR NeRF) can render novel views with a higher dynamic range. However, effectively displaying the scene contents of HDR NeRF on diverse devices with limited dynamic range poses a significant challenge. To address this, we present LTM-NeRF, a method designed to recover HDR NeRF and support 3D local tone mapping. LTM-NeRF allows for the synthesis of HDR views, tone-mapped views, and LDR views under different exposure settings, using only the multi-view multi-exposure LDR inputs for supervision. Specifically, we propose a differentiable Camera Response Function (CRF) module for HDR NeRF reconstruction, globally mapping the scene’s HDR radiance to LDR pixels. Moreover, we introduce a Neural Exposure Field (NeEF) to represent the spatially varying exposure time of an HDR NeRF to achieve 3D local tone mapping, for compatibility with various displays. Comprehensive experiments demonstrate that our method can not only synthesize HDR views and exposure-varying LDR views accurately but also render locally tone-mapped views naturally.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10944-10959"},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142044272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LIA: Latent Image Animator","authors":"Yaohui Wang;Di Yang;Francois Bremond;Antitza Dantcheva","doi":"10.1109/TPAMI.2024.3449075","DOIUrl":"10.1109/TPAMI.2024.3449075","url":null,"abstract":"Previous animation techniques mainly focus on leveraging explicit structure representations (\u0000<italic>e.g.</i>\u0000, meshes or keypoints) for transferring motion from driving videos to source images. However, such methods are challenged with large appearance variations between source and driving data, as well as require complex additional modules to respectively model appearance and motion. Towards addressing these issues, we introduce the Latent Image Animator (LIA), streamlined to animate high-resolution images. LIA is designed as a simple autoencoder that does not rely on explicit representations. Motion transfer in the pixel space is modeled as linear navigation of motion codes in the latent space. Specifically such navigation is represented as an orthogonal motion dictionary learned in a self-supervised manner based on proposed Linear Motion Decomposition (LMD). Extensive experimental results demonstrate that LIA outperforms state-of-the-art on VoxCeleb, TaichiHD, and TED-talk datasets with respect to video quality and spatio-temporal consistency. In addition LIA is well equipped for zero-shot high-resolution image animation. Code, models, and demo video are available at \u0000<uri>https://github.com/wyhsirius/LIA</uri>\u0000.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10829-10844"},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142044271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correlation-Embedded Transformer Tracking: A Single-Branch Framework","authors":"Fei Xie;Wankou Yang;Chunyu Wang;Lei Chu;Yue Cao;Chao Ma;Wenjun Zeng","doi":"10.1109/TPAMI.2024.3448254","DOIUrl":"10.1109/TPAMI.2024.3448254","url":null,"abstract":"Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by the Siamese-like networks are often insufficient to model the tracked targets and distractor objects, thereby hindering them from being robust and discriminative simultaneously. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike the Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it can suppress non-target features, resulting in target-aware feature extraction. The output features can be directly used to predict target locations without additional correlation steps. Thus, we reformulate the two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After conducting an in-depth analysis of the SBT baseline, we summarize many effective design principles and propose an improved tracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer to enhance shallow-level features. A unified relation modeling is proposed to remove complex handcrafted layer pattern designs. SuperSBT is further improved by masked image modeling pre-training, integrating temporal modeling, and equipping with dedicated prediction heads. Thus, SuperSBT outperforms the SBT baseline by 4.7%,3.0%, and 4.5% AUC scores in LaSOT, TrackingNet, and GOT-10K. Notably, SuperSBT greatly raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight VOT benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10681-10696"},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142037963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Cross-Lingual Summarization for Videos: A Revisit in Knowledge Distillation Induced Triple-Stage Training Method","authors":"Nayu Liu;Kaiwen Wei;Yong Yang;Jianhua Tao;Xian Sun;Fanglong Yao;Hongfeng Yu;Li Jin;Zhao Lv;Cunhang Fan","doi":"10.1109/TPAMI.2024.3447778","DOIUrl":"10.1109/TPAMI.2024.3447778","url":null,"abstract":"Multimodal summarization (MS) for videos aims to generate summaries from multi-source information (e.g., video and text transcript), showing promising progress recently. However, existing works are limited to monolingual scenarios, neglecting non-native viewers' needs to understand videos in other languages. It stimulates us to introduce multimodal cross-lingual summarization for videos (MCLS), which aims to generate cross-lingual summaries from multimodal input of videos. Considering the challenge of high annotation cost and resource constraints in MCLS, we propose a knowledge distillation (KD) induced triple-stage training method to assist MCLS by transferring knowledge from abundant monolingual MS data to those data with insufficient volumes. In the triple-stage training method, a video-guided dual fusion network (VDF) is designed as the backbone network to integrate multimodal and cross-lingual information through diverse fusion strategies in the encoder and decoder; What's more, we propose two cross-lingual knowledge distillation strategies: adaptive pooling distillation and language-adaptive warping distillation (LAWD), designed for encoder-level and vocab-level distillation objects to facilitate effective knowledge transfer across cross-lingual sequences of varying lengths between MS and MCLS models. Specifically, to tackle lingual sequences of varying lengths between MS and MCLS models. Specifically, to tackle the challenge of unequal length of parallel cross-language sequences in KD, LAWD can directly conduct cross-language distillation while keeping the language feature shape unchanged to reduce potential information loss. We meticulously annotated the How2-MCLS dataset based on the How2 dataset to simulate MCLS scenarios. Experimental results show that the proposed method achieves competitive performance compared to strong baselines, and can bring substantial performance improvements to MCLS models by transferring knowledge from the MS model.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10697-10714"},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142037964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey on Continual Semantic Segmentation: Theory, Challenge, Method and Application","authors":"Bo Yuan;Danpei Zhao","doi":"10.1109/TPAMI.2024.3446949","DOIUrl":"10.1109/TPAMI.2024.3446949","url":null,"abstract":"Continual learning, also known as incremental learning or life-long learning, stands at the forefront of deep learning and AI systems. It breaks through the obstacle of one-way training on close sets and enables continuous adaptive learning on open-set conditions. In the recent decade, continual learning has been explored and applied in multiple fields especially in computer vision covering classification, detection and segmentation tasks. Continual semantic segmentation (CSS), of which the dense prediction peculiarity makes it a challenging, intricate and burgeoning task. In this paper, we present a review of CSS, committing to building a comprehensive survey on problem formulations, primary challenges, universal datasets, neoteric theories and multifarious applications. Concretely, we begin by elucidating the problem definitions and primary challenges. Based on an in-depth investigation of relevant approaches, we sort out and categorize current CSS models into two main branches including \u0000<italic>data-replay</i>\u0000 and \u0000<italic>data-free</i>\u0000 sets. In each branch, the corresponding approaches are similarity-based clustered and thoroughly analyzed, following qualitative comparison and quantitative reproductions on relevant datasets. Besides, we also introduce four CSS specialities with diverse application scenarios and development tendencies. Furthermore, we develop a benchmark for CSS encompassing representative references, evaluation results and reproductions. We hope this survey can serve as a reference-worthy and stimulating contribution to the advancement of the life-long learning field, while also providing valuable perspectives for related fields.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10891-10910"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations","authors":"Hongrong Cheng;Miao Zhang;Javen Qinfeng Shi","doi":"10.1109/TPAMI.2024.3447085","DOIUrl":"10.1109/TPAMI.2024.3447085","url":null,"abstract":"Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models on resource-constrained environments and to accelerate inference time, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. More than three thousand pruning papers have been published from 2020 to 2024. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey, we provide a comprehensive review of existing research works on deep neural network pruning in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning and other compression techniques. We then provide a thorough comparative analysis of eight pairs of contrast settings for pruning (e.g., unstructured/structured, one-shot/iterative, data-free/data-driven, initialized/pre-trained weights, etc.) and explore several emerging topics, including pruning for large language models, vision transformers, diffusion models, and large multimodal models, post-training pruning, and different levels of supervision for pruning to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. Finally, we provide some valuable recommendations on selecting pruning methods and prospect several promising research directions for neural network pruning. To facilitate future research on deep neural network pruning, we summarize broad pruning applications (e.g., adversarial robustness, natural language understanding, etc.) and build a curated collection of datasets, networks, and evaluations on different applications. We maintain a repository on \u0000<uri>https://github.com/hrcheng1066/awesome-pruning</uri>\u0000 that serves as a comprehensive resource for neural network pruning papers and corresponding open-source codes. We will keep updating this repository to include the latest advancements in the field.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10558-10578"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Medical Image Segmentation Review: The Success of U-Net","authors":"Reza Azad;Ehsan Khodapanah Aghdam;Amelie Rauland;Yiwei Jia;Atlas Haddadi Avval;Afshin Bozorgpour;Sanaz Karimijafarbigloo;Joseph Paul Cohen;Ehsan Adeli;Dorit Merhof","doi":"10.1109/TPAMI.2024.3435571","DOIUrl":"10.1109/TPAMI.2024.3435571","url":null,"abstract":"Automatic medical image segmentation is a crucial topic in the medical domain and successively a critical counterpart in the computer-aided diagnosis paradigm. U-Net is the most widespread image segmentation architecture due to its flexibility, optimized modular design, and success in all medical image modalities. Over the years, the U-Net model has received tremendous attention from academic and industrial researchers who have extended it to address the scale and complexity created by medical tasks. These extensions are commonly related to enhancing the U-Net's backbone, bottleneck, or skip connections, or including representation learning, or combining it with a Transformer architecture, or even addressing probabilistic prediction of the segmentation map. Having a compendium of different previously proposed U-Net variants makes it easier for machine learning researchers to identify relevant research questions and understand the challenges of the biological tasks that challenge the model. In this work, we discuss the practical aspects of the U-Net model and organize each variant model into a taxonomy. Moreover, to measure the performance of these strategies in a clinical application, we propose fair evaluations of some unique and famous designs on well-known datasets. Furthermore, we provide a comprehensive implementation library with trained models. In addition, for ease of future studies, we created an online list of U-Net papers with their possible official implementation.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10076-10095"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"U-Match: Exploring Hierarchy-Aware Local Context for Two-View Correspondence Learning","authors":"Zizhuo Li;Shihua Zhang;Jiayi Ma","doi":"10.1109/TPAMI.2024.3447048","DOIUrl":"10.1109/TPAMI.2024.3447048","url":null,"abstract":"Rejecting outlier correspondences is one of the critical steps for successful feature-based two-view geometry estimation, and contingent heavily upon local context exploration. Recent advances focus on devising elaborate local context extractors whereas typically adopting \u0000<italic>explicit</i>\u0000 neighborhood relationship modeling at a specific scale, which is intrinsically flawed and inflexible, because 1) severe outliers often populated in putative correspondences and 2) the uncertainty in the distribution of inliers and outliers make the network incapable of capturing adequate and reliable local context from such neighborhoods, therefore resulting in the failure of pose estimation. This prospective study proposes a novel network called U-Match that has the flexibility to enable \u0000<italic>implicit</i>\u0000 local context awareness at multiple levels, naturally circumventing the aforementioned issues that plague most existing studies. Specifically, to aggregate multi-level local context implicitly, a hierarchy-aware graph representation module is designed to flexibly encode and decode hierarchical features. Moreover, considering that global context always works collaboratively with local context, an orthogonal local-and-global information fusion module is presented to integrate complementary local and global context in a redundancy-free manner, thus yielding compact feature representations to facilitate correspondence learning. Thorough experimentation across relative pose estimation, homography estimation, visual localization, and point cloud registration affirms U-Match's remarkable capabilities.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10960-10977"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}