IEEE Transactions on Pattern Analysis and Machine Intelligence: Latest Articles

Dataset Distillation via a Noise-Unconstrained Generative Model.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-06. DOI: 10.1109/TPAMI.2026.3690778
Jingxuan Zhang, Lei Dai, Fei Ye, Zhihua Chen, Ping Li, Xiaokang Yang, Bin Sheng
{"title":"Dataset Distillation via a Noise-Unconstrained Generative Model.","authors":"Jingxuan Zhang, Lei Dai, Fei Ye, Zhihua Chen, Ping Li, Xiaokang Yang, Bin Sheng","doi":"10.1109/TPAMI.2026.3690778","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690778","url":null,"abstract":"<p><p>Dataset distillation (DD) aims to synthesize a more compact dataset than the original one and models trained on it are expected to have the same generalization capabilities as on the original dataset. Previous work via a generative model (GM) faces several limitations. First, GM struggles to generate representative samples due to a lack of constraints. Second, it overlooks the relationships between generated samples, limiting its effectiveness. In this paper, a new noise-unconstrained GM-based DD framework is proposed. In the distillation stage, an adaptive matching coefficient is introduced to align generated images with representative class elements and the MiniMax loss function is extended to reduce the optimization difficulty. In the deployment stage, features among each generative image are ensembled by gradient-matching based DD. Theoretical analysis based on McDiarmid's inequality demonstrates that the proposed components can reduce the generalization error of the original baseline method. We also provide insights into the potential of generated images as an effective proxy dataset for DD. For example, on the ImageWoof dataset with 50 distilled images per class using a 6-layer ConvNet for evaluation, generated images outperform 25%, 50%, and 75% original images by 8.4%, 6.3%, and 8.3% in distillation performance. Our method effectively handles both low- and high-resolution datasets, with experiments on 11 benchmarks demonstrating its efficacy.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-05. DOI: 10.1109/TPAMI.2026.3690655
Lilika Makabe, Kohei Ashida, Hiroaki Santo, Fumio Okura, Yasuyuki Matsushita
{"title":"DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity.","authors":"Lilika Makabe, Kohei Ashida, Hiroaki Santo, Fumio Okura, Yasuyuki Matsushita","doi":"10.1109/TPAMI.2026.3690655","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690655","url":null,"abstract":"<p><p>Multi-view 3D reconstruction, namely, structure-from-motion followed by multi-view stereo, is a fundamental component of 3D computer vision. In general, multi-view 3D reconstruction suffers from an unknown scale ambiguity unless a reference object of known size is present in the scene. In this article, we show that multi-view images captured using a dual-pixel (DP) sensor can automatically resolve the scale ambiguity, without requiring a reference object or prior calibration. Specifically, the defocus blur observed in DP images provides sufficient information to determine the absolute scale when paired with depth maps (up to scale) recovered from multi-view 3D reconstruction. Based on this observation, we develop a simple yet effective linear method to estimate the absolute scale, followed by the intensity-based optimization stage that aligns the left and right DP images by shifting them back toward each other using cross-view blur kernels. Experiments demonstrate the effectiveness of the proposed approach across diverse scenes captured with different cameras and lenses. Code and data are available at https://github.com/lilika-makabe/dp-sfm-tpami.git.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
OmniCharacter++: Towards Comprehensive Benchmark for Realistic Role-Playing Agents.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-05. DOI: 10.1109/TPAMI.2026.3690447
Haonan Zhang, Pengpeng Zeng, Ji Zhang, Jingkuan Song, Nicu Sebe, Heng Tao Shen, Lianli Gao
{"title":"OmniCharacter++: Towards Comprehensive Benchmark for Realistic Role-Playing Agents.","authors":"Haonan Zhang, Pengpeng Zeng, Ji Zhang, Jingkuan Song, Nicu Sebe, Heng Tao Shen, Lianli Gao","doi":"10.1109/TPAMI.2026.3690447","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690447","url":null,"abstract":"<p><p>Existing Role-Playing Agents (RPAs), powered by large language models, are predominantly evaluated on static, text-only, dyadic conversations, which inadequately reflect the complexity of realistic human interactions involving multiple interlocutors and multi-modal communication. To bridge this gap, we propose OmniCharacter++, the first benchmark for evaluating multi-character interactions in a joint text-speech context. Specifically, OmniCharacter++ contributes: (1) a large-scale dataset comprising 10,287 characters, 118,017 multi-turn dialogues, and over one million audio responses across 8 open-world topics and 31 subfields, covering diverse multi-modal role-playing scenarios; (2) a comprehensive evaluation suite for dialogue understanding, generation quality, and perceptual naturalness; and (3) UniCharacter-7B, a unified text-speech model trained on this dataset to manage complex multi-character dynamics, ensuring both role-specific vocal fidelity and cross-participant semantic alignment. Experimental results demonstrate that UniCharacter-7B achieves more realistic and consistent role-playing responses in terms of both attractiveness and consistency, while also highlighting that OmniCharacter++ poses substantial challenges for state-of-the-art models, charting a clear path for future research. The Code is publicly available at: https://github.com/zchoi/OmniCharacter-plus.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Simulating the Real World: A Unified Survey of Multimodal Generative Models.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-05. DOI: 10.1109/TPAMI.2026.3690925
Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
{"title":"Simulating the Real World: A Unified Survey of Multimodal Generative Models.","authors":"Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong","doi":"10.1109/TPAMI.2026.3690925","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690925","url":null,"abstract":"<p><p>Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey for multimodal generative models that investigate the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrate all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D, and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions to foster insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Trusted Multi-View Learning under Noisy Supervision.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-05. DOI: 10.1109/TPAMI.2026.3690466
Yilin Zhang, Cai Xu, Han Jiang, Ziyu Guan, Wei Zhao, Xiaofei He, Murat Sensoy
{"title":"Trusted Multi-View Learning under Noisy Supervision.","authors":"Yilin Zhang, Cai Xu, Han Jiang, Ziyu Guan, Wei Zhao, Xiaofei He, Murat Sensoy","doi":"10.1109/TPAMI.2026.3690466","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690466","url":null,"abstract":"<p><p>Multi-view learning methods often focus on improving decision accuracy while neglecting the decision uncertainty, which significantly restricts their applications in safety-critical scenarios. To address this, trusted multi-view learning methods estimate prediction uncertainties by learning class distributions from each instance. However, these methods heavily rely on high-quality ground-truth labels. This motivates us to delve into a new problem: how to develop a reliable multi-view learning model under the guidance of noisy labels? We propose the Trusted Multi-view Noise Refining (TMNR) method to address this challenge by modeling label noise arising from low-quality data features and easily-confused classes. TMNR employs evidential deep neural networks to construct view-specific opinions that capture both beliefs and uncertainty. These opinions are then transformed through noise correlation matrices to align with the noisy supervision, where matrix elements are constrained by sample uncertainty to reflect label reliability. Furthermore, considering the challenge of jointly optimizing the evidence network and noise correlation matrices under noisy supervision, we further propose Trusted Multi-view Noise Re-Refining (TMNR$^{mathbf{2}}$), which disentangles this complex co-training problem by establishing different training objectives for distinct modules. TMNR$^{mathbf{2}}$ identifies potentially mislabeled samples through evidence-label consistency and generates pseudo-labels from neighboring information. By assigning clean samples to optimize evidential networks and noisy samples to guide noise correlation matrices, respectively, TMNR$^{mathbf{2}}$ reduces mapping interference and achieves stabilized training. We empirically evaluate our methods against state-of-the-art baselines on 7 multi-view datasets. Experimental results demonstrate that TMNR$^{mathbf{2}}$ significantly outperforms baseline methods, with average accuracy improvements of 7% on datasets with 50% label noise. The code and appendix are released at https://github.com/YilinZhang107/TMNRR.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
IntentQA: Intent Question Answering in Videos by Cognitive Context Reasoning.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-05. DOI: 10.1109/TPAMI.2026.3690561
Jiapeng Li, Ping Wei, Wenjuan Han, Song-Chun Zhu, Lifeng Fan
{"title":"IntentQA: Intent Question Answering in Videos by Cognitive Context Reasoning.","authors":"Jiapeng Li, Ping Wei, Wenjuan Han, Song-Chun Zhu, Lifeng Fan","doi":"10.1109/TPAMI.2026.3690561","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690561","url":null,"abstract":"<p><p>Video understanding requires intelligent agents to transcend mere recognition of visual facts and comprehend the underlying intents behind human actions-often termed the \"dark matter\" of social intelligence. To bridge the gap between visual observation and intent reasoning, we introduce a novel task, IntentQA, and contribute a large-scale VideoQA dataset specifically tailored for this purpose. However, recognizing that standard metrics may overestimate capabilities due to dataset biases, we go beyond simple accuracy to rigorously evaluate model robustness. We augment the benchmark by generating five distinct contrast sets via Large Language Models (LLMs) and introducing a \"Contrast Performance Decline\" metric. We propose the X-CaVIR (eXplainable Context-aware Video Intent Reasoning) framework, which leverages three types of \"Cognitive Context\" to enhance video analysis: i) Situational Context via a cross-modal Video Query Language (VQL) module, ii) Contrastive Context via a Contrastive Learning module, and iii) Commonsense Context via a Commonsense Reasoning module. Crucially, to overcome the opacity of traditional black-box models, we refine the integration of LLMs within X-CaVIR by employing a transparent pipeline that synergizes video captions with VQA model outputs. This approach not only improves performance by effectively utilizing rich commonsense knowledge but also renders the reasoning process explicitly interpretable. Extensive experiments demonstrate the effectiveness of our components, the superiority of X-CaVIR over state-of-the-art baselines, and its stability against perturbations on the contrast sets. The dataset and codes are open-sourced at: https://github.com/JoseponLee/IntentQA.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Spectral-Adaptive Modulation Networks for Visual Perception.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-05. DOI: 10.1109/TPAMI.2026.3690455
Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Paul Hongsuck Seo, Dong Hwan Kim
{"title":"Spectral-Adaptive Modulation Networks for Visual Perception.","authors":"Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Paul Hongsuck Seo, Dong Hwan Kim","doi":"10.1109/TPAMI.2026.3690455","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690455","url":null,"abstract":"<p><p>Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective in high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a spectral-adaptive modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
InfBA: Interference-Free Bottleneck Adaptation for Continual Learning.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-05. DOI: 10.1109/TPAMI.2026.3690676
Yan-Shuo Liang, Wu-Jun Li
{"title":"InfBA: Interference-Free Bottleneck Adaptation for Continual Learning.","authors":"Yan-Shuo Liang, Wu-Jun Li","doi":"10.1109/TPAMI.2026.3690676","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690676","url":null,"abstract":"<p><p>Continual learning requires the model to learn multiple tasks sequentially. In continual learning, the model should possess the ability to maintain its performance on old tasks (stability) and the ability to adapt to new tasks continuously (plasticity). Recently, parameter-efficient fine-tuning (PEFT), which involves freezing a pre-trained model and injecting a small number of learnable parameters to adapt to downstream tasks, has gained increasing popularity in continual learning. Although existing continual learning methods based on PEFT have demonstrated superior performance compared to those not based on PEFT, most of them do not consider how to eliminate the interference of the new task on the old tasks, which inhibits the model from making a good trade-off between stability and plasticity. In this work, we propose a new PEFT method, called interference-free bottleneck adaptation (InfBA), for continual learning. InfBA adopts a bottleneck architecture, which decreases the dimensionality of the embedding first and then increases it. Since bottleneck architecture has been utilized by many existing PEFT methods such as Adapter, LoRA and Prefix-tuning, InfBA provides a framework to integrate with these methods. InfBA constrains the update within a subspace, and designs this subspace to eliminate the interference of the new task on the old tasks, making a good trade-off between stability and plasticity. Experimental results on multiple datasets show that our methods consistently outperform existing state-of-the-art continual learning methods based on PEFT.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fractal-domain Vision Graph Neural Network for Remote Sensing Ground Target Classification.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-05. DOI: 10.1109/TPAMI.2026.3690544
Jiacheng Yin, Tao Zhen, Gang Xiong, Wenxian Yu
{"title":"Fractal-domain Vision Graph Neural Network for Remote Sensing Ground Target Classification.","authors":"Jiacheng Yin, Tao Zhen, Gang Xiong, Wenxian Yu","doi":"10.1109/TPAMI.2026.3690544","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690544","url":null,"abstract":"<p><p>To the best of our knowledge, this paper is the first to integrate fractal signal processing with vision graph neural networks, establishing a new graph representation learning paradigm consistent with fractal dynamics. Building on this foundation, we propose a Fractal-domain Vision Graph Neural Network (FD-ViG). Specifically, FD-ViG includes: (i) a Fractal-Domain Learning Module that maps images into the fractal-domain using local Hölder exponents and the Singularity Power Spectrum (SPS), enabling fractal-spatial feature fusion; (ii) a Fractal Graph Construction Module that adaptively generates a topology by combining semantic attention with fractal similarity in the fractal feature space; and (iii) a Graph Propagation Module with power-law multi-scale propagation to realize cross-scale diffusion and aggregation, enabling coupled texture-structure learning. Experiments on UCMerced, RSSCN7, and SIRI-WHU achieve overall accuracies of 91.75%, 89.52%, and 92.78%, respectively. Compared with representative vision graph models such as ViG, WiGNet, and ViHGNN, our method achieves consistent improvements over prior methods across all three datasets, while remaining lightweight (2.6M parameters). Moreover, despite having far fewer parameters than ResNet-18, our model yields competitive or better performance on two datasets, and further demonstrates strong generalization ability in cross-dataset evaluation on SAR imagery. This work provides a principled and effective bridge between fractal theory and graph deep learning, benefiting interpretable remote sensing scene understanding under complex textures and structures.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hierarchical Mesh Representation Learning With Spectral Dictionary Embedding.
IF 18.6
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2026-05-04. DOI: 10.1109/TPAMI.2026.3690051
Zhongpai Gao, Junchi Yan, Tianyu Luan, Guangtao Zhai, Xiaokang Yang
{"title":"Hierarchical Mesh Representation Learning With Spectral Dictionary Embedding.","authors":"Zhongpai Gao, Junchi Yan, Tianyu Luan, Guangtao Zhai, Xiaokang Yang","doi":"10.1109/TPAMI.2026.3690051","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3690051","url":null,"abstract":"<p><p>Learning mesh representation is important for many 3D tasks. Conventional convolution for regular data (i.e., images) cannot directly be applied to meshes since each vertex's neighbors are unordered. Previous methods use isotropic filters or predefined local coordinate systems or learning weighting matrices for each template vertex to overcome the irregularity. Learning weighting matrices to resample the vertex's neighbors into an implicit canonical order is the most effective way to capture the local structure of each vertex. However, learning weighting matrices for each vertex increases the model size linearly with the vertex number. Thus, large parameters are required for high-resolution 3D shapes, which is not favorable for many applications. In this paper, we learn spectral dictionary (i.e., bases) for the weighting matrices such that the model size is independent of the resolution of 3D shapes. The coefficients of the weighting matrix bases are learned from the spectral features of the template and its hierarchical levels in a weight-sharing manner. Furthermore, we introduce an adaptive sampling method that learns the hierarchical mapping matrices directly to improve the performance without increasing the model size at the inference stage. Comprehensive experiments demonstrate that our model produces state-of-the-art results with a much smaller model size.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0