IEEE Transactions on Pattern Analysis and Machine Intelligence: Latest Articles

Towards High-Quality and Disentangled Face Editing in a 3D GAN
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2025-01-06. DOI: 10.1109/TPAMI.2024.3523422
Kaiwen Jiang; Shu-Yu Chen; Feng-Lin Liu; Hongbo Fu; Lin Gao
Abstract: Recent methods for synthesizing 3D-aware face images have developed rapidly thanks to neural radiance fields, achieving high quality and fast inference speed. However, existing solutions for independently editing facial geometry and appearance usually require retraining and are not tailored to recent generators, so they tend to lag behind the generation process. To address these issues, we introduce NeRFFaceEditing, which enables editing and decoupling of geometry and appearance in a pretrained tri-plane-based neural radiance field while retaining its high quality and fast inference speed. Our key idea for disentanglement is to use the statistics of the tri-plane to represent the high-level appearance of its corresponding facial volume. Moreover, we leverage a generated 3D-continuous semantic mask as an intermediary for geometry editing. We devise a geometry decoder (whose output is unchanged when the appearance changes) and an appearance decoder. The geometry decoder aligns the original facial volume with the semantic mask volume. We further enhance the disentanglement by explicitly regularizing rendered images with the same appearance but different geometry to be similar in color distribution for each facial component separately. Our method allows users to edit via semantic masks with decoupled control of geometry and appearance. Both qualitative and quantitative evaluations show the superior geometry and appearance control of our method compared to existing and alternative solutions.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2533-2544. Citations: 0
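The abstract's idea of summarizing a feature volume by its statistics to represent appearance resembles the well-known AdaIN-style use of per-channel mean and standard deviation. As a loose analogy only (our illustration, not the paper's actual decoder design), swapping such statistics between two feature planes looks like this:

```python
import numpy as np

def appearance_stats(F):
    # Per-channel mean/std of a (C, H, W) feature plane, used as a crude
    # "appearance code" in the AdaIN sense (analogy, not NeRFFaceEditing itself).
    return F.mean(axis=(1, 2)), F.std(axis=(1, 2))

def swap_appearance(F_geo, F_app, eps=1e-6):
    # Renormalize F_geo so its channel statistics match those of F_app,
    # keeping F_geo's spatial structure ("geometry") intact.
    m_g, s_g = appearance_stats(F_geo)
    m_a, s_a = appearance_stats(F_app)
    normalized = (F_geo - m_g[:, None, None]) / (s_g[:, None, None] + eps)
    return normalized * s_a[:, None, None] + m_a[:, None, None]

rng = np.random.default_rng(1)
F1 = rng.normal(0.0, 1.0, (8, 16, 16))   # "geometry" source
F2 = rng.normal(3.0, 2.0, (8, 16, 16))   # "appearance" source
F_swapped = swap_appearance(F1, F2)
```

After the swap, `F_swapped` carries `F2`'s channel statistics on `F1`'s spatial layout.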
Instruction-Guided Scene Text Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2025-01-03. DOI: 10.1109/TPAMI.2025.3525526
Yongkun Du; Zhineng Chen; Yuchen Su; Caiyan Jia; Yu-Gang Jiang
Abstract: Multi-modal models have shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models cannot be trivially applied to scene text recognition (STR) due to the compositional difference between natural and text images. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency, position, etc. IGTR first devises ⟨condition, question, answer⟩ instruction triplets, providing rich and diverse descriptions of character attributes. To learn these attributes effectively through question answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module, and a multi-task answer head, which together guide nuanced text-image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs considerably from current methods. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of rarely appearing and morphologically similar characters, which were previously challenging.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2723-2738. Citations: 0
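The abstract does not spell out the ⟨condition, question, answer⟩ triplets concretely; a hypothetical sketch of how character-attribute triplets of this shape might be built from a ground-truth label (the attribute set follows the abstract's examples of frequency and position, but the exact phrasing is our assumption, not the paper's):

```python
# Hypothetical <condition, question, answer> triplet construction for a
# text-image label. Attribute names and wording are illustrative only.
from collections import Counter

def make_triplets(label: str):
    triplets = []
    cond = f"the image contains {len(label)} characters"
    # Position attribute: which character appears at each index.
    for i, ch in enumerate(label):
        triplets.append((cond, f"which character is at position {i}?", ch))
    # Frequency attribute: how often each distinct character appears.
    for ch, n in Counter(label).items():
        triplets.append((cond, f"how many times does '{ch}' appear?", str(n)))
    return triplets

triplets = make_triplets("cool")
```

Sampling different subsets of such triplets is one way to realize the abstract's claim that different instructions yield different recognition pipelines.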
Generalized Task-Driven Medical Image Quality Enhancement With Gradient Promotion
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2025-01-03. DOI: 10.1109/TPAMI.2025.3525671
Dong Zhang; Kwang-Ting Cheng
Abstract: Thanks to recent achievements in task-driven image quality enhancement (IQE) models such as ESTR (Liu et al. 2023), the image enhancement model and the visual recognition model can mutually improve each other's quantitative performance while producing high-quality processed images that are perceivable by human vision systems. However, existing task-driven IQE models tend to overlook an underlying fact: different levels of vision tasks have varying, and sometimes conflicting, requirements of image features. To address this problem, this paper proposes a generalized gradient promotion (GradProm) training strategy for task-driven IQE of medical images. Specifically, we partition a task-driven IQE system into two sub-models: a mainstream model for image enhancement and an auxiliary model for visual recognition. During training, GradProm updates only the parameters of the image enhancement model, using the gradients of both sub-models, but only when these gradients are aligned in the same direction, as measured by their cosine similarity. When the two gradients are not in the same direction, GradProm uses only the gradient of the image enhancement model to update its parameters. Theoretically, we prove that the optimization direction of the image enhancement model is not biased by the auxiliary visual recognition model under GradProm. Empirically, extensive experiments on four public yet challenging medical image datasets demonstrate the superior performance of GradProm over existing state-of-the-art methods.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2785-2798. Citations: 0
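The cosine-similarity gating rule described in the abstract can be sketched in a few lines (a minimal numpy sketch on plain vectors; in practice the gradients would come from the two sub-networks' parameters, and the variable names are ours):

```python
import numpy as np

def gradprom_step(params, g_enh, g_rec, lr=0.1):
    # Gate on gradient alignment: add the recognition gradient only when
    # its cosine similarity with the enhancement gradient is positive.
    cos = g_enh @ g_rec / (np.linalg.norm(g_enh) * np.linalg.norm(g_rec) + 1e-12)
    g = g_enh + g_rec if cos > 0 else g_enh
    return params - lr * g

p = np.zeros(2)
# Aligned gradients: both sub-models contribute to the update.
p1 = gradprom_step(p, np.array([1.0, 0.0]), np.array([1.0, 1.0]))
# Conflicting gradients: only the enhancement gradient is used.
p2 = gradprom_step(p, np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
```

This gate is what keeps the enhancement model's optimization direction unbiased by the auxiliary task, per the abstract's theoretical claim.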
DeepSN-Net: Deep Semi-Smooth Newton Driven Network for Blind Image Restoration
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2025-01-02. DOI: 10.1109/TPAMI.2024.3525089
Xin Deng; Chenxiao Zhang; Lai Jiang; Jingyuan Xia; Mai Xu
Abstract: The deep unfolding network represents a promising research avenue in image restoration. However, most current deep unfolding methods are anchored in first-order optimization algorithms, which suffer from slow convergence and unsatisfactory learning efficiency. In this paper, to address this issue, we first formulate an improved second-order semi-smooth Newton (ISN) algorithm, transforming the original nonlinear equations into an optimization problem amenable to network implementation. We then propose an innovative network architecture based on the ISN algorithm for blind image restoration, namely DeepSN-Net. To the best of our knowledge, DeepSN-Net is the first successful second-order deep unfolding network for image restoration, filling a gap in this area. Furthermore, it offers several distinct advantages: 1) DeepSN-Net provides a unified framework for a variety of image restoration tasks in both synthetic and real-world contexts, without imposing constraints on the degradation conditions. 2) The network architecture is closely aligned with the ISN algorithm, so each module has a clear physical interpretation. 3) The network exhibits high learning efficiency, superior restoration accuracy, and good generalization across 11 datasets on three typical restoration tasks. The success of DeepSN-Net may inspire subsequent work centered on second-order optimization algorithms, to the benefit of the community.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2632-2646. Citations: 0
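The abstract's motivation, that first-order iterations converge slowly compared to Newton-type second-order iterations, can be illustrated on a toy root-finding problem (our own generic example with made-up step sizes, not the paper's ISN equations): both schemes solve f(x) = x² − 2 = 0, the first-order one by gradient descent on f(x)².

```python
# Generic first- vs second-order convergence comparison (toy example).
def solve(step, x0=2.0, tol=1e-8, max_iter=1000):
    x, n = x0, 0
    while abs(x * x - 2.0) > tol and n < max_iter:
        x = step(x)
        n += 1
    return x, n

newton = lambda x: x - (x * x - 2.0) / (2.0 * x)       # Newton step
grad = lambda x: x - 0.05 * 4.0 * x * (x * x - 2.0)    # GD on f(x)^2, lr=0.05

x_n, iters_n = solve(newton)
x_g, iters_g = solve(grad)
```

Newton's quadratic local convergence reaches the tolerance in a handful of iterations, while the first-order scheme contracts only linearly, which is the gap the ISN-based unfolding aims to exploit.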
Interpretable Optimization-Inspired Unfolding Network for Low-Light Image Enhancement
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2025-01-01. DOI: 10.1109/TPAMI.2024.3524538
Wenhui Wu; Jian Weng; Pingping Zhang; Xu Wang; Wenhan Yang; Jianmin Jiang
Abstract: Retinex model-based methods have proven effective for layer-wise manipulation with well-designed priors in low-light image enhancement (LLIE). However, the hand-crafted priors and conventional optimization algorithms adopted to solve the layer decomposition problem lack adaptivity and efficiency. To this end, this paper proposes a Retinex-based deep unfolding network (URetinex-Net++), which unfolds an optimization problem into a learnable network to decompose a low-light image into reflectance and illumination layers. By formulating the decomposition problem as an implicit-priors-regularized model, three learning-based modules are carefully designed, responsible for data-dependent initialization, highly efficient unfolding optimization, and flexible component adjustment, respectively. In particular, the proposed unfolding optimization module, which introduces two networks to adaptively fit implicit priors in a data-driven manner, achieves noise suppression and detail preservation for the decomposed components. URetinex-Net++ is an augmented version of URetinex-Net that introduces a cross-stage fusion block to alleviate the color defect of URetinex-Net. It thus improves LLIE performance in both visual quality and quantitative metrics while introducing only a few parameters and little extra runtime. Extensive experiments on real-world low-light images qualitatively and quantitatively demonstrate the effectiveness and superiority of the proposed URetinex-Net++ over state-of-the-art methods.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2545-2562. Citations: 0
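The Retinex decomposition underlying the abstract factors an image I into reflectance R and illumination L with I = R ∘ L, and enhancement adjusts L before recomposing. A minimal hand-crafted sketch of that pipeline (using the common max-over-channels illumination prior and a gamma curve; the paper instead learns the decomposition and the priors):

```python
import numpy as np

def retinex_enhance(img, gamma=2.2, eps=1e-6):
    # Simple illumination prior: max over color channels. URetinex-Net++
    # replaces this hand-crafted estimate with learned unfolding modules.
    L = img.max(axis=-1, keepdims=True)
    R = img / (L + eps)           # reflectance layer, I = R * L
    L_adj = L ** (1.0 / gamma)    # brighten the illumination layer
    return np.clip(R * L_adj, 0.0, 1.0)

dark = np.full((4, 4, 3), 0.1)    # uniformly under-exposed toy image
out = retinex_enhance(dark)
```

Because only L is modified, colors (encoded in R) are preserved while overall brightness increases.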
Filter Pruning by High-Order Spectral Clustering
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-12-31. DOI: 10.1109/TPAMI.2024.3524381
Hang Lin; Yifan Peng; Yubo Zhang; Lin Bie; Xibin Zhao; Yue Gao
Abstract: A large amount of redundancy is present in convolutional neural networks (CNNs). Identifying this redundancy and removing the redundant filters is an effective way to compress the CNN model size with minimal loss in performance. However, most existing redundancy-based pruning methods consider only the distance between pairs of filters, which can model only simple correlations. Moreover, our experimental observations and analysis show that distance-based pruning methods are not applicable to the high-dimensional features in CNN models. To tackle this issue, we propose a new pruning strategy based on high-order spectral clustering. In this approach, we use a hypergraph structure to model complex correlations among filters and obtain high-order information among filters through hypergraph structure learning. Based on this high-order information, we can better cluster the filters and remove the redundant filters in each cluster. Experiments on various CNN models and datasets demonstrate that our proposed method outperforms recent state-of-the-art works. For example, with ResNet50 we achieve a 57.1% FLOPs reduction with no accuracy drop on ImageNet, the first lossless pruning at such a high compression ratio.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2402-2415. Citations: 0
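The basic redundancy-based pruning idea, keeping one representative filter from each group of near-duplicates, can be sketched with a deliberately simplified pairwise criterion (our stand-in for illustration; the paper's contribution is precisely to replace such pairwise distances with hypergraph-based high-order clustering):

```python
import numpy as np

def prune_redundant_filters(W, thresh=0.95):
    """Keep one representative per group of near-duplicate filters.

    W: (num_filters, k*k*c_in) flattened conv filters. This greedy pairwise
    cosine test is a simplified stand-in for high-order spectral clustering."""
    F = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    keep = []
    for i in range(len(F)):
        # Drop filter i if it is nearly parallel to an already-kept filter.
        if all(abs(F[i] @ F[j]) < thresh for j in keep):
            keep.append(i)
    return keep

rng = np.random.default_rng(0)
base = rng.normal(size=(3, 27))          # 3 distinct 3x3x3 filters, flattened
W = np.vstack([base, 1.01 * base[0]])    # plus a near-duplicate of filter 0
kept = prune_redundant_filters(W)
```

The scaled copy of filter 0 is recognized as redundant and removed, while the three distinct filters survive.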
Glissando-Net: Deep Single View Category Level Pose Estimation and 3D Reconstruction
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-12-31. DOI: 10.1109/TPAMI.2024.3519674
Bo Sun; Hao Kang; Li Guan; Haoxiang Li; Philippos Mordohai; Gang Hua
Abstract: We present a deep learning model, dubbed Glissando-Net, to simultaneously estimate the pose and reconstruct the 3D shape of objects at the category level from a single RGB image. Previous works predominantly focused on either estimating poses (often at the instance level) or reconstructing shapes, but not both. Glissando-Net is composed of two jointly trained auto-encoders, one for RGB images and the other for point clouds. We embrace two key design choices to achieve a more accurate prediction of the 3D shape and pose of the object given a single RGB image as input. First, we augment the feature maps of the point cloud encoder and decoder with transformed feature maps from the image decoder, enabling effective 2D-3D interaction in both training and prediction. Second, we predict both the 3D shape and the pose of the object in the decoder stage. In this way, we better utilize the information in the 3D point clouds, present only during training, to train the network for more accurate prediction. We jointly train the two encoder-decoders for RGB and point cloud data to learn how to pass latent features to the point cloud decoder during inference; in testing, the encoder of the 3D point cloud is discarded. The design of Glissando-Net is inspired by codeSLAM. Unlike codeSLAM, which targets 3D reconstruction of scenes, we focus on pose estimation and shape reconstruction of objects, and directly predict the object pose and a pose-invariant 3D reconstruction without the need for a code optimization step. Extensive experiments, involving both ablation studies and comparisons with competing methods, demonstrate the efficacy of our proposed method, which compares favorably with the state of the art.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2298-2312. Citations: 0
JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-12-30. DOI: 10.1109/TPAMI.2024.3523675
Jiayi Ji; Haowei Wang; Changli Wu; Yiwei Ma; Xiaoshuai Sun; Rongrong Ji
Abstract: 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is of rising importance. However, the prevailing approach of straightforwardly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information degradation: aligning 3D data with only single-view 2D images and generic texts neglects the need for multi-view images and detailed subcategory texts. (2) Insufficient synergy: these strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point clouds, text, and images. Key contributions include the Structured Multimodal Organizer (SMO), which enriches vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The strong performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2475-2492. Citations: 0
Efficient Signed Graph Sampling via Balancing & Gershgorin Disc Perfect Alignment
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-12-30. DOI: 10.1109/TPAMI.2024.3524180
Chinthaka Dinesh; Gene Cheung; Saghar Bagheri; Ivan V. Bajić
Abstract: A basic premise in graph signal processing (GSP) is that a graph encoding pairwise (anti-)correlations of the targeted signal as edge weights is leveraged for graph filtering. Existing fast graph sampling schemes are designed and tested only for positive graphs describing positive correlations. However, many real-world datasets exhibit strong anti-correlations, and thus a suitable model is a signed graph, containing both positive and negative edge weights. In this paper, we propose the first linear-time method for sampling signed graphs, centered on the concept of balanced signed graphs. Specifically, given an empirical covariance data matrix $\bar{\mathbf{C}}$, we first learn a sparse inverse matrix $\mathcal{L}$, interpreted as a graph Laplacian corresponding to a signed graph $\mathcal{G}$. We approximate $\mathcal{G}$ with a balanced signed graph $\mathcal{G}^{b}$ via fast edge weight augmentation in linear time, where the eigenpairs of the Laplacian $\mathcal{L}^{b}$ for $\mathcal{G}^{b}$ are graph frequencies. Next, we select a node subset for sampling to minimize the error of the signal interpolated from samples in two steps. We first align all Gershgorin disc left-ends of the Laplacian $\mathcal{L}^{b}$ at the smallest eigenvalue $\lambda_{\min}(\mathcal{L}^{b})$ via the similarity transform $\mathcal{L}^{s} = \mathbf{S}\mathcal{L}^{b}\mathbf{S}^{-1}$, leveraging a recent linear algebra theorem called Gershgorin disc perfect alignment (GDPA). We then perform sampling on $\mathcal{L}^{s}$ using a previous fast Gershgorin disc alignment sampling (GDAS) scheme. Experiments show that our signed graph sampling method outperforms fast sampling schemes designed for positive graphs on various datasets with anti-correlations.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2330-2348. Citations: 0
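The GDPA transform above can be checked numerically: for a balanced signed-graph Laplacian, choosing $\mathbf{S} = \mathrm{diag}(1/v_1, \dots, 1/v_n)$ with $v$ the first eigenvector aligns every Gershgorin disc left-end at $\lambda_{\min}$. A small sketch on a toy balanced graph of our own construction:

```python
import numpy as np

# Balanced signed graph: nodes {0,1} joined by a positive edge, node 2
# joined to both by negative edges (2-colorable, hence balanced).
W = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0, -1.0],
              [-1.0, -1.0, 0.0]])
D = np.diag(np.abs(W).sum(axis=1))
L = D - W                          # signed graph Laplacian

lam, V = np.linalg.eigh(L)         # eigenvalues in ascending order
v = V[:, 0]                        # first eigenvector (entries nonzero here)
S = np.diag(1.0 / v)
Ls = S @ L @ np.linalg.inv(S)      # GDPA similarity transform

# Gershgorin disc left-ends: diagonal entry minus off-diagonal radius.
left = np.diag(Ls) - (np.abs(Ls).sum(axis=1) - np.abs(np.diag(Ls)))
```

All disc left-ends of `Ls` coincide with $\lambda_{\min}(\mathcal{L}^{b})$ (here 0), which is what makes the subsequent GDAS-style sampling tractable.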
Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-12-30. DOI: 10.1109/TPAMI.2024.3524377
Yifan Feng; Jiangang Huang; Shaoyi Du; Shihui Ying; Jun-Hai Yong; Yipeng Li; Guiguang Ding; Rongrong Ji; Yue Gao
Abstract: We introduce Hyper-YOLO, a new object detection method that integrates hypergraph computation to capture complex high-order correlations among visual features. Traditional YOLO models, while powerful, have limitations in their neck designs that restrict the integration of cross-level features and the exploitation of high-order feature interrelationships. To address these challenges, we propose the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework, which transposes visual feature maps into a semantic space and constructs a hypergraph for high-order message propagation. This enables the model to acquire both semantic and structural information, advancing beyond conventional feature-focused learning. Hyper-YOLO incorporates the proposed Mixed Aggregation Network (MANet) in its backbone for enhanced feature extraction and introduces the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net) in its neck. HyperC2Net operates across five scales and breaks free from traditional grid structures, allowing sophisticated high-order interactions across levels and positions. This synergy of components positions Hyper-YOLO as a state-of-the-art architecture across model scales, as evidenced by its superior performance on the COCO dataset. Specifically, Hyper-YOLO-N significantly outperforms the advanced YOLOv8-N and YOLOv9-T with 12% and 9% $\text{AP}^{val}$ improvements, respectively.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2388-2401. Citations: 0
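The abstract does not define its hypergraph message-propagation operator; as background, the standard hypergraph convolution of HGNN (Feng et al.), on which hypergraph computation methods of this kind typically build, can be sketched as follows (our illustration, not the paper's exact HGC-SCS operator):

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """One HGNN-style hypergraph convolution:
    X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta,
    with X: (n, d) node features, H: (n, e) incidence matrix,
    Theta: (d, d_out) weights. Hyperedge weights W are identity here."""
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(H.sum(axis=1)))  # node degrees
    De_inv = np.diag(1.0 / H.sum(axis=0))                # hyperedge degrees
    return Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt @ X @ Theta

# 4 nodes grouped by 2 hyperedges: {0, 1, 2} and {2, 3}.
H = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
X = np.eye(4)                      # one-hot node features
out = hypergraph_conv(X, H, np.eye(4))
```

Because a hyperedge connects any number of nodes, one propagation step mixes features among whole groups at once, which is the "high-order" interaction the abstract contrasts with pairwise grid structures.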