arXiv - CS - Computer Vision and Pattern Recognition: Latest Papers

FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07904
Authors: Rongzihan Song, Zhenyu Weng, Huiping Zhuang, Jinchang Ren, Yongming Chen, Zhiping Lin
Abstract: Multiple object tracking (MOT) involves identifying multiple targets and assigning them corresponding IDs within a video sequence, where occlusions are often encountered. Recent methods address occlusions using appearance cues, either through online learning techniques to improve adaptivity or through offline learning techniques to exploit temporal information from videos. However, most existing online learning-based MOT methods cannot learn from all past tracking information to improve adaptivity to long-term occlusions while maintaining real-time tracking speed. Offline learning methods based on temporal information, on the other hand, maintain a long-term memory to store past tracking information, but this approach restricts them to using only local past information during tracking. To address these challenges, we propose a new MOT framework called the Feature Adaptive Continual-learning Tracker (FACT), which enables real-time tracking and feature learning for targets by utilizing all past tracking information. We demonstrate that the framework can be integrated with various state-of-the-art feature-based trackers, thereby improving their tracking ability. Specifically, we develop the feature adaptive continual-learning (FAC) module, a neural network that can be trained online to learn features adaptively using all past tracking information during tracking. Moreover, we introduce a two-stage association module specifically designed for the proposed continual learning-based tracking. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art online tracking performance on the MOT17 and MOT20 benchmarks. The code will be released upon acceptance.
Citations: 0
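
The abstract describes appearance features learned online from all past tracking information, followed by a two-stage association module. The sketch below illustrates that idea in a much-simplified form: a running per-track appearance statistic stands in for the neural FAC module, and association runs appearance matching first and IoU matching second. All names, thresholds, and the running-mean substitute are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class Track:
    def __init__(self, tid, feat):
        self.tid = tid
        self.mean_feat = feat / np.linalg.norm(feat)
        self.n = 1  # number of past observations folded into the statistic

    def update(self, feat):
        # Fold in every past observation, not just a short recent window.
        feat = feat / np.linalg.norm(feat)
        self.mean_feat = (self.n * self.mean_feat + feat) / (self.n + 1)
        self.mean_feat /= np.linalg.norm(self.mean_feat)
        self.n += 1

def two_stage_associate(tracks, det_feats, iou, app_sim_thresh=0.6, iou_thresh=0.3):
    """Stage 1: appearance matching; stage 2: IoU matching for the leftovers."""
    det_feats = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    track_feats = np.stack([t.mean_feat for t in tracks])
    cost = 1.0 - track_feats @ det_feats.T               # cosine distance
    r, c = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(r, c) if cost[i, j] < 1.0 - app_sim_thresh]
    left_t = [i for i in range(len(tracks)) if i not in {m[0] for m in matches}]
    left_d = [j for j in range(det_feats.shape[0]) if j not in {m[1] for m in matches}]
    if left_t and left_d:
        r2, c2 = linear_sum_assignment(1.0 - iou[np.ix_(left_t, left_d)])
        matches += [(left_t[i], left_d[j]) for i, j in zip(r2, c2)
                    if iou[left_t[i], left_d[j]] > iou_thresh]
    return matches
```
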
Expansive Supervision for Neural Radiance Field
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08056
Authors: Weixiang Zhang, Shuzhao Xie, Shijia Ge, Wei Yao, Chen Tang, Zhi Wang
Abstract: Neural Radiance Fields have achieved success in creating powerful 3D media representations with their exceptional reconstruction capabilities. However, the computational demands of volume rendering pose significant challenges during model training. Existing acceleration techniques often involve redesigning the model architecture, leading to limitations in compatibility across different frameworks. Furthermore, these methods tend to overlook the substantial memory costs incurred. In response to these challenges, we introduce an expansive supervision mechanism that efficiently balances computational load, rendering quality, and flexibility for neural radiance field training. This mechanism operates by selectively rendering a small but crucial subset of pixels and expanding their values to estimate the error across the entire area at each iteration. Compared to conventional supervision, our method effectively bypasses redundant rendering processes, resulting in notable reductions in both time and memory consumption. Experimental results demonstrate that integrating expansive supervision within existing state-of-the-art acceleration frameworks achieves 69% memory savings and 42% time savings, with negligible compromise in visual quality.
Citations: 0
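
As a rough illustration of the mechanism the abstract outlines (render only a small but crucial subset of pixels, then expand their errors to estimate the loss over the whole image), here is a minimal PyTorch sketch. Random anchor selection and nearest-anchor expansion are assumptions made for the example; the paper's selection and expansion strategies are not specified in the abstract.

```python
import torch

def expansive_supervision_step(render_fn, rays, gt_rgb, h, w, frac=0.05):
    """rays, gt_rgb: flattened per-pixel rays and ground-truth colors (n = h * w)."""
    n = h * w
    k = max(1, int(frac * n))
    anchors = torch.randperm(n, device=gt_rgb.device)[:k]      # small pixel subset
    pred = render_fn(rays[anchors])                             # render anchors only
    anchor_err = (pred - gt_rgb[anchors]).abs().mean(dim=-1)    # per-anchor error
    # Expand anchor errors to every pixel via nearest-anchor lookup.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(n, 2).float().to(gt_rgb.device)
    nearest = torch.cdist(coords, coords[anchors]).argmin(dim=1)
    full_err_map = anchor_err.detach()[nearest].reshape(h, w)   # estimated error everywhere
    return anchor_err.mean(), full_err_map                      # training loss + error estimate
```
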
ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07966
Authors: Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak
Abstract: Audio-driven 3D facial animation synthesis has been an active field of research, attracting attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. This is mainly due to the lack of emotionally rich facial animation data and of algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of models are deterministic, meaning that given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial for generating diverse and emotionally rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion-controllable, speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and the emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground-truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method that incorporates a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (https://github.com/uuembodiedsocialai/ProbTalk3D/).
Citations: 0
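
The method is built on a two-stage VQ-VAE. The snippet below sketches only the generic vector-quantization bottleneck such a model rests on (nearest-codebook lookup, commitment loss, straight-through gradients); the codebook size, latent dimension, and loss weight are standard textbook choices, not values taken from the paper.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta

    def forward(self, z):                              # z: (batch, time, dim)
        flat = z.reshape(-1, z.shape[-1])
        # Nearest codebook entry for every latent vector.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)
        z_q = self.codebook(idx).view_as(z)
        # Commitment + codebook losses; straight-through gradient for z_q.
        loss = self.beta * ((z_q.detach() - z) ** 2).mean() + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(z.shape[:-1]), loss
```
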
Bayesian Self-Training for Semi-Supervised 3D Segmentation
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08102
Authors: Ozan Unal, Christos Sakaridis, Luc Van Gool
Abstract: 3D segmentation is a core problem in computer vision and, like many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds for fully supervised training remains too labor-intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises from the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic $n$-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation and, finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at ouenal.github.io/bst/.
Citations: 0
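
A minimal sketch of the pseudo-labeling step the abstract describes: stochastic inference yields per-point class probabilities, pseudo-labels come from the mean prediction, and points are filtered by an estimated point-wise uncertainty. Monte Carlo dropout, the entropy measure, and the threshold are assumptions standing in for whatever stochastic inference and filtering rule the paper actually uses.

```python
import torch

@torch.no_grad()
def bayesian_pseudo_labels(model, points, T=10, max_entropy=0.5):
    model.train()                        # keep dropout active for Monte Carlo sampling
    probs = torch.stack([model(points).softmax(dim=-1) for _ in range(T)])
    mean_p = probs.mean(dim=0)           # (num_points, num_classes) predictive mean
    entropy = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(dim=-1)
    labels = mean_p.argmax(dim=-1)
    keep = entropy < max_entropy         # point-wise uncertainty filter
    return labels[keep], keep            # retained pseudo-labels and their mask
```
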
Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08277
Authors: Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia
Abstract: High frame rate and accurate depth estimation play an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction by significantly reducing the streaming requirements on the depth sensor, thanks to its three core stages: i) multi-modal encoding, ii) iterative multi-modal integration, and iii) depth decoding. We present extensive evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.
Citations: 0
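
Purely as a structural sketch, the loop below mirrors the three core stages named in the abstract (multi-modal encoding, iterative multi-modal integration, depth decoding) and the asymmetry between a high-frame-rate RGB stream and a lower-frame-rate sparse depth stream. The module interfaces, the reuse of the latest sparse scan, and the fixed iteration count are all placeholders, not the paper's architecture.

```python
import torch

def depth_on_demand(rgb_frames, sparse_depths, rgb_encoder, fuse, decoder, iters=3):
    """rgb_frames arrive at high frame rate; sparse_depths at a lower rate
    (None when no new active-sensor measurement is available)."""
    last_sparse = None
    outputs = []
    for rgb, sparse in zip(rgb_frames, sparse_depths):
        if sparse is not None:
            last_sparse = sparse                  # reuse the latest sparse scan
        feat = rgb_encoder(rgb)                   # i) multi-modal encoding
        hidden = torch.zeros_like(feat)
        for _ in range(iters):                    # ii) iterative multi-modal integration
            hidden = fuse(hidden, feat, last_sparse)
        outputs.append(decoder(hidden))           # iii) dense depth decoding
    return outputs
```
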
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08077
Authors: Junsung Lee, Minsoo Kang, Bohyung Han
Abstract: We propose a simple but effective training-free approach tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a pretrained diffusion model by introducing a noise correction term. We formulate the noise correction term as the difference between two noise predictions: one is computed from the denoising network with a progressive interpolation of the source and target prompt embeddings, while the other is the noise prediction with the source prompt embedding. The final noise prediction is given by a linear combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Our approach can be easily incorporated into existing image-to-image translation methods based on diffusion models. Extensive experiments verify that the proposed technique achieves outstanding performance with low latency and consistently improves existing frameworks when combined with them.
Citations: 0
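
The abstract gives the method almost in closed form: a noise correction term equal to the difference between the prediction under a progressively interpolated prompt embedding and the prediction under the source prompt, combined linearly with the standard denoising term. A minimal sketch follows; treating the source-conditioned prediction as the standard term, the linear interpolation schedule for alpha, and the combination weight lam are assumptions about details the abstract leaves open.

```python
def corrected_noise_prediction(eps_model, x_t, t, src_emb, tgt_emb, alpha, lam=1.0):
    """alpha in [0, 1]: progressive interpolation weight along the sampling trajectory."""
    interp_emb = (1.0 - alpha) * src_emb + alpha * tgt_emb
    eps_src = eps_model(x_t, t, src_emb)          # standard denoising term (preserves source regions)
    eps_interp = eps_model(x_t, t, interp_emb)    # prediction under the interpolated prompt
    correction = eps_interp - eps_src             # noise correction term
    return eps_src + lam * correction             # linear combination of the two terms
```
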
Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07972
Authors: Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang
Abstract: The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, a grid representing features within a certain height range usually introduces many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates an explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module is deployed to enhance the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Code is available at https://github.com/yanzq95/DHD.
Citations: 0
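
A hedged sketch of Mask Guided Height Sampling as the abstract describes it: a predicted height map is decoupled into binary masks over height intervals, and each mask gates the 2D image features projected into its own subspace. The bin edges below are illustrative, not the paper's height distribution statistics.

```python
import torch

def mask_guided_height_sampling(img_feat, height_map, bin_edges=(-2.0, 0.0, 2.0, 4.0)):
    """img_feat: (B, C, H, W) image features; height_map: (B, 1, H, W) predicted heights in metres."""
    subspaces = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = ((height_map >= lo) & (height_map < hi)).float()  # binary mask for one height interval
        subspaces.append(img_feat * mask)    # features restricted to that height range
    return subspaces                         # one decoupled feature map per interval
```
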
SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08083
Authors: Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang
Abstract: Foundation models like ChatGPT and Sora, trained on data at a huge scale, have made a revolutionary social impact. However, for sensors in many different fields, it is extremely challenging to collect natural images at similar scales to train strong foundation models. To this end, this work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance. Specifically, SimMAT improves segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields in achieving better results with vision foundation models.
Citations: 0
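
At its simplest, the described setup is a trainable modality-agnostic transfer layer placed in front of a pretrained vision foundation model such as SAM. The sketch below uses a 1x1 convolution as the transfer layer and keeps the backbone frozen; both choices are assumptions for illustration rather than SimMAT's actual design space.

```python
import torch
import torch.nn as nn

class ModalityAgnosticTransfer(nn.Module):
    def __init__(self, in_channels, foundation_model, out_channels=3):
        super().__init__()
        self.mat = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # trainable transfer layer
        self.foundation = foundation_model
        for p in self.foundation.parameters():    # keep the pretrained backbone frozen (assumption)
            p.requires_grad = False

    def forward(self, x):                         # x: (B, in_channels, H, W), e.g. polarization data
        return self.foundation(self.mat(x))       # map modality into the backbone's input space
```
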
Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07989
Authors: Fatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi
Abstract: In few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective. These methods typically rely on a single distance value between the query feature and the support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach that utilizes a multi-output embedding network to map samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism refines the features at each stage, leading to more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improves performance. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed a cross-domain task from MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. Code: https://github.com/FatemehAskari/MSENet
Citations: 0
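
A hedged sketch of the scoring rule the abstract suggests: features from several backbone stages are refined with self-attention, compared with a per-stage distance, and the per-stage distances are fused with learnable weights. The number of stages, feature dimensions, token pooling, and cosine distance are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStageScorer(nn.Module):
    def __init__(self, dims=(64, 160, 320)):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d, num_heads=4, batch_first=True) for d in dims)
        self.stage_weights = nn.Parameter(torch.zeros(len(dims)))   # learnable per-stage weights

    def forward(self, support_feats, query_feats):
        """Each element of support_feats / query_feats: (N, tokens, dim) for one backbone stage."""
        dists = []
        for attn, s, q in zip(self.attn, support_feats, query_feats):
            s, _ = attn(s, s, s)                        # self-attention refinement
            q, _ = attn(q, q, q)
            s, q = s.mean(dim=1), q.mean(dim=1)         # pool tokens to one vector per sample
            dists.append(1 - F.cosine_similarity(q.unsqueeze(1), s.unsqueeze(0), dim=-1))
        w = self.stage_weights.softmax(dim=0)
        return sum(wi * d for wi, d in zip(w, dists))   # (num_query, num_support) fused distance
```
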
TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08248
Authors: NaHyeon Park, Kunhee Kim, Hyunjung Shim
Abstract: Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
Citations: 0
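
Two of the named ingredients lend themselves to a short sketch: selective fine-tuning that updates only the text encoder while the diffusion U-Net stays frozen, and a knowledge-preservation loss that penalizes drift of the tuned encoder from a frozen copy on generic prompts. The encoder interface, the MSE form of the loss, and its weight are assumptions; the abstract does not specify them.

```python
import copy
import torch
import torch.nn.functional as F

def setup_textboost(text_encoder, unet):
    """Freeze the U-Net, keep a frozen reference copy of the encoder, tune only the encoder."""
    frozen_ref = copy.deepcopy(text_encoder).eval()
    for p in frozen_ref.parameters():
        p.requires_grad = False
    for p in unet.parameters():               # diffusion backbone stays frozen
        p.requires_grad = False
    for p in text_encoder.parameters():       # only the text encoder is fine-tuned
        p.requires_grad = True
    return frozen_ref

def knowledge_preservation_loss(text_encoder, frozen_ref, generic_token_ids, weight=0.1):
    """Keep embeddings of generic prompts close to the pretrained encoder to limit language drift."""
    tuned = text_encoder(generic_token_ids)
    with torch.no_grad():
        ref = frozen_ref(generic_token_ids)
    return weight * F.mse_loss(tuned, ref)
```
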