{"title":"A Comprehensive Survey on Deep Multimodal Learning with Missing Modality","authors":"Renjie Wu, Hu Wang, Hsiang-Ting Chen","doi":"arxiv-2409.07825","DOIUrl":"https://doi.org/arxiv-2409.07825","url":null,"abstract":"During multimodal model training and reasoning, data samples may miss certain\u0000modalities and lead to compromised model performance due to sensor limitations,\u0000cost constraints, privacy concerns, data loss, and temporal and spatial\u0000factors. This survey provides an overview of recent progress in Multimodal\u0000Learning with Missing Modality (MLMM), focusing on deep learning techniques. It\u0000is the first comprehensive survey that covers the historical background and the\u0000distinction between MLMM and standard multimodal learning setups, followed by a\u0000detailed analysis of current MLMM methods, applications, and datasets,\u0000concluding with a discussion about challenges and potential future directions\u0000in the field.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-Augmented Cross-View Imputation Network for Partial Multi-View Incomplete Multi-Label Classification","authors":"Xiaohuan Lu, Lian Zhao, Wai Keung Wong, Jie Wen, Jiang Long, Wulin Xie","doi":"arxiv-2409.07931","DOIUrl":"https://doi.org/arxiv-2409.07931","url":null,"abstract":"In real-world scenarios, multi-view multi-label learning often encounters the\u0000challenge of incomplete training data due to limitations in data collection and\u0000unreliable annotation processes. The absence of multi-view features impairs the\u0000comprehensive understanding of samples, omitting crucial details essential for\u0000classification. To address this issue, we present a task-augmented cross-view\u0000imputation network (TACVI-Net) for the purpose of handling partial multi-view\u0000incomplete multi-label classification. Specifically, we employ a two-stage\u0000network to derive highly task-relevant features to recover the missing views.\u0000In the first stage, we leverage the information bottleneck theory to obtain a\u0000discriminative representation of each view by extracting task-relevant\u0000information through a view-specific encoder-classifier architecture. In the\u0000second stage, an autoencoder based multi-view reconstruction network is\u0000utilized to extract high-level semantic representation of the augmented\u0000features and recover the missing data, thereby aiding the final classification\u0000task. Extensive experiments on five datasets demonstrate that our TACVI-Net\u0000outperforms other state-of-the-art methods.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Canine Musculoskeletal Diagnoses: Leveraging Synthetic Image Data for Pre-Training AI-Models on Visual Documentations","authors":"Martin Thißen, Thi Ngoc Diep Tran, Ben Joel Schönbein, Ute Trapp, Barbara Esteve Ratsch, Beate Egner, Romana Piat, Elke Hergenröther","doi":"arxiv-2409.08181","DOIUrl":"https://doi.org/arxiv-2409.08181","url":null,"abstract":"The examination of the musculoskeletal system in dogs is a challenging task\u0000in veterinary practice. In this work, a novel method has been developed that\u0000enables efficient documentation of a dog's condition through a visual\u0000representation. However, since the visual documentation is new, there is no\u0000existing training data. The objective of this work is therefore to mitigate the\u0000impact of data scarcity in order to develop an AI-based diagnostic support\u0000system. To this end, the potential of synthetic data that mimics realistic\u0000visual documentations of diseases for pre-training AI models is investigated.\u0000We propose a method for generating synthetic image data that mimics realistic\u0000visual documentations. Initially, a basic dataset containing three distinct\u0000classes is generated, followed by the creation of a more sophisticated dataset\u0000containing 36 different classes. Both datasets are used for the pre-training of\u0000an AI model. Subsequently, an evaluation dataset is created, consisting of 250\u0000manually created visual documentations for five different diseases. This\u0000dataset, along with a subset containing 25 examples. The obtained results on\u0000the evaluation dataset containing 25 examples demonstrate a significant\u0000enhancement of approximately 10% in diagnosis accuracy when utilizing generated\u0000synthetic images that mimic real-world visual documentations. However, these\u0000results do not hold true for the larger evaluation dataset containing 250\u0000examples, indicating that the advantages of using synthetic data for\u0000pre-training an AI model emerge primarily when dealing with few examples of\u0000visual documentations for a given disease. Overall, this work provides valuable\u0000insights into mitigating the limitations imposed by limited training data\u0000through the strategic use of generated synthetic data, presenting an approach\u0000applicable beyond the canine musculoskeletal assessment domain.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structured Pruning for Efficient Visual Place Recognition","authors":"Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan","doi":"arxiv-2409.07834","DOIUrl":"https://doi.org/arxiv-2409.07834","url":null,"abstract":"Visual Place Recognition (VPR) is fundamental for the global re-localization\u0000of robots and devices, enabling them to recognize previously visited locations\u0000based on visual inputs. This capability is crucial for maintaining accurate\u0000mapping and localization over large areas. Given that VPR methods need to\u0000operate in real-time on embedded systems, it is critical to optimize these\u0000systems for minimal resource consumption. While the most efficient VPR\u0000approaches employ standard convolutional backbones with fixed descriptor\u0000dimensions, these often lead to redundancy in the embedding space as well as in\u0000the network architecture. Our work introduces a novel structured pruning\u0000method, to not only streamline common VPR architectures but also to\u0000strategically remove redundancies within the feature embedding space. This dual\u0000focus significantly enhances the efficiency of the system, reducing both map\u0000and model memory requirements and decreasing feature extraction and retrieval\u0000latencies. Our approach has reduced memory usage and latency by 21% and 16%,\u0000respectively, across models, while minimally impacting recall@1 accuracy by\u0000less than 1%. This significant improvement enhances real-time applications on\u0000edge devices with negligible accuracy loss.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SDformer: Efficient End-to-End Transformer for Depth Completion","authors":"Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang","doi":"arxiv-2409.08159","DOIUrl":"https://doi.org/arxiv-2409.08159","url":null,"abstract":"Depth completion aims to predict dense depth maps with sparse depth\u0000measurements from a depth sensor. Currently, Convolutional Neural Network (CNN)\u0000based models are the most popular methods applied to depth completion tasks.\u0000However, despite the excellent high-end performance, they suffer from a limited\u0000representation area. To overcome the drawbacks of CNNs, a more effective and\u0000powerful method has been presented: the Transformer, which is an adaptive\u0000self-attention setting sequence-to-sequence model. While the standard\u0000Transformer quadratically increases the computational cost from the key-query\u0000dot-product of input resolution which improperly employs depth completion\u0000tasks. In this work, we propose a different window-based Transformer\u0000architecture for depth completion tasks named Sparse-to-Dense Transformer\u0000(SDformer). The network consists of an input module for the depth map and RGB\u0000image features extraction and concatenation, a U-shaped encoder-decoder\u0000Transformer for extracting deep features, and a refinement module.\u0000Specifically, we first concatenate the depth map features with the RGB image\u0000features through the input model. Then, instead of calculating self-attention\u0000with the whole feature maps, we apply different window sizes to extract the\u0000long-range depth dependencies. Finally, we refine the predicted features from\u0000the input module and the U-shaped encoder-decoder Transformer module to get the\u0000enriching depth features and employ a convolution layer to obtain the dense\u0000depth map. In practice, the SDformer obtains state-of-the-art results against\u0000the CNN-based depth completion models with lower computing loads and parameters\u0000on the NYU Depth V2 and KITTI DC datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LT3SD: Latent Trees for 3D Scene Diffusion","authors":"Quan Meng, Lei Li, Matthias Nießner, Angela Dai","doi":"arxiv-2409.08215","DOIUrl":"https://doi.org/arxiv-2409.08215","url":null,"abstract":"We present LT3SD, a novel latent diffusion model for large-scale 3D scene\u0000generation. Recent advances in diffusion models have shown impressive results\u0000in 3D object generation, but are limited in spatial extent and quality when\u0000extended to 3D scenes. To generate complex and diverse 3D scene structures, we\u0000introduce a latent tree representation to effectively encode both\u0000lower-frequency geometry and higher-frequency detail in a coarse-to-fine\u0000hierarchy. We can then learn a generative diffusion process in this latent 3D\u0000scene space, modeling the latent components of a scene at each resolution\u0000level. To synthesize large-scale scenes with varying sizes, we train our\u0000diffusion model on scene patches and synthesize arbitrary-sized output 3D\u0000scenes through shared diffusion generation across multiple scene patches.\u0000Through extensive experiments, we demonstrate the efficacy and benefits of\u0000LT3SD for large-scale, high-quality unconditional 3D scene generation and for\u0000probabilistic completion for partial scene observations.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UNIT: Unsupervised Online Instance Segmentation through Time","authors":"Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit","doi":"arxiv-2409.07887","DOIUrl":"https://doi.org/arxiv-2409.07887","url":null,"abstract":"Online object segmentation and tracking in Lidar point clouds enables\u0000autonomous agents to understand their surroundings and make safe decisions.\u0000Unfortunately, manual annotations for these tasks are prohibitively costly. We\u0000tackle this problem with the task of class-agnostic unsupervised online\u0000instance segmentation and tracking. To that end, we leverage an instance\u0000segmentation backbone and propose a new training recipe that enables the online\u0000tracking of objects. Our network is trained on pseudo-labels, eliminating the\u0000need for manual annotations. We conduct an evaluation using metrics adapted for\u0000temporal instance segmentation. Computing these metrics requires\u0000temporally-consistent instance labels. When unavailable, we construct these\u0000labels using the available 3D bounding boxes and semantic labels in the\u0000dataset. We compare our method against strong baselines and demonstrate its\u0000superiority across two different outdoor Lidar datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes","authors":"Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai","doi":"arxiv-2409.07995","DOIUrl":"https://doi.org/arxiv-2409.07995","url":null,"abstract":"RGB-D has gradually become a crucial data source for understanding complex\u0000scenes in assisted driving. However, existing studies have paid insufficient\u0000attention to the intrinsic spatial properties of depth maps. This oversight\u0000significantly impacts the attention representation, leading to prediction\u0000errors caused by attention shift issues. To this end, we propose a novel\u0000learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the\u0000effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization\u0000(Depth SAO) as offset to represent real-world spatial relationships. Secondly,\u0000the similarity in the feature space of RGB-D is learned by Depth Linear\u0000Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level.\u0000Finally, an MLP Decoder is utilized to effectively fuse multi-scale features\u0000for meeting real-time requirements. Comprehensive experiments demonstrate that\u0000the proposed DiPFormer significantly addresses the issue of attention\u0000misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% /\u0000+1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI\u0000(97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes\u0000(83.4% mIoU) datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Cost Tree Crown Dieback Estimation Using Deep Learning-Based Segmentation","authors":"M. J. Allen, D. Moreno-Fernández, P. Ruiz-Benito, S. W. D. Grieve, E. R. Lines","doi":"arxiv-2409.08171","DOIUrl":"https://doi.org/arxiv-2409.08171","url":null,"abstract":"The global increase in observed forest dieback, characterised by the death of\u0000tree foliage, heralds widespread decline in forest ecosystems. This degradation\u0000causes significant changes to ecosystem services and functions, including\u0000habitat provision and carbon sequestration, which can be difficult to detect\u0000using traditional monitoring techniques, highlighting the need for large-scale\u0000and high-frequency monitoring. Contemporary developments in the instruments and\u0000methods to gather and process data at large-scales mean this monitoring is now\u0000possible. In particular, the advancement of low-cost drone technology and deep\u0000learning on consumer-level hardware provide new opportunities. Here, we use an\u0000approach based on deep learning and vegetation indices to assess crown dieback\u0000from RGB aerial data without the need for expensive instrumentation such as\u0000LiDAR. We use an iterative approach to match crown footprints predicted by deep\u0000learning with field-based inventory data from a Mediterranean ecosystem\u0000exhibiting drought-induced dieback, and compare expert field-based crown\u0000dieback estimation with vegetation index-based estimates. We obtain high\u0000overall segmentation accuracy (mAP: 0.519) without the need for additional\u0000technical development of the underlying Mask R-CNN model, underscoring the\u0000potential of these approaches for non-expert use and proving their\u0000applicability to real-world conservation. We also find colour-coordinate based\u0000estimates of dieback correlate well with expert field-based estimation.\u0000Substituting ground truth for Mask R-CNN model predictions showed negligible\u0000impact on dieback estimates, indicating robustness. Our findings demonstrate\u0000the potential of automated data collection and processing, including the\u0000application of deep learning, to improve the coverage, speed and cost of forest\u0000dieback monitoring.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance","authors":"Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu","doi":"arxiv-2409.08091","DOIUrl":"https://doi.org/arxiv-2409.08091","url":null,"abstract":"Zero-shot subject-driven image generation aims to produce images that\u0000incorporate a subject from a given example image. The challenge lies in\u0000preserving the subject's identity while aligning with the text prompt, which\u0000often requires modifying certain aspects of the subject's appearance. Despite\u0000advancements in diffusion model based methods, existing approaches still\u0000struggle to balance identity preservation with text prompt alignment. In this\u0000study, we conducted an in-depth investigation into this issue and uncovered key\u0000insights for achieving effective identity preservation while maintaining a\u0000strong balance. Our key findings include: (1) the design of the subject image\u0000encoder significantly impacts identity preservation quality, and (2) generating\u0000an initial layout is crucial for both text alignment and identity preservation.\u0000Building on these insights, we introduce a new approach called EZIGen, which\u0000employs two main strategies: a carefully crafted subject image Encoder based on\u0000the UNet architecture of the pretrained Stable Diffusion model to ensure\u0000high-quality identity transfer, following a process that decouples the guidance\u0000stages and iteratively refines the initial image layout. Through these\u0000strategies, EZIGen achieves state-of-the-art results on multiple subject-driven\u0000benchmarks with a unified model and 100 times less training data.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}