{"title":"SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality","authors":"Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang","doi":"arxiv-2409.08083","DOIUrl":"https://doi.org/arxiv-2409.08083","url":null,"abstract":"Foundation models like ChatGPT and Sora that are trained on a huge scale of\u0000data have made a revolutionary social impact. However, it is extremely\u0000challenging for sensors in many different fields to collect similar scales of\u0000natural images to train strong foundation models. To this end, this work\u0000presents a simple and effective framework SimMAT to study an open problem: the\u0000transferability from vision foundation models trained on natural RGB images to\u0000other image modalities of different physical properties (e.g., polarization).\u0000SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained\u0000foundation model. We apply SimMAT to a representative vision foundation model\u0000Segment Anything Model (SAM) to support any evaluated new image modality. Given\u0000the absence of relevant benchmarks, we construct a new benchmark to evaluate\u0000the transfer learning performance. Our experiments confirm the intriguing\u0000potential of transferring vision foundation models in enhancing other sensors'\u0000performance. Specifically, SimMAT can improve the segmentation performance\u0000(mIoU) from 22.15% to 53.88% on average for evaluated modalities and\u0000consistently outperforms other baselines. We hope that SimMAT can raise\u0000awareness of cross-modal transfer learning and benefit various fields for\u0000better results with vision foundation models.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms","authors":"Fatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi","doi":"arxiv-2409.07989","DOIUrl":"https://doi.org/arxiv-2409.07989","url":null,"abstract":"In the context of few-shot classification, the goal is to train a classifier\u0000using a limited number of samples while maintaining satisfactory performance.\u0000However, traditional metric-based methods exhibit certain limitations in\u0000achieving this objective. These methods typically rely on a single distance\u0000value between the query feature and support feature, thereby overlooking the\u0000contribution of shallow features. To overcome this challenge, we propose a\u0000novel approach in this paper. Our approach involves utilizing multi-output\u0000embedding network that maps samples into distinct feature spaces. The proposed\u0000method extract feature vectors at different stages, enabling the model to\u0000capture both global and abstract features. By utilizing these diverse feature\u0000spaces, our model enhances its performance. Moreover, employing a\u0000self-attention mechanism improves the refinement of features at each stage,\u0000leading to even more robust representations and improved overall performance.\u0000Furthermore, assigning learnable weights to each stage significantly improved\u0000performance and results. We conducted comprehensive evaluations on the\u0000MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way\u00005-shot scenarios. Additionally, we performed a cross-domain task from\u0000MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain.\u0000These evaluations demonstrate the efficacy of our proposed method in comparison\u0000to state-of-the-art approaches. https://github.com/FatemehAskari/MSENet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder","authors":"NaHyeon Park, Kunhee Kim, Hyunjung Shim","doi":"arxiv-2409.08248","DOIUrl":"https://doi.org/arxiv-2409.08248","url":null,"abstract":"Recent breakthroughs in text-to-image models have opened up promising\u0000research avenues in personalized image generation, enabling users to create\u0000diverse images of a specific subject using natural language prompts. However,\u0000existing methods often suffer from performance degradation when given only a\u0000single reference image. They tend to overfit the input, producing highly\u0000similar outputs regardless of the text prompt. This paper addresses the\u0000challenge of one-shot personalization by mitigating overfitting, enabling the\u0000creation of controllable images through text prompts. Specifically, we propose\u0000a selective fine-tuning strategy that focuses on the text encoder. Furthermore,\u0000we introduce three key techniques to enhance personalization performance: (1)\u0000augmentation tokens to encourage feature disentanglement and alleviate\u0000overfitting, (2) a knowledge-preservation loss to reduce language drift and\u0000promote generalizability across diverse prompts, and (3) SNR-weighted sampling\u0000for efficient training. Extensive experiments demonstrate that our approach\u0000efficiently generates high-quality, diverse images using only a single\u0000reference image while significantly reducing memory and storage requirements.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Do Vision Foundation Models Enhance Domain Generalization in Medical Image Segmentation?","authors":"Kerem Cekmeceli, Meva Himmetoglu, Guney I. Tombak, Anna Susmelj, Ertunc Erdil, Ender Konukoglu","doi":"arxiv-2409.07960","DOIUrl":"https://doi.org/arxiv-2409.07960","url":null,"abstract":"Neural networks achieve state-of-the-art performance in many supervised\u0000learning tasks when the training data distribution matches the test data\u0000distribution. However, their performance drops significantly under domain\u0000(covariate) shift, a prevalent issue in medical image segmentation due to\u0000varying acquisition settings across different scanner models and protocols.\u0000Recently, foundational models (FMs) trained on large datasets have gained\u0000attention for their ability to be adapted for downstream tasks and achieve\u0000state-of-the-art performance with excellent generalization capabilities on\u0000natural images. However, their effectiveness in medical image segmentation\u0000remains underexplored. In this paper, we investigate the domain generalization\u0000performance of various FMs, including DinoV2, SAM, MedSAM, and MAE, when\u0000fine-tuned using various parameter-efficient fine-tuning (PEFT) techniques such\u0000as Ladder and Rein (+LoRA) and decoder heads. We introduce a novel decode head\u0000architecture, HQHSAM, which simply integrates elements from two\u0000state-of-the-art decoder heads, HSAM and HQSAM, to enhance segmentation\u0000performance. Our extensive experiments on multiple datasets, encompassing\u0000various anatomies and modalities, reveal that FMs, particularly with the HQHSAM\u0000decode head, improve domain generalization for medical image segmentation.\u0000Moreover, we found that the effectiveness of PEFT techniques varies across\u0000different FMs. These findings underscore the potential of FMs to enhance the\u0000domain generalization performance of neural networks in medical image\u0000segmentation across diverse clinical settings, providing a solid foundation for\u0000future research. Code and models are available for research purposes at\u0000url{https://github.com/kerem-cekmeceli/Foundation-Models-for-Medical-Imagery}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From COCO to COCO-FP: A Deep Dive into Background False Positives for COCO Detectors","authors":"Longfei Liu, Wen Guo, Shihua Huang, Cheng Li, Xi Shen","doi":"arxiv-2409.07907","DOIUrl":"https://doi.org/arxiv-2409.07907","url":null,"abstract":"Reducing false positives is essential for enhancing object detector\u0000performance, as reflected in the mean Average Precision (mAP) metric. Although\u0000object detectors have achieved notable improvements and high mAP scores on the\u0000COCO dataset, analysis reveals limited progress in addressing false positives\u0000caused by non-target visual clutter-background objects not included in the\u0000annotated categories. This issue is particularly critical in real-world\u0000applications, such as fire and smoke detection, where minimizing false alarms\u0000is crucial. In this study, we introduce COCO-FP, a new evaluation dataset\u0000derived from the ImageNet-1K dataset, designed to address this issue. By\u0000extending the original COCO validation dataset, COCO-FP specifically assesses\u0000object detectors' performance in mitigating background false positives. Our\u0000evaluation of both standard and advanced object detectors shows a significant\u0000number of false positives in both closed-set and open-set scenarios. For\u0000example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting\u0000from COCO to COCO-FP. The dataset is available at\u0000https://github.com/COCO-FP/COCO-FP.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors","authors":"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab","doi":"arxiv-2409.08278","DOIUrl":"https://doi.org/arxiv-2409.08278","url":null,"abstract":"We present DreamHOI, a novel method for zero-shot synthesis of human-object\u0000interactions (HOIs), enabling a 3D human model to realistically interact with\u0000any given object based on a textual description. This task is complicated by\u0000the varying categories and geometries of real-world objects and the scarcity of\u0000datasets encompassing diverse HOIs. To circumvent the need for extensive data,\u0000we leverage text-to-image diffusion models trained on billions of image-caption\u0000pairs. We optimize the articulation of a skinned human mesh using Score\u0000Distillation Sampling (SDS) gradients obtained from these models, which predict\u0000image-space edits. However, directly backpropagating image-space gradients into\u0000complex articulation parameters is ineffective due to the local nature of such\u0000gradients. To overcome this, we introduce a dual implicit-explicit\u0000representation of a skinned mesh, combining (implicit) neural radiance fields\u0000(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\u0000we transition between implicit and explicit forms, grounding the NeRF\u0000generation while refining the mesh articulation. We validate our approach\u0000through extensive experiments, demonstrating its effectiveness in generating\u0000realistic HOIs.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to Match 2D Keypoints Across Preoperative MR and Intraoperative Ultrasound","authors":"Hassan Rasheed, Reuben Dorent, Maximilian Fehrentz, Tina Kapur, William M. Wells III, Alexandra Golby, Sarah Frisken, Julia A. Schnabel, Nazim Haouchine","doi":"arxiv-2409.08169","DOIUrl":"https://doi.org/arxiv-2409.08169","url":null,"abstract":"We propose in this paper a texture-invariant 2D keypoints descriptor\u0000specifically designed for matching preoperative Magnetic Resonance (MR) images\u0000with intraoperative Ultrasound (US) images. We introduce a\u0000matching-by-synthesis strategy, where intraoperative US images are synthesized\u0000from MR images accounting for multiple MR modalities and intraoperative US\u0000variability. We build our training set by enforcing keypoints localization over\u0000all images then train a patient-specific descriptor network that learns\u0000texture-invariant discriminant features in a supervised contrastive manner,\u0000leading to robust keypoints descriptors. Our experiments on real cases with\u0000ground truth show the effectiveness of the proposed approach, outperforming the\u0000state-of-the-art methods and achieving 80.35% matching precision on average.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse R-CNN OBB: Ship Target Detection in SAR Images Based on Oriented Sparse Proposals","authors":"Kamirul Kamirul, Odysseas Pappas, Alin Achim","doi":"arxiv-2409.07973","DOIUrl":"https://doi.org/arxiv-2409.07973","url":null,"abstract":"We present Sparse R-CNN OBB, a novel framework for the detection of oriented\u0000objects in SAR images leveraging sparse learnable proposals. The Sparse R-CNN\u0000OBB has streamlined architecture and ease of training as it utilizes a sparse\u0000set of 300 proposals instead of training a proposals generator on hundreds of\u0000thousands of anchors. To the best of our knowledge, Sparse R-CNN OBB is the\u0000first to adopt the concept of sparse learnable proposals for the detection of\u0000oriented objects, as well as for the detection of ships in Synthetic Aperture\u0000Radar (SAR) images. The detection head of the baseline model, Sparse R-CNN, is\u0000re-designed to enable the model to capture object orientation. We also\u0000fine-tune the model on RSDD-SAR dataset and provide a performance comparison to\u0000state-of-the-art models. Experimental results shows that Sparse R-CNN OBB\u0000achieves outstanding performance, surpassing other models on both inshore and\u0000offshore scenarios. The code is available at:\u0000www.github.com/ka-mirul/Sparse-R-CNN-OBB.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization","authors":"Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang","doi":"arxiv-2409.07967","DOIUrl":"https://doi.org/arxiv-2409.07967","url":null,"abstract":"Dense-localization Audio-Visual Events (DAVE) aims to identify time\u0000boundaries and corresponding categories for events that can be heard and seen\u0000concurrently in an untrimmed video. Existing methods typically encode audio and\u0000visual representation separately without any explicit cross-modal alignment\u0000constraint. Then they adopt dense cross-modal attention to integrate multimodal\u0000information for DAVE. Thus these methods inevitably aggregate irrelevant noise\u0000and events, especially in complex and long videos, leading to imprecise\u0000detection. In this paper, we present LOCO, a Locality-aware cross-modal\u0000Correspondence learning framework for DAVE. The core idea is to explore local\u0000temporal continuity nature of audio-visual events, which serves as informative\u0000yet free supervision signals to guide the filtering of irrelevant information\u0000and inspire the extraction of complementary multimodal information during both\u0000unimodal and cross-modal learning stages. i) Specifically, LOCO applies\u0000Locality-aware Correspondence Correction (LCC) to uni-modal features via\u0000leveraging cross-modal local-correlated properties without any extra\u0000annotations. This enforces uni-modal encoders to highlight similar semantics\u0000shared by audio and visual features. ii) To better aggregate such audio and\u0000visual features, we further customize Cross-modal Dynamic Perception layer\u0000(CDP) in cross-modal feature pyramid to understand local temporal patterns of\u0000audio-visual events by imposing local consistency within multimodal features in\u0000a data-driven manner. By incorporating LCC and CDP, LOCO provides solid\u0000performance gains and outperforms existing methods for DAVE. The source code\u0000will be released.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ThermalGaussian: Thermal 3D Gaussian Splatting","authors":"Rongfeng Lu, Hangyu Chen, Zunjie Zhu, Yuhang Qin, Ming Lu, Le Zhang, Chenggang Yan, Anke Xue","doi":"arxiv-2409.07200","DOIUrl":"https://doi.org/arxiv-2409.07200","url":null,"abstract":"Thermography is especially valuable for the military and other users of\u0000surveillance cameras. Some recent methods based on Neural Radiance Fields\u0000(NeRF) are proposed to reconstruct the thermal scenes in 3D from a set of\u0000thermal and RGB images. However, unlike NeRF, 3D Gaussian splatting (3DGS)\u0000prevails due to its rapid training and real-time rendering. In this work, we\u0000propose ThermalGaussian, the first thermal 3DGS approach capable of rendering\u0000high-quality images in RGB and thermal modalities. We first calibrate the RGB\u0000camera and the thermal camera to ensure that both modalities are accurately\u0000aligned. Subsequently, we use the registered images to learn the multimodal 3D\u0000Gaussians. To prevent the overfitting of any single modality, we introduce\u0000several multimodal regularization constraints. We also develop smoothing\u0000constraints tailored to the physical characteristics of the thermal modality.\u0000Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a\u0000hand-hold thermal-infrared camera, facilitating future research on thermal\u0000scene reconstruction. We conduct comprehensive experiments to show that\u0000ThermalGaussian achieves photorealistic rendering of thermal images and\u0000improves the rendering quality of RGB images. With the proposed multimodal\u0000regularization constraints, we also reduced the model's storage cost by 90%.\u0000The code and dataset will be released.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}