{"title":"SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency","authors":"Yiping Xie, Zitong Yu, Bingjie Wu, Weicheng Xie, Linlin Shen","doi":"arxiv-2409.12040","DOIUrl":"https://doi.org/arxiv-2409.12040","url":null,"abstract":"Remote Photoplethysmography (rPPG) is a non-contact method that uses facial\u0000video to predict changes in blood volume, enabling physiological metrics\u0000measurement. Traditional rPPG models often struggle with poor generalization\u0000capacity in unseen domains. Current solutions to this problem is to improve its\u0000generalization in the target domain through Domain Generalization (DG) or\u0000Domain Adaptation (DA). However, both traditional methods require access to\u0000both source domain data and target domain data, which cannot be implemented in\u0000scenarios with limited access to source data, and another issue is the privacy\u0000of accessing source domain data. In this paper, we propose the first\u0000Source-free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which\u0000overcomes these limitations by enabling effective domain adaptation without\u0000access to source domain data. Our framework incorporates a Three-Branch\u0000Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency\u0000across domains. Furthermore, we propose a new rPPG distribution alignment loss\u0000based on the Frequency-domain Wasserstein Distance (FWD), which leverages\u0000optimal transport to align power spectrum distributions across domains\u0000effectively and further enforces the alignment of the three branches. Extensive\u0000cross-domain experiments and ablation studies demonstrate the effectiveness of\u0000our proposed method in source-free domain adaptation settings. Our findings\u0000highlight the significant contribution of the proposed FWD loss for\u0000distributional alignment, providing a valuable reference for future research\u0000and applications. The source code is available at\u0000https://github.com/XieYiping66/SFDA-rPPG","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applications of Knowledge Distillation in Remote Sensing: A Survey","authors":"Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad","doi":"arxiv-2409.12111","DOIUrl":"https://doi.org/arxiv-2409.12111","url":null,"abstract":"With the ever-growing complexity of models in the field of remote sensing\u0000(RS), there is an increasing demand for solutions that balance model accuracy\u0000with computational efficiency. Knowledge distillation (KD) has emerged as a\u0000powerful tool to meet this need, enabling the transfer of knowledge from large,\u0000complex models to smaller, more efficient ones without significant loss in\u0000performance. This review article provides an extensive examination of KD and\u0000its innovative applications in RS. KD, a technique developed to transfer\u0000knowledge from a complex, often cumbersome model (teacher) to a more compact\u0000and efficient model (student), has seen significant evolution and application\u0000across various domains. Initially, we introduce the fundamental concepts and\u0000historical progression of KD methods. The advantages of employing KD are\u0000highlighted, particularly in terms of model compression, enhanced computational\u0000efficiency, and improved performance, which are pivotal for practical\u0000deployments in RS scenarios. The article provides a comprehensive taxonomy of\u0000KD techniques, where each category is critically analyzed to demonstrate the\u0000breadth and depth of the alternative options, and illustrates specific case\u0000studies that showcase the practical implementation of KD methods in RS tasks,\u0000such as instance segmentation and object detection. Further, the review\u0000discusses the challenges and limitations of KD in RS, including practical\u0000constraints and prospective future directions, providing a comprehensive\u0000overview for researchers and practitioners in the field of RS. Through this\u0000organization, the paper not only elucidates the current state of research in KD\u0000but also sets the stage for future research opportunities, thereby contributing\u0000significantly to both academic research and real-world applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EFCM: Efficient Fine-tuning on Compressed Models for deployment of large models in medical image analysis","authors":"Shaojie Li, Zhaoshuo Diao","doi":"arxiv-2409.11817","DOIUrl":"https://doi.org/arxiv-2409.11817","url":null,"abstract":"The recent development of deep learning large models in medicine shows\u0000remarkable performance in medical image analysis and diagnosis, but their large\u0000number of parameters causes memory and inference latency challenges. Knowledge\u0000distillation offers a solution, but the slide-level gradients cannot be\u0000backpropagated for student model updates due to high-resolution pathological\u0000images and slide-level labels. This study presents an Efficient Fine-tuning on\u0000Compressed Models (EFCM) framework with two stages: unsupervised feature\u0000distillation and fine-tuning. In the distillation stage, Feature Projection\u0000Distillation (FPD) is proposed with a TransScan module for adaptive receptive\u0000field adjustment to enhance the knowledge absorption capability of the student\u0000model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM,\u0000Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are\u0000conducted on 11 downstream datasets related to three large medical models:\u0000RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The\u0000experimental results demonstrate that the EFCM framework significantly improves\u0000accuracy and efficiency in handling slide-level pathological image problems,\u0000effectively addressing the challenges of deploying large medical models.\u0000Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC\u0000compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The\u0000analysis of model inference efficiency highlights the high efficiency of the\u0000distillation fine-tuning method.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Encoding for Image Recall: Human-Like Memory","authors":"Virgile Foussereau, Robin Dumas","doi":"arxiv-2409.11750","DOIUrl":"https://doi.org/arxiv-2409.11750","url":null,"abstract":"Achieving human-like memory recall in artificial systems remains a\u0000challenging frontier in computer vision. Humans demonstrate remarkable ability\u0000to recall images after a single exposure, even after being shown thousands of\u0000images. However, this capacity diminishes significantly when confronted with\u0000non-natural stimuli such as random textures. In this paper, we present a method\u0000inspired by human memory processes to bridge this gap between artificial and\u0000biological memory systems. Our approach focuses on encoding images to mimic the\u0000high-level information retained by the human brain, rather than storing raw\u0000pixel data. By adding noise to images before encoding, we introduce variability\u0000akin to the non-deterministic nature of human memory encoding. Leveraging\u0000pre-trained models' embedding layers, we explore how different architectures\u0000encode images and their impact on memory recall. Our method achieves impressive\u0000results, with 97% accuracy on natural images and near-random performance (52%)\u0000on textures. We provide insights into the encoding process and its implications\u0000for machine learning memory systems, shedding light on the parallels between\u0000human and artificial intelligence memory mechanisms.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SymFace: Additional Facial Symmetry Loss for Deep Face Recognition","authors":"Pritesh Prakash, Koteswar Rao Jerripothula, Ashish Jacob Sam, Prinsh Kumar Singh, S Umamaheswaran","doi":"arxiv-2409.11816","DOIUrl":"https://doi.org/arxiv-2409.11816","url":null,"abstract":"Over the past decade, there has been a steady advancement in enhancing face\u0000recognition algorithms leveraging advanced machine learning methods. The role\u0000of the loss function is pivotal in addressing face verification problems and\u0000playing a game-changing role. These loss functions have mainly explored\u0000variations among intra-class or inter-class separation. This research examines\u0000the natural phenomenon of facial symmetry in the face verification problem. The\u0000symmetry between the left and right hemi faces has been widely used in many\u0000research areas in recent decades. This paper adopts this simple approach\u0000judiciously by splitting the face image vertically into two halves. With the\u0000assumption that the natural phenomena of facial symmetry can enhance face\u0000verification methodology, we hypothesize that the two output embedding vectors\u0000of split faces must project close to each other in the output embedding space.\u0000Inspired by this concept, we penalize the network based on the disparity of\u0000embedding of the symmetrical pair of split faces. Symmetrical loss has the\u0000potential to minimize minor asymmetric features due to facial expression and\u0000lightning conditions, hence significantly increasing the inter-class variance\u0000among the classes and leading to more reliable face embedding. This loss\u0000function propels any network to outperform its baseline performance across all\u0000existing network architectures and configurations, enabling us to achieve SoTA\u0000results.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model","authors":"Ludan Zhang, Xiaokang Ding, Yuqi Dai, Lei He, Keqiang Li","doi":"arxiv-2409.11969","DOIUrl":"https://doi.org/arxiv-2409.11969","url":null,"abstract":"End-to-end models are emerging as the mainstream in autonomous driving\u0000perception. However, the inability to meticulously deconstruct their internal\u0000mechanisms results in diminished development efficacy and impedes the\u0000establishment of trust. Pioneering in the issue, we present the Independent\u0000Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a\u0000novel framework that juxtaposes the module's feature maps against Ground Truth\u0000within a unified semantic Representation Space to quantify their similarity,\u0000thereby assessing the training maturity of individual functional modules. The\u0000core of the framework lies in the process of feature map encoding and\u0000representation aligning, facilitated by our proposed two-stage Alignment\u0000AutoEncoder, which ensures the preservation of salient information and the\u0000consistency of feature structure. The metric for evaluating the training\u0000maturity of functional modules, Similarity Score, demonstrates a robust\u0000positive correlation with BEV metrics, with an average correlation coefficient\u0000of 0.9387, attesting to the framework's reliability for assessment purposes.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning","authors":"Yukun Tian, Hao Chen, Yongjian Deng, Feihong Shen, Kepan Liu, Wei You, Ziyang Zhang","doi":"arxiv-2409.11813","DOIUrl":"https://doi.org/arxiv-2409.11813","url":null,"abstract":"The event camera has demonstrated significant success across a wide range of\u0000areas due to its low time latency and high dynamic range. However, the\u0000community faces challenges such as data deficiency and limited diversity, often\u0000resulting in over-fitting and inadequate feature learning. Notably, the\u0000exploration of data augmentation techniques in the event community remains\u0000scarce. This work aims to address this gap by introducing a systematic\u0000augmentation scheme named EventAug to enrich spatial-temporal diversity. In\u0000particular, we first propose Multi-scale Temporal Integration (MSTI) to\u0000diversify the motion speed of objects, then introduce Spatial-salient Event\u0000Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants.\u0000Our EventAug can facilitate models learning with richer motion patterns, object\u0000variants and local spatio-temporal relations, thus improving model robustness\u0000to varied moving speeds, occlusions, and action disruptions. Experiment results\u0000show that our augmentation method consistently yields significant improvements\u0000across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128\u0000Gesture). Our code will be publicly available for this community.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","authors":"Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin","doi":"arxiv-2409.12191","DOIUrl":"https://doi.org/arxiv-2409.12191","url":null,"abstract":"We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL\u0000models that redefines the conventional predetermined-resolution approach in\u0000visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism,\u0000which enables the model to dynamically process images of varying resolutions\u0000into different numbers of visual tokens. This approach allows the model to\u0000generate more efficient and accurate visual representations, closely aligning\u0000with human perceptual processes. The model also integrates Multimodal Rotary\u0000Position Embedding (M-RoPE), facilitating the effective fusion of positional\u0000information across text, images, and videos. We employ a unified paradigm for\u0000processing both images and videos, enhancing the model's visual perception\u0000capabilities. To explore the potential of large multimodal models, Qwen2-VL\u0000investigates the scaling laws for large vision-language models (LVLMs). By\u0000scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the\u0000amount of training data, the Qwen2-VL Series achieves highly competitive\u0000performance. Notably, the Qwen2-VL-72B model achieves results comparable to\u0000leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal\u0000benchmarks, outperforming other generalist models. Code is available at\u0000url{https://github.com/QwenLM/Qwen2-VL}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intraoperative Registration by Cross-Modal Inverse Neural Rendering","authors":"Maximilian Fehrentz, Mohammad Farid Azampour, Reuben Dorent, Hassan Rasheed, Colin Galvin, Alexandra Golby, William M. Wells, Sarah Frisken, Nassir Navab, Nazim Haouchine","doi":"arxiv-2409.11983","DOIUrl":"https://doi.org/arxiv-2409.11983","url":null,"abstract":"We present in this paper a novel approach for 3D/2D intraoperative\u0000registration during neurosurgery via cross-modal inverse neural rendering. Our\u0000approach separates implicit neural representation into two components, handling\u0000anatomical structure preoperatively and appearance intraoperatively. This\u0000disentanglement is achieved by controlling a Neural Radiance Field's appearance\u0000with a multi-style hypernetwork. Once trained, the implicit neural\u0000representation serves as a differentiable rendering engine, which can be used\u0000to estimate the surgical camera pose by minimizing the dissimilarity between\u0000its rendered images and the target intraoperative image. We tested our method\u0000on retrospective patients' data from clinical cases, showing that our method\u0000outperforms state-of-the-art while meeting current clinical standards for\u0000registration. Code and additional resources can be found at\u0000https://maxfehrentz.github.io/style-ngp/.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RopeBEV: A Multi-Camera Roadside Perception Network in Bird's-Eye-View","authors":"Jinrang Jia, Guangqi Yi, Yifeng Shi","doi":"arxiv-2409.11706","DOIUrl":"https://doi.org/arxiv-2409.11706","url":null,"abstract":"Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide\u0000application in autonomous driving. However, due to the differences between\u0000roadside and vehicle-side scenarios, there currently lacks a multi-camera BEV\u0000solution in roadside. This paper systematically analyzes the key challenges in\u0000multi-camera BEV perception for roadside scenarios compared to vehicle-side.\u0000These challenges include the diversity in camera poses, the uncertainty in\u0000Camera numbers, the sparsity in perception regions, and the ambiguity in\u0000orientation angles. In response, we introduce RopeBEV, the first dense\u0000multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the\u0000training balance issues caused by diverse camera poses. By incorporating\u0000CamMask and ROIMask (Region of Interest Mask), it supports variable camera\u0000numbers and sparse perception, respectively. Finally, camera rotation embedding\u0000is utilized to resolve orientation ambiguity. Our method ranks 1st on the\u0000real-world highway dataset RoScenes and demonstrates its practical value on a\u0000private urban dataset that covers more than 50 intersections and 600 cameras.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}