{"title":"RopeBEV: A Multi-Camera Roadside Perception Network in Bird's-Eye-View","authors":"Jinrang Jia, Guangqi Yi, Yifeng Shi","doi":"arxiv-2409.11706","DOIUrl":"https://doi.org/arxiv-2409.11706","url":null,"abstract":"Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide\u0000application in autonomous driving. However, due to the differences between\u0000roadside and vehicle-side scenarios, there currently lacks a multi-camera BEV\u0000solution in roadside. This paper systematically analyzes the key challenges in\u0000multi-camera BEV perception for roadside scenarios compared to vehicle-side.\u0000These challenges include the diversity in camera poses, the uncertainty in\u0000Camera numbers, the sparsity in perception regions, and the ambiguity in\u0000orientation angles. In response, we introduce RopeBEV, the first dense\u0000multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the\u0000training balance issues caused by diverse camera poses. By incorporating\u0000CamMask and ROIMask (Region of Interest Mask), it supports variable camera\u0000numbers and sparse perception, respectively. Finally, camera rotation embedding\u0000is utilized to resolve orientation ambiguity. Our method ranks 1st on the\u0000real-world highway dataset RoScenes and demonstrates its practical value on a\u0000private urban dataset that covers more than 50 intersections and 600 cameras.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpheriGait: Enriching Spatial Representation via Spherical Projection for LiDAR-based Gait Recognition","authors":"Yanxi Wang, Zhigang Chang, Chen Wu, Zihao Cheng, Hongmin Gao","doi":"arxiv-2409.11869","DOIUrl":"https://doi.org/arxiv-2409.11869","url":null,"abstract":"Gait recognition is a rapidly progressing technique for the remote\u0000identification of individuals. Prior research predominantly employing 2D\u0000sensors to gather gait data has achieved notable advancements; nonetheless,\u0000they have unavoidably neglected the influence of 3D dynamic characteristics on\u0000recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly\u0000captures 3D spatial features but also diminishes the impact of lighting\u0000conditions while ensuring privacy protection.The essence of the problem lies in\u0000how to effectively extract discriminative 3D dynamic representation from point\u0000clouds.In this paper, we proposes a method named SpheriGait for extracting and\u0000enhancing dynamic features from point clouds for Lidar-based gait recognition.\u0000Specifically, it substitutes the conventional point cloud plane projection\u0000method with spherical projection to augment the perception of dynamic\u0000feature.Additionally, a network block named DAM-L is proposed to extract gait\u0000cues from the projected point cloud data. We conducted extensive experiments\u0000and the results demonstrated the SpheriGait achieved state-of-the-art\u0000performance on the SUSTech1K dataset, and verified that the spherical\u0000projection method can serve as a universal data preprocessing technique to\u0000enhance the performance of other LiDAR-based gait recognition methods,\u0000exhibiting exceptional flexibility and practicality.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent fingerprint enhancement for accurate minutiae detection","authors":"Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak","doi":"arxiv-2409.11802","DOIUrl":"https://doi.org/arxiv-2409.11802","url":null,"abstract":"Identification of suspects based on partial and smudged fingerprints,\u0000commonly referred to as fingermarks or latent fingerprints, presents a\u0000significant challenge in the field of fingerprint recognition. Although\u0000fixed-length embeddings have shown effectiveness in recognising rolled and slap\u0000fingerprints, the methods for matching latent fingerprints have primarily\u0000centred around local minutiae-based embeddings, failing to fully exploit global\u0000representations for matching purposes. Consequently, enhancing latent\u0000fingerprints becomes critical to ensuring robust identification for forensic\u0000investigations. Current approaches often prioritise restoring ridge patterns,\u0000overlooking the fine-macroeconomic details crucial for accurate fingerprint\u0000recognition. To address this, we propose a novel approach that uses generative\u0000adversary networks (GANs) to redefine Latent Fingerprint Enhancement (LFE)\u0000through a structured approach to fingerprint generation. By directly optimising\u0000the minutiae information during the generation process, the model produces\u0000enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth\u0000instances. This leads to a significant improvement in identification\u0000performance. Our framework integrates minutiae locations and orientation\u0000fields, ensuring the preservation of both local and structural fingerprint\u0000features. Extensive evaluations conducted on two publicly available datasets\u0000demonstrate our method's dominance over existing state-of-the-art techniques,\u0000highlighting its potential to significantly enhance latent fingerprint\u0000recognition accuracy in forensic applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba","authors":"Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian","doi":"arxiv-2409.12108","DOIUrl":"https://doi.org/arxiv-2409.12108","url":null,"abstract":"Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure\u0000initially designed for the treatment of early gastric cancer but is now widely\u0000used for various gastrointestinal lesions. Computer-assisted Surgery systems\u0000have played a crucial role in improving the precision and safety of ESD\u0000procedures, however, their effectiveness is limited by the accurate recognition\u0000of surgical phases. The intricate nature of ESD, with different lesion\u0000characteristics and tissue structures, presents challenges for real-time\u0000surgical phase recognition algorithms. Existing surgical phase recognition\u0000algorithms struggle to efficiently capture temporal contexts in video-based\u0000scenarios, leading to insufficient performance. To address these issues, we\u0000propose SPRMamba, a novel Mamba-based framework for ESD surgical phase\u0000recognition. SPRMamba leverages the strengths of Mamba for long-term temporal\u0000modeling while introducing the Scaled Residual TranMamba block to enhance the\u0000capture of fine-grained details, overcoming the limitations of traditional\u0000temporal models like Temporal Convolutional Networks and Transformers.\u0000Moreover, a Temporal Sample Strategy is introduced to accelerate the\u0000processing, which is essential for real-time phase recognition in clinical\u0000settings. Extensive testing on the ESD385 dataset and the cholecystectomy\u0000Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art\u0000methods and exhibits greater robustness across various surgical phase\u0000recognition tasks.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"155 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling","authors":"Lulin Zhang, Ewelina Rupnik, Tri Dung Nguyen, Stéphane Jacquemoud, Yann Klinger","doi":"arxiv-2409.12014","DOIUrl":"https://doi.org/arxiv-2409.12014","url":null,"abstract":"Understanding the anisotropic reflectance of complex Earth surfaces from\u0000satellite imagery is crucial for numerous applications. Neural radiance fields\u0000(NeRF) have become popular as a machine learning technique capable of deducing\u0000the bidirectional reflectance distribution function (BRDF) of a scene from\u0000multiple images. However, prior research has largely concentrated on applying\u0000NeRF to close-range imagery, estimating basic Microfacet BRDF models, which\u0000fall short for many Earth surfaces. Moreover, high-quality NeRFs generally\u0000require several images captured simultaneously, a rare occurrence in satellite\u0000imaging. To address these limitations, we propose BRDF-NeRF, developed to\u0000explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical\u0000BRDF model commonly employed in remote sensing. We assess our approach using\u0000two datasets: (1) Djibouti, captured in a single epoch at varying viewing\u0000angles with a fixed Sun position, and (2) Lanzhou, captured over multiple\u0000epochs with different viewing angles and Sun positions. Our results, based on\u0000only three to four satellite images for training, demonstrate that BRDF-NeRF\u0000can effectively synthesize novel views from directions far removed from the\u0000training data and produce high-quality digital surface models (DSMs).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"22 2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression","authors":"Yuan Tian, Guo Lu, Guangtao Zhai","doi":"arxiv-2409.11718","DOIUrl":"https://doi.org/arxiv-2409.11718","url":null,"abstract":"Unsupervised video semantic compression (UVSC), i.e., compressing videos to\u0000better support various analysis tasks, has recently garnered attention.\u0000However, the semantic richness of previous methods remains limited, due to the\u0000single semantic learning objective, limited training data, etc. To address\u0000this, we propose to boost the UVSC task by absorbing the off-the-shelf rich\u0000semantics from VFMs. Specifically, we introduce a VFMs-shared semantic\u0000alignment layer, complemented by VFM-specific prompts, to flexibly align\u0000semantics between the compressed video and various VFMs. This allows different\u0000VFMs to collaboratively build a mutually-enhanced semantic space, guiding the\u0000learning of the compression model. Moreover, we introduce a dynamic\u0000trajectory-based inter-frame compression scheme, which first estimates the\u0000semantic trajectory based on the historical content, and then traverses along\u0000the trajectory to predict the future semantics as the coding context. This\u0000reduces the overall bitcost of the system, further improving the compression\u0000efficiency. Our approach outperforms previous coding methods on three\u0000mainstream tasks and six datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance","authors":"Jaehoon Joo, Taejin Jeong, Seongjae Hwang","doi":"arxiv-2409.12099","DOIUrl":"https://doi.org/arxiv-2409.12099","url":null,"abstract":"Understanding how humans process visual information is one of the crucial\u0000steps for unraveling the underlying mechanism of brain activity. Recently, this\u0000curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI\u0000data from visual stimuli, it aims to reconstruct the corresponding visual\u0000stimuli. Surprisingly, leveraging powerful generative models such as the Latent\u0000Diffusion Model (LDM) has shown promising results in reconstructing complex\u0000visual stimuli such as high-resolution natural images from vision datasets.\u0000Despite the impressive structural fidelity of these reconstructions, they often\u0000lack details of small objects, ambiguous shapes, and semantic nuances.\u0000Consequently, the incorporation of additional semantic knowledge, beyond mere\u0000visuals, becomes imperative. In light of this, we exploit how modern LDMs\u0000effectively incorporate multi-modal guidance (text guidance, visual guidance,\u0000and image layout) for structurally and semantically plausible image\u0000generations. Specifically, inspired by the two-streams hypothesis suggesting\u0000that perceptual and semantic information are processed in different brain\u0000regions, our framework, Brain-Streams, maps fMRI signals from these brain\u0000regions to appropriate embeddings. That is, by extracting textual guidance from\u0000semantic information regions and visual guidance from perceptual information\u0000regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We\u0000validate the reconstruction ability of Brain-Streams both quantitatively and\u0000qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI\u0000data.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework","authors":"Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui","doi":"arxiv-2409.11749","DOIUrl":"https://doi.org/arxiv-2409.11749","url":null,"abstract":"3D Multi-Object Tracking (MOT) obtains significant performance improvements\u0000with the rapid advancements in 3D object detection, particularly in\u0000cost-effective multi-camera setups. However, the prevalent end-to-end training\u0000approach for multi-camera trackers results in detector-specific models,\u0000limiting their versatility. Moreover, current generic trackers overlook the\u0000unique features of multi-camera detectors, i.e., the unreliability of motion\u0000observations and the feasibility of visual information. To address these\u0000challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors.\u0000Following the Tracking-By-Detection framework, RockTrack is compatible with\u0000various off-the-shelf detectors. RockTrack incorporates a confidence-guided\u0000preprocessing module to extract reliable motion and image observations from\u0000distinct representation spaces from a single detector. These observations are\u0000then fused in an association module that leverages geometric and appearance\u0000cues to minimize mismatches. The resulting matches are propagated through a\u0000staged estimation process, forming the basis for heuristic noise modeling.\u0000Additionally, we introduce a novel appearance similarity metric for explicitly\u0000characterizing object affinities in multi-camera settings. RockTrack achieves\u0000state-of-the-art performance on the nuScenes vision-only tracking leaderboard\u0000with 59.1% AMOTA while demonstrating impressive computational efficiency.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tracking Any Point with Frame-Event Fusion Network at High Frame Rate","authors":"Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu","doi":"arxiv-2409.11953","DOIUrl":"https://doi.org/arxiv-2409.11953","url":null,"abstract":"Tracking any point based on image frames is constrained by frame rates,\u0000leading to instability in high-speed scenarios and limited generalization in\u0000real-world applications. To overcome these limitations, we propose an\u0000image-event fusion point tracker, FE-TAP, which combines the contextual\u0000information from image frames with the high temporal resolution of events,\u0000achieving high frame rate and robust point tracking under various challenging\u0000conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to\u0000model the image generation process guided by events. This module can\u0000effectively integrate valuable information from both modalities operating at\u0000different frequencies. To achieve smoother point trajectories, we employed a\u0000transformer-based refinement strategy that updates the point's trajectories and\u0000features iteratively. Extensive experiments demonstrate that our method\u0000outperforms state-of-the-art approaches, particularly improving expected\u0000feature age by 24$%$ on EDS datasets. Finally, we qualitatively validated the\u0000robustness of our algorithm in real driving scenarios using our custom-designed\u0000high-resolution image-event synchronization device. Our source code will be\u0000released at https://github.com/ljx1002/FE-TAP.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LEMON: Localized Editing with Mesh Optimization and Neural Shaders","authors":"Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach","doi":"arxiv-2409.12024","DOIUrl":"https://doi.org/arxiv-2409.12024","url":null,"abstract":"In practical use cases, polygonal mesh editing can be faster than generating\u0000new ones, but it can still be challenging and time-consuming for users.\u0000Existing solutions for this problem tend to focus on a single task, either\u0000geometry or novel view synthesis, which often leads to disjointed results\u0000between the mesh and view. In this work, we propose LEMON, a mesh editing\u0000pipeline that combines neural deferred shading with localized mesh\u0000optimization. Our approach begins by identifying the most important vertices in\u0000the mesh for editing, utilizing a segmentation model to focus on these key\u0000regions. Given multi-view images of an object, we optimize a neural shader and\u0000a polygonal mesh while extracting the normal map and the rendered image from\u0000each view. By using these outputs as conditioning data, we edit the input\u0000images with a text-to-image diffusion model and iteratively update our dataset\u0000while deforming the mesh. This process results in a polygonal mesh that is\u0000edited according to the given text instruction, preserving the geometric\u0000characteristics of the initial mesh while focusing on the most significant\u0000areas. We evaluate our pipeline using the DTU dataset, demonstrating that it\u0000generates finely-edited meshes more rapidly than the current state-of-the-art\u0000methods. We include our code and additional results in the supplementary\u0000material.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}