{"title":"PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba","authors":"Chaoqi Luo, Yiping Xie, Zitong Yu","doi":"arxiv-2409.12031","DOIUrl":"https://doi.org/arxiv-2409.12031","url":null,"abstract":"Facial-video based Remote photoplethysmography (rPPG) aims at measuring\u0000physiological signals and monitoring heart activity without any contact,\u0000showing significant potential in various applications. Previous deep learning\u0000based rPPG measurement are primarily based on CNNs and Transformers. However,\u0000the limited receptive fields of CNNs restrict their ability to capture\u0000long-range spatio-temporal dependencies, while Transformers also struggle with\u0000modeling long video sequences with high complexity. Recently, the state space\u0000models (SSMs) represented by Mamba are known for their impressive performance\u0000on capturing long-range dependencies from long sequences. In this paper, we\u0000propose the PhysMamba, a Mamba-based framework, to efficiently represent\u0000long-range physiological dependencies from facial videos. Specifically, we\u0000introduce the Temporal Difference Mamba block to first enhance local dynamic\u0000differences and further model the long-range spatio-temporal context. Moreover,\u0000a dual-stream SlowFast architecture is utilized to fuse the multi-scale\u0000temporal features. Extensive experiments are conducted on three benchmark\u0000datasets to demonstrate the superiority and efficiency of PhysMamba. The codes\u0000are available at https://github.com/Chaoqi31/PhysMamba","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Massively Multi-Person 3D Human Motion Forecasting with Scene Context","authors":"Felix B Mueller, Julian Tanke, Juergen Gall","doi":"arxiv-2409.12189","DOIUrl":"https://doi.org/arxiv-2409.12189","url":null,"abstract":"Forecasting long-term 3D human motion is challenging: the stochasticity of\u0000human behavior makes it hard to generate realistic human motion from the input\u0000sequence alone. Information on the scene environment and the motion of nearby\u0000people can greatly aid the generation process. We propose a scene-aware social\u0000transformer model (SAST) to forecast long-term (10s) human motion motion.\u0000Unlike previous models, our approach can model interactions between both widely\u0000varying numbers of people and objects in a scene. We combine a temporal\u0000convolutional encoder-decoder architecture with a Transformer-based bottleneck\u0000that allows us to efficiently combine motion and scene information. We model\u0000the conditional motion distribution using denoising diffusion models. We\u0000benchmark our approach on the Humans in Kitchens dataset, which contains 1 to\u000016 persons and 29 to 50 objects that are visible simultaneously. Our model\u0000outperforms other approaches in terms of realism and diversity on different\u0000metrics and in a user study. Code is available at\u0000https://github.com/felixbmuller/SAST.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ChefFusion: Multimodal Foundation Model Integrating Recipe and Food Image Generation","authors":"Peiyu Li, Xiaobao Huang, Yijun Tian, Nitesh V. Chawla","doi":"arxiv-2409.12010","DOIUrl":"https://doi.org/arxiv-2409.12010","url":null,"abstract":"Significant work has been conducted in the domain of food computing, yet\u0000these studies typically focus on single tasks such as t2t (instruction\u0000generation from food titles and ingredients), i2t (recipe generation from food\u0000images), or t2i (food image generation from recipes). None of these approaches\u0000integrate all modalities simultaneously. To address this gap, we introduce a\u0000novel food computing foundation model that achieves true multimodality,\u0000encompassing tasks such as t2t, t2i, i2t, it2t, and t2ti. By leveraging large\u0000language models (LLMs) and pre-trained image encoder and decoder models, our\u0000model can perform a diverse array of food computing-related tasks, including\u0000food understanding, food recognition, recipe generation, and food image\u0000generation. Compared to previous models, our foundation model demonstrates a\u0000significantly broader range of capabilities and exhibits superior performance,\u0000particularly in food image generation and recipe generation tasks. We\u0000open-sourced ChefFusion at GitHub.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distilling Channels for Efficient Deep Tracking","authors":"Shiming Ge, Zhao Luo, Chunhui Zhang, Yingying Hua, Dacheng Tao","doi":"arxiv-2409.11785","DOIUrl":"https://doi.org/arxiv-2409.11785","url":null,"abstract":"Deep trackers have proven success in visual tracking. Typically, these\u0000trackers employ optimally pre-trained deep networks to represent all diverse\u0000objects with multi-channel features from some fixed layers. The deep networks\u0000employed are usually trained to extract rich knowledge from massive data used\u0000in object classification and so they are capable to represent generic objects\u0000very well. However, these networks are too complex to represent a specific\u0000moving object, leading to poor generalization as well as high computational and\u0000memory costs. This paper presents a novel and general framework termed channel\u0000distillation to facilitate deep trackers. To validate the effectiveness of\u0000channel distillation, we take discriminative correlation filter (DCF) and ECO\u0000for example. We demonstrate that an integrated formulation can turn feature\u0000compression, response map generation, and model update into a unified energy\u0000minimization problem to adaptively select informative feature channels that\u0000improve the efficacy of tracking moving objects on the fly. Channel\u0000distillation can accurately extract good channels, alleviating the influence of\u0000noisy channels and generally reducing the number of channels, as well as\u0000adaptively generalizing to different channels and networks. The resulting deep\u0000tracker is accurate, fast, and has low memory requirements. Extensive\u0000experimental evaluations on popular benchmarks clearly demonstrate the\u0000effectiveness and generalizability of our framework.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ORB-SfMLearner: ORB-Guided Self-supervised Visual Odometry with Selective Online Adaptation","authors":"Yanlin Jin, Rui-Yang Ju, Haojun Liu, Yuzhong Zhong","doi":"arxiv-2409.11692","DOIUrl":"https://doi.org/arxiv-2409.11692","url":null,"abstract":"Deep visual odometry, despite extensive research, still faces limitations in\u0000accuracy and generalizability that prevent its broader application. To address\u0000these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided\u0000visual odometry with selective online adaptation named ORB-SfMLearner. We\u0000present a novel use of ORB features for learning-based ego-motion estimation,\u0000leading to more robust and accurate results. We also introduce the\u0000cross-attention mechanism to enhance the explainability of PoseNet and have\u0000revealed that driving direction of the vehicle can be explained through\u0000attention weights, marking a novel exploration in this area. To improve\u0000generalizability, our selective online adaptation allows the network to rapidly\u0000and selectively adjust to the optimal parameters across different domains.\u0000Experimental results on KITTI and vKITTI datasets show that our method\u0000outperforms previous state-of-the-art deep visual odometry methods in terms of\u0000ego-motion accuracy and generalizability.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery","authors":"BW Sheffield, Jeffrey Ellen, Ben Whitmore","doi":"arxiv-2409.12026","DOIUrl":"https://doi.org/arxiv-2409.12026","url":null,"abstract":"Side-scan sonar (SSS) imagery presents unique challenges in the\u0000classification of man-made objects on the seafloor due to the complex and\u0000varied underwater environments. Historically, experts have manually interpreted\u0000SSS images, relying on conventional machine learning techniques with\u0000hand-crafted features. While Convolutional Neural Networks (CNNs) significantly\u0000advanced automated classification in this domain, they often fall short when\u0000dealing with diverse seafloor textures, such as rocky or ripple sand bottoms,\u0000where false positive rates may increase. Recently, Vision Transformers (ViTs)\u0000have shown potential in addressing these limitations by utilizing a\u0000self-attention mechanism to capture global information in image patches,\u0000offering more flexibility in processing spatial hierarchies. This paper\u0000rigorously compares the performance of ViT models alongside commonly used CNN\u0000architectures, such as ResNet and ConvNext, for binary classification tasks in\u0000SSS imagery. The dataset encompasses diverse geographical seafloor types and is\u0000balanced between the presence and absence of man-made objects. ViT-based models\u0000exhibit superior classification performance across f1-score, precision, recall,\u0000and accuracy metrics, although at the cost of greater computational resources.\u0000CNNs, with their inductive biases, demonstrate better computational efficiency,\u0000making them suitable for deployment in resource-constrained environments like\u0000underwater vehicles. Future research directions include exploring\u0000self-supervised learning for ViTs and multi-modal fusion to further enhance\u0000performance in challenging underwater environments.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models","authors":"Lorenzo Mandelli, Stefano Berretti","doi":"arxiv-2409.11920","DOIUrl":"https://doi.org/arxiv-2409.11920","url":null,"abstract":"In this paper, we address the challenge of generating realistic 3D human\u0000motions for action classes that were never seen during the training phase. Our\u0000approach involves decomposing complex actions into simpler movements,\u0000specifically those observed during training, by leveraging the knowledge of\u0000human motion contained in GPTs models. These simpler movements are then\u0000combined into a single, realistic animation using the properties of diffusion\u0000models. Our claim is that this decomposition and subsequent recombination of\u0000simple movements can synthesize an animation that accurately represents the\u0000complex input action. This method operates during the inference phase and can\u0000be integrated with any pre-trained diffusion model, enabling the synthesis of\u0000motion classes not present in the training data. We evaluate our method by\u0000dividing two benchmark human motion datasets into basic and complex actions,\u0000and then compare its performance against the state-of-the-art.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation","authors":"Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang","doi":"arxiv-2409.11689","DOIUrl":"https://doi.org/arxiv-2409.11689","url":null,"abstract":"Pose skeleton images are an important reference in pose-controllable image\u0000generation. In order to enrich the source of skeleton images, recent works have\u0000investigated the generation of pose skeletons based on natural language. These\u0000methods are based on GANs. However, it remains challenging to perform diverse,\u0000structurally correct and aesthetically pleasing human pose skeleton generation\u0000with various textual inputs. To address this problem, we propose a framework\u0000with GUNet as the main model, PoseDiffusion. It is the first generative\u0000framework based on a diffusion model and also contains a series of variants\u0000fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates\u0000several desired properties that outperform existing methods. 1) Correct\u0000Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to\u0000incorporate graphical convolutional neural networks. It is able to learn the\u0000spatial relationships of the human skeleton by introducing skeletal information\u0000during the training process. 2) Diversity. We decouple the key points of the\u0000skeleton and characterise them separately, and use cross-attention to introduce\u0000textual conditions. Experimental results show that PoseDiffusion outperforms\u0000existing SoTA algorithms in terms of stability and diversity of text-driven\u0000pose skeleton generation. Qualitative analyses further demonstrate its\u0000superiority for controllable generation in Stable Diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MitoSeg: Mitochondria Segmentation Tool","authors":"Faris Serdar Taşel, Efe Çiftci","doi":"arxiv-2409.11974","DOIUrl":"https://doi.org/arxiv-2409.11974","url":null,"abstract":"Recent studies suggest a potential link between the physical structure of\u0000mitochondria and neurodegenerative diseases. With advances in Electron\u0000Microscopy techniques, it has become possible to visualize the boundary and\u0000internal membrane structures of mitochondria in detail. It is crucial to\u0000automatically segment mitochondria from these images to investigate the\u0000relationship between mitochondria and diseases. In this paper, we present a\u0000software solution for mitochondrial segmentation, highlighting mitochondria\u0000boundaries in electron microscopy tomography images and generating\u0000corresponding 3D meshes.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation","authors":"Dimitrios Christodoulou, Mads Kuhlmann-Jørgensen","doi":"arxiv-2409.11904","DOIUrl":"https://doi.org/arxiv-2409.11904","url":null,"abstract":"Efficiently evaluating the performance of text-to-image models is difficult\u0000as it inherently requires subjective judgment and human preference, making it\u0000hard to compare different models and quantify the state of the art. Leveraging\u0000Rapidata's technology, we present an efficient annotation framework that\u0000sources human feedback from a diverse, global pool of annotators. Our study\u0000collected over 2 million annotations across 4,512 images, evaluating four\u0000prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style\u0000preference, coherence, and text-to-image alignment. We demonstrate that our\u0000approach makes it feasible to comprehensively rank image generation models\u0000based on a vast pool of annotators and show that the diverse annotator\u0000demographics reflect the world population, significantly decreasing the risk of\u0000biases.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}