Markerless multi-view 3D human pose estimation: A survey
Ana Filipa Rodrigues Nogueira, Hélder P. Oliveira, Luís F. Teixeira
Image and Vision Computing, vol. 155, Article 105437. DOI: 10.1016/j.imavis.2025.105437. Published 2025-02-03.

3D human pose estimation aims to reconstruct the human skeleton of every individual in a scene by detecting several body joints. Accurate and efficient methods are required for many real-world applications, including animation, human–robot interaction, surveillance systems and sports. However, obstacles such as occlusions, arbitrary camera perspectives and the scarcity of 3D-labelled data hamper model performance and limit deployment in real-world scenarios. The growing availability of cameras has led researchers to explore multi-view solutions, which can exploit different perspectives to reconstruct the pose.

Most existing reviews focus mainly on monocular 3D human pose estimation, and a comprehensive survey dedicated to multi-view approaches has been missing since 2012. This survey aims to fill that gap by presenting an overview of methodologies for 3D pose estimation in multi-view settings, examining the strategies used to address the various challenges and identifying their limitations. The reviewed articles show that most methods are fully supervised approaches based on geometric constraints. Nonetheless, most methods suffer from 2D pose mismatches; incorporating temporal consistency and depth information has been suggested to reduce the impact of this limitation, while working directly with 3D features avoids the problem entirely, at the expense of higher computational complexity. Methods with lower levels of supervision were identified as a way to overcome some issues related to 3D pose, particularly the scarcity of labelled datasets. No method is yet capable of solving all the challenges associated with reconstructing the 3D pose, and because of the trade-off between complexity and performance, the best method depends on the application scenario. Further research is therefore required to develop an approach that can quickly infer a highly accurate 3D pose at a bearable computational cost. Techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information and multi-modal approaches are promising strategies to keep in mind when developing new methodologies for this task.

Synthesizing multilevel abstraction ear sketches for enhanced biometric recognition
David Freire-Obregón, Joao Neves, Žiga Emeršič, Blaž Meden, Modesto Castrillón-Santana, Hugo Proença
Image and Vision Computing, vol. 154, Article 105424. DOI: 10.1016/j.imavis.2025.105424. Published 2025-02-01.

Sketch understanding poses unique challenges for general-purpose vision algorithms because of the sparse and semantically ambiguous nature of sketches. This paper introduces a novel approach to biometric recognition that leverages sketch-based representations of ears, a largely unexplored but promising area in biometric research. Specifically, we address the "sketch-2-image" matching problem by synthesizing ear sketches at multiple abstraction levels, using a triplet-loss function adapted to integrate these levels. The abstraction level is determined by the number of strokes used, with fewer strokes reflecting higher abstraction. Our methodology combines sketch representations across abstraction levels to improve robustness and generalizability in matching. Extensive evaluations on four ear datasets (AMI, AWE, IITDII, and BIPLab) with various pre-trained neural network backbones show consistently superior performance over state-of-the-art methods. These results highlight the potential of ear sketch-based recognition, with cross-dataset tests confirming its adaptability to real-world conditions and suggesting applicability beyond ear biometrics.
{"title":"Multiangle feature fusion network for style transfer","authors":"Zhenshan Hu, Bin Ge, Chenxing Xia","doi":"10.1016/j.imavis.2024.105386","DOIUrl":"10.1016/j.imavis.2024.105386","url":null,"abstract":"<div><div>In recent years, arbitrary style transfer has gained a lot of attention from researchers. Although existing methods achieve good results, the generated images are usually biased towards styles, resulting in images with artifacts and repetitive patterns. To address the above problems, we propose a multi-angle feature fusion network for style transfer (MAFST). MAFST consists of a Multi-Angle Feature Fusion module (MAFF), a Multi-Scale Style Capture module (MSSC), multi-angle loss, and a content temporal consistency loss. MAFF can process the captured features from channel level and pixel level, and feature fusion is performed both locally and globally. MSSC processes the shallow style features and optimize generated images. To guide the model to focus on local features, we introduce a multi-angle loss. The content temporal consistency loss extends image style transfer to video style transfer. Extensive experiments have demonstrated that our proposed MAFST can effectively avoid images with artifacts and repetitive patterns. MAFST achieves advanced performance.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105386"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

HSIRMamba: An effective feature learning for hyperspectral image classification using residual Mamba
Rajat Kumar Arya, Siddhant Jain, Pratik Chattopadhyay, Rajeev Srivastava
Image and Vision Computing, vol. 154, Article 105387. DOI: 10.1016/j.imavis.2024.105387. Published 2025-02-01.

Deep learning models have recently demonstrated outstanding results in classifying hyperspectral images (HSI). Among them, the Transformer has received increasing interest because of its superior ability to model the long-range dependence of spatial-spectral information in HSI. However, its self-attention mechanism gives the Transformer quadratic computational complexity, which makes it heavier than other models and limits its application to HSI processing. Fortunately, the recently developed state space model Mamba offers excellent computational efficiency while achieving Transformer-like modeling capabilities. We therefore propose an enhanced Mamba-based model called HSIRMamba, which integrates residual operations into the Mamba architecture, combining the power of Mamba and the residual network to extract the spectral properties of HSI more effectively. It also includes a concurrent dedicated block for spatial analysis using a convolutional neural network. HSIRMamba extracts more accurate features at low computational cost, making it more powerful than Transformer-based models. HSIRMamba was tested on three widely used HSI datasets: Indian Pines, Pavia University, and Houston 2013. The experimental results demonstrate that the proposed method achieves competitive results compared to state-of-the-art methods.
{"title":"CPFusion: A multi-focus image fusion method based on closed-loop regularization","authors":"Hao Zhai, Peng Chen, Nannan Luo, Qinyu Li, Ping Yu","doi":"10.1016/j.imavis.2024.105399","DOIUrl":"10.1016/j.imavis.2024.105399","url":null,"abstract":"<div><div>The purpose of Multi-Focus Image Fusion (MFIF) is to extract the clear portions from multiple blurry images with complementary features to obtain a fully focused image, which is considered a prerequisite for other advanced visual tasks. With the development of deep learning technologies, significant breakthroughs have been achieved in multi-focus image fusion. However, most existing methods still face challenges related to detail information loss and misjudgment in boundary regions. In this paper, we propose a method called CPFusion for MFIF. On one hand, to fully preserve all detail information from the source images, we utilize an Invertible Neural Network (INN) for feature information transfer. The strong feature retention capability of INN allows for better preservation of the complementary features of the source images. On the other hand, to enhance the network’s performance in image fusion, we design a closed-loop structure to guide the fusion process. Specifically, during the training process, the forward operation of the network is used to learn the mapping from source images to fused images and decision maps, while the backward operation simulates the degradation of the focused image back to the source images. The backward operation serves as an additional constraint to guide the performance of the network’s forward operation. To achieve more natural fusion results, our network simultaneously generates an initial fused image and a decision map, utilizing the decision map to retain the details of the source images, while the initial fused image is employed to improve the visual effects of the decision map fusion method in boundary regions. Extensive experimental results demonstrate that the proposed method achieves excellent results in both subjective visual quality and objective metric assessments.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105399"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

An Active Transfer Learning framework for image classification based on Maximum Differentiation Classifier
Peng Zan, Yuerong Wang, Haohao Hu, Wanjun Zhong, Tianyu Han, Jingwei Yue
Image and Vision Computing, vol. 154, Article 105401. DOI: 10.1016/j.imavis.2024.105401. Published 2025-02-01.

Deep learning has been widely adopted across various domains with satisfactory outcomes, but it relies heavily on extensive labeled datasets, and collecting labels is expensive and time-consuming. We propose a novel framework called Active Transfer Learning (ATL) to address this issue. The ATL framework combines Active Learning (AL) and Transfer Learning (TL). AL queries the unlabeled samples with high inconsistency using a Maximum Differentiation Classifier (MDC); the MDC exploits the discrepancy between the data and their augmentations to select and annotate the most informative samples. We also explore the potential of incorporating TL techniques, which comprise pre-training and fine-tuning: the former learns knowledge from the origin-augmentation domain to pre-train the model, while the latter leverages the acquired knowledge for the downstream tasks. The results indicate that TL and AL have complementary effects, and the proposed ATL framework outperforms state-of-the-art methods in terms of accuracy, precision, recall, and F1-score.
{"title":"Unmasking deepfakes: Eye blink pattern analysis using a hybrid LSTM and MLP-CNN model","authors":"Ruchika Sharma, Rudresh Dwivedi","doi":"10.1016/j.imavis.2024.105370","DOIUrl":"10.1016/j.imavis.2024.105370","url":null,"abstract":"<div><div>Recent progress in the field of computer vision incorporates robust tools for creating convincing deepfakes. Hence, the propagation of fake media may have detrimental effects on social communities, potentially tarnishing the reputation of individuals or groups. Furthermore, this phenomenon may manipulate public sentiments and skew opinions about the affected entities. Recent research determines Convolution Neural Networks (CNNs) as a viable solution for detecting deepfakes within the networks. However, existing techniques struggle to accurately capture the differences between frames in the collected media streams. To alleviate these limitations, our work proposes a new Deepfake detection approach using a hybrid model using the Multi-layer Perceptron Convolution Neural Network (MLP-CNN) model and LSTM (Long Short Term Memory). Our model has utilized Contrast Limited Adaptive Histogram Equalization (CLAHE) (Musa et al., 2018) approach to enhance the contrast of the image and later on applying Viola Jones Algorithm (VJA) (Paul et al., 2018) to the preprocessed image for detecting the face. The extracted features such as Improved eye blinking pattern detection (IEBPD), active shape model (ASM), face attributes, and eye attributes features along with the age and gender of the corresponding image are fed to the hybrid deepfake detection model that involves two classifiers MLP-CNN and LSTM model. The proposed model is trained with these features to detect the deepfake images proficiently. The experimentation demonstrates that our proposed hybrid model has been evaluated on two datasets, i.e. World Leader Dataset (WLDR) and the DeepfakeTIMIT Dataset. From the experimental results, it is affirmed that our proposed hybrid model outperforms existing approaches such as DeepVision, DNN (Deep Neutral Network), CNN (Convolution Neural Network), RNN (Recurrent Neural network), DeepMaxout, DBN (Deep Belief Networks), and Bi-GRU (Bi-Directional Gated Recurrent Unit).</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105370"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

3D human avatar reconstruction with neural fields: A recent survey
Meiying Gu, Jiahe Li, Yuchen Wu, Haonan Luo, Jin Zheng, Xiao Bai
Image and Vision Computing, vol. 154, Article 105341. DOI: 10.1016/j.imavis.2024.105341. Published 2025-02-01.

3D human avatar reconstruction aims to recover the 3D geometric shape and appearance of the human body from various data inputs, such as images, videos, and depth information, and is a key component of human-oriented 3D vision in the metaverse. With the progress of neural fields for 3D reconstruction in recent years, significant advances have been made in shape accuracy and appearance quality, and substantial efforts on dynamic avatars represented with neural fields have demonstrated their effectiveness. Nevertheless, challenges remain in in-the-wild and complex environments, detailed shape recovery, and interactivity in real-world applications. In this survey, we present a comprehensive overview of 3D human avatar reconstruction methods using advanced neural fields. We start by introducing the background of 3D human avatar reconstruction and the mainstream neural-field paradigms. Representative studies are then classified by their representation and by the avatar parts they model, with detailed discussion. We also summarize the commonly used datasets, evaluation metrics, and results in this research area. Finally, we discuss open problems and highlight promising future directions, hoping to inspire novel ideas and promote further research in this area.

Hierarchical spatiotemporal Feature Interaction Network for video saliency prediction
Yingjie Jin, Xiaofei Zhou, Zhenjie Zhang, Hao Fang, Ran Shi, Xiaobin Xu
Image and Vision Computing, vol. 154, Article 105413. DOI: 10.1016/j.imavis.2025.105413. Published 2025-02-01.

Transformers can build effective long-range dependencies and have been used for video saliency prediction, yet few works have been devoted to designing Transformer-based models for this task, and existing Transformer-based models do not sufficiently exploit multi-level Transformer features. To address this limitation, we present a novel Hierarchical Spatiotemporal Feature Interaction Network (HSFI-Net), which involves three crucial steps: multi-scale feature integration, hierarchical feature enhancement, and semantic-guided saliency prediction. First, the multi-level Transformer-based spatiotemporal features are merged step by step using multi-scale feature integration (MFI) units. Each MFI unit successively splits and cross-concatenates features, promoting the interaction of features from different levels, and endows them with multi-scale temporal receptive fields via 3D convolutions with different temporal kernel sizes. Second, a temporal-extended feature enhancement (TFE) unit and a channel-correlated feature enhancement (CFE) unit perform hierarchical feature enhancement, learning rich contextual information along the temporal and channel dimensions, respectively, and providing powerful representations of visual attention regions in videos. Last, we design a semantic-guided saliency prediction (SSP) module that consolidates the multi-level spatiotemporal features into the final saliency map, with the semantic information serving as a filter that purifies the fused spatiotemporal feature. Extensive experiments on four challenging video saliency datasets (DHF1K, Hollywood-2, UCF, and DIEM) clearly demonstrate that our saliency model outperforms state-of-the-art methods. The code is available at https://github.com/JYJPush/HSFI-Net.

Scene flow estimation from point cloud based on grouped relative self-attention
Xuezhi Xiang, Xiankun Zhou, Yingxin Wei, Xi Wang, Yulong Qiao
Image and Vision Computing, vol. 154, Article 105368. DOI: 10.1016/j.imavis.2024.105368. Published 2025-02-01.

3D scene flow estimation is a fundamental task in computer vision that aims to estimate the 3D motion of point clouds. Point clouds are unordered, and the point density within local regions of the same object is non-uniform, so the features extracted by previous methods are not discriminative enough to obtain accurate scene flow. In addition, scene flow may be misestimated when two adjacent frames of point clouds exhibit large motions. From our observation, the quality of point cloud feature extraction and the correlations between the two frames directly affect the accuracy of scene flow estimation. We therefore propose an improved self-attention structure, Grouped Relative Self-Attention (GRSA), which simultaneously applies a grouping operation and an offset-subtraction operation with normalization refinement to process point clouds. Specifically, we embed GRSA into feature extraction and into each stage of flow refinement to obtain lightweight but efficient self-attention, which extracts discriminative point features and enhances point correlations so that the model is more robust to long-distance motions. Furthermore, we use a comprehensive loss function to avoid outliers and obtain robust results. We evaluate our method on the FlyingThings3D and KITTI datasets and achieve superior performance. In particular, our method outperforms all other methods on the FlyingThings3D dataset, where the Outliers metric improves by 16.9%; on the KITTI dataset, the Outliers metric improves by 6.7%.