Corrigendum to "STAFFormer: Spatio-temporal adaptive fusion transformer for efficient 3D human pose estimation" [Image and Vision Computing 149 (2024) 105142]
Feng Hao, Fujin Zhong, Yunhe Wang, Hong Yu, Jun Hu, Yan Yang
Image and Vision Computing, vol. 151, Article 105305. Published 2024-11-01. DOI: 10.1016/j.imavis.2024.105305

SVC: Sight view constraint for robust point cloud registration
Yaojie Zhang, Weijun Wang, Tianlun Huang, Zhiyong Wang, Wei Feng
Image and Vision Computing, vol. 152, Article 105315. Published 2024-10-31. DOI: 10.1016/j.imavis.2024.105315
Abstract: Partial-to-partial point cloud registration (partial PCR) remains a challenging task, particularly at low overlap rates. Compared with the full-to-full registration task, we find that the objective of partial PCR is still not well defined: no metric can reliably identify the true transformation. We identify this as the most fundamental challenge in partial PCR tasks. In this paper, instead of directly seeking the optimal transformation, we propose a novel and general Sight View Constraint (SVC) to conclusively identify incorrect transformations, thereby enhancing the robustness of existing PCR methods. Extensive experiments validate the effectiveness of SVC on both indoor and outdoor scenes. On the challenging 3DLoMatch dataset, our approach increases registration recall from 78% to 82%, achieving a state-of-the-art result. This research also highlights the significance of the decision version of the partial PCR problem, which has the potential to provide novel insights into partial PCR. Code will be available at https://github.com/pppyj-m/SVC.

DeepArUco++: Improved detection of square fiducial markers in challenging lighting conditions
Rafael Berral-Soler, Rafael Muñoz-Salinas, Rafael Medina-Carnicer, Manuel J. Marín-Jiménez
Image and Vision Computing, vol. 152, Article 105313. Published 2024-10-30. DOI: 10.1016/j.imavis.2024.105313
Abstract: Fiducial markers are a computer vision tool used for object pose estimation and detection. These markers are highly useful in fields such as industry, medicine and logistics. However, optimal lighting conditions are not always available, and other factors such as blur or sensor noise can affect image quality. Classical computer vision techniques that precisely locate and decode fiducial markers often fail under difficult illumination (e.g. extreme lighting variations within the same frame). Hence, we propose DeepArUco++, a deep learning-based framework that leverages the robustness of convolutional neural networks to detect and decode markers in challenging lighting conditions. The framework is a pipeline that uses a different neural network model at each step: marker detection, corner refinement and marker decoding. Additionally, we propose a simple method for generating synthetic data to train the models in the pipeline, and we present a second, real-life dataset of ArUco markers in challenging lighting conditions used to evaluate our system. The developed method outperforms other state-of-the-art methods on these tasks and remains competitive even when tested on the datasets used to develop those methods. Code is available on GitHub: https://github.com/AVAuco/deeparuco/.

SAVE: Encoding spatial interactions for vision transformers
Xiao Ma, Zetian Zhang, Rong Yu, Zexuan Ji, Mingchao Li, Yuhan Zhang, Qiang Chen
Image and Vision Computing, vol. 152, Article 105312. Published 2024-10-30. DOI: 10.1016/j.imavis.2024.105312
Abstract: Transformers have achieved impressive performance in visual tasks. Position encoding, which equips vectors (elements of input tokens, queries, keys, or values) with sequence specificity, effectively compensates for the lack of ordering information in transformers. In this work, we first clarify that both position encoding and additional position-specific operations introduce positional information when they participate in self-attention. On this basis, most existing position encoding methods are equivalent to special affine transformations. However, such encodings lack correlations arising from the interaction of vector content. We further propose Spatial Aggregation Vector Encoding (SAVE), which employs transition matrices to recombine vectors. We design two simple yet effective modes for merging other vectors, each serving as an anchor. The aggregated vectors control spatial contextual connections by establishing two-dimensional relationships. SAVE is plug-and-play in vision transformers, even alongside other position encoding methods. Comparative results on three image classification datasets show that the proposed SAVE performs comparably to current position encoding methods. Experiments on detection tasks show that SAVE improves the downstream performance of transformer-based methods. Code is available at https://github.com/maxiao0234/SAVE.

{"title":"3DPSR: An innovative approach for pose and shape refinement in 3D human meshes from a single 2D image","authors":"Mohit Kushwaha, Jaytrilok Choudhary , Dhirendra Pratap Singh","doi":"10.1016/j.imavis.2024.105311","DOIUrl":"10.1016/j.imavis.2024.105311","url":null,"abstract":"<div><div>In the era of computer vision, 3D human models are gaining a lot of interest in the gaming industry, cloth parsing, avatar creations, and many more applications. In these fields, having a precise 3D human model with accurate shape and pose is crucial for realistic and high-quality results. We proposed an approach called 3DPSR that uses a single 2D image and reconstructs precise 3D human meshes with better alignment of pose and shape. 3DPSR is referred to as <strong>3D P</strong>ose and <strong>S</strong>hape <strong>R</strong>efinements. 3DPSR contains two modules (mesh deformation using pose-fitting and shape-fitting), in which mesh deformation using shape-fitting acts as a refinement module. Compared to existing methods, the proposed method, 3DPSR, delivers more enhanced MPVE and PA-MPJPE results, as well as more accurate 3D models of humans. 3DPSR significantly outperforms state-of-the-art human mesh reconstruction methods on challenging and standard datasets such as SURREAL, Human3.6M, and 3DPW across different scenarios with complex poses, establishing a new benchmark.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105311"},"PeriodicalIF":4.2,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142586816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CWGA-Net: Center-Weighted Graph Attention Network for 3D object detection from point clouds
Jun Shu, Qi Wu, Liang Tan, Xinyi Shu, Fengchun Wan
Image and Vision Computing, vol. 152, Article 105314. Published 2024-10-29. DOI: 10.1016/j.imavis.2024.105314
Abstract: The precision of 3D object detection from unevenly distributed outdoor point clouds is critical in autonomous driving perception systems. Current point-based detectors employ self-attention and graph convolution to establish contextual relationships between points; however, they often introduce weakly correlated, redundant information, leading to blurred geometric details and false detections. To address this issue, a novel Center-Weighted Graph Attention Network (CWGA-Net) is proposed that fuses geometric and semantic similarities to weight cross-attention scores, thereby capturing precise fine-grained geometric features. CWGA-Net first constructs and encodes local graphs between foreground points, establishing connections between points along geometric and semantic dimensions. Center-weighted cross-attention is then used to compute contextual relationships between vertices within the graph, and geometric and semantic similarities between vertices are fused to weight the attention scores, extracting strongly related geometric shape features. Finally, a cross-feature fusion module deeply fuses high- and low-resolution features to compensate for information lost during downsampling. Experiments on the KITTI and Waymo datasets demonstrate that the network achieves superior detection capability, outperforming state-of-the-art point-based single-stage methods in average precision while maintaining good speed.

{"title":"Occlusion-related graph convolutional neural network for multi-object tracking","authors":"Yubo Zhang , Liying Zheng , Qingming Huang","doi":"10.1016/j.imavis.2024.105317","DOIUrl":"10.1016/j.imavis.2024.105317","url":null,"abstract":"<div><div>Multi-Object Tracking (MOT) has recently been improved by Graph Convolutional Neural Networks (GCNNs) for its good performance in characterizing interactive features. However, GCNNs prefer assigning smaller proportions to node features if a node has more neighbors, presenting challenges in distinguishing objects with similar neighbors which is common in dense scenes. This paper designs an Occlusion-Related GCNN (OR-GCNN) based on which an interactive similarity module is further built. Specifically, the interactive similarity module first uses learnable weights to calculate the edge weights between tracklets and detection objects, which balances the appearance cosine similarity and Intersection over Union (IoU). Then, the module determines the proportion of node features with the help of an occlusion weight comes from a MultiLayer Perceptron (MLP). These occlusion weights, the edge weights, and the node features are then served to our OR-GCNN to obtain interactive features. Finally, by integrating interactive similarity into a common MOT framework, such as BoT-SORT, one gets a tracker that efficiently alleviates the issues in dense MOT task. The experimental results on MOT16 and MOT17 benchmarks show that our model achieves the MOTA of 80.6 and 81.1 and HOTA of 65.3 and 65.1 on MOT16 and MOT17, respectively, which outperforms the state-of-the-art trackers, including ByteTrack, BoT-SORT, GCNNMatch, GNMOT, and GSM.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105317"},"PeriodicalIF":4.2,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142571749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multi-label classification method based on transformer for deepfake detection","authors":"Liwei Deng , Yunlong Zhu , Dexu Zhao , Fei Chen","doi":"10.1016/j.imavis.2024.105319","DOIUrl":"10.1016/j.imavis.2024.105319","url":null,"abstract":"<div><div>With the continuous development of hardware and deep learning technologies, existing forgery techniques are capable of more refined facial manipulations, making detection tasks increasingly challenging. Therefore, forgery detection cannot be viewed merely as a traditional binary classification task. To achieve finer forgery detection, we propose a method based on multi-label detection classification capable of identifying the presence of forgery in multiple facial components. Initially, the dataset undergoes preprocessing to meet the requirements of this task. Subsequently, we introduce a Detail-Enhancing Attention Module into the network to amplify subtle forgery traces in shallow feature maps and enhance the network's feature extraction capabilities. Additionally, we employ a Global–Local Transformer Decoder to improve the network's ability to focus on local information. Finally, extensive experiments demonstrate that our approach achieves 92.45% mAP and 90.23% mAUC, enabling precise detection of facial components in images, thus validating the effectiveness of our proposed method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105319"},"PeriodicalIF":4.2,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MinoritySalMix and adaptive semantic weight compensation for long-tailed classification","authors":"Wu Zeng, Zheng-ying Xiao","doi":"10.1016/j.imavis.2024.105307","DOIUrl":"10.1016/j.imavis.2024.105307","url":null,"abstract":"<div><div>In real-world datasets, the widespread presence of a long-tailed distribution often leads models to become overly biased towards majority class samples while ignoring minority class samples. We propose a strategy called MASW (MinoritySalMix and adaptive semantic weight compensation) to improve this problem. First, we propose a data augmentation method called MinoritySalMix (minority-saliency-mixing), which uses significance detection techniques to select significant regions from minority class samples as cropping regions and paste them into the same regions of majority class samples to generate brand new samples, thereby amplifying images containing important regions of minority class samples. Second, in order to make the label value information of the newly generated samples more consistent with the image content of the newly generated samples, we propose an adaptive semantic compensation factor. This factor provides more label value compensation for minority samples based on the different cropping areas, thereby making the new label values closer to the content of the newly generated samples. Improve model performance by generating more accurate new label value information. Finally, considering that some current re-sampling strategies generally lack flexibility in handling class sampling weight allocation and frequently require manual adjustment. We designed an adaptive weight function and incorporated it into the re-sampling strategy to achieve better sampling. The experimental results on three long-tailed datasets show that our method can effectively improve the performance of the model and is superior to most advanced long-tailed methods. Furthermore, we extended MinoritySalMix’s strategy to three balanced datasets for experimentation, and the results indicated that our method surpassed several advanced data augmentation techniques.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105307"},"PeriodicalIF":4.2,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}