{"title":"Search and recovery network for camouflaged object detection","authors":"Guangrui Liu, Wei Wu","doi":"10.1016/j.imavis.2024.105247","DOIUrl":"10.1016/j.imavis.2024.105247","url":null,"abstract":"<div><p>Camouflaged object detection aims to accurately identify objects blending into the background. However, existing methods often struggle, especially with small object or multiple objects, due to their reliance on singular strategies. To address this, we introduce a novel Search and Recovery Network (SRNet) using a bionic approach and auxiliary features. SRNet comprises three key modules: the Region Search Module (RSM), Boundary Recovery Module (BRM), and Camouflaged Object Predictor (COP). The RSM mimics predator behavior to locate potential object regions, enhancing object location detection. The BRM refines texture features and recovers object boundaries. The COP fuse multilevel features to predict final segmentation maps. Experimental results on three benchmark datasets show SRNet's superiority over SOTA models, particularly with small and multiple objects. Notably, SRNet achieves performance improvements without significantly increasing model parameters. Moreover, the method exhibits promising performance in downstream tasks such as defect detection, polyp segmentation and military camouflage detection.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105247"},"PeriodicalIF":4.2,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GSTGM: Graph, spatial–temporal attention and generative based model for pedestrian multi-path prediction","authors":"Muhammad Haris Kaka Khel , Paul Greaney , Marion McAfee , Sandra Moffett , Kevin Meehan","doi":"10.1016/j.imavis.2024.105245","DOIUrl":"10.1016/j.imavis.2024.105245","url":null,"abstract":"<div><p>Pedestrian trajectory prediction in urban environments has emerged as a critical research area with extensive applications across various domains. Accurate prediction of pedestrian trajectories is essential for the safe navigation of autonomous vehicles and robots in pedestrian-populated environments. Effective prediction models must capture both the spatial interactions among pedestrians and the temporal dependencies governing their movements. Existing research primarily focuses on forecasting a single trajectory per pedestrian, limiting its applicability in real-world scenarios characterised by diverse and unpredictable pedestrian behaviours. To address these challenges, this paper introduces the Graph Convolutional Network, Spatial–Temporal Attention, and Generative Model (GSTGM) for pedestrian trajectory prediction. GSTGM employs a spatiotemporal graph convolutional network to effectively capture complex interactions between pedestrians and their environment. Additionally, it integrates a spatial–temporal attention mechanism to prioritise relevant information during the prediction process. By incorporating a time-dependent prior within the latent space and utilising a computationally efficient generative model, GSTGM facilitates the generation of diverse and realistic future trajectories. The effectiveness of GSTGM is validated through experiments on real-world scenario datasets. Compared to the state-of-the-art models on benchmark datasets such as ETH/UCY, GSTGM demonstrates superior performance in accurately predicting multiple potential trajectories for individual pedestrians. This superiority is measured using metrics such as Final Displacement Error (FDE) and Average Displacement Error (ADE). Moreover, GSTGM achieves these results with significantly faster processing speeds.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105245"},"PeriodicalIF":4.2,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0262885624003500/pdfft?md5=be799dd771bacffe5a12fc1424240e2d&pid=1-s2.0-S0262885624003500-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probability based dynamic soft label assignment for object detection","authors":"Yi Li , Sile Ma , Xiangyuan Jiang , Yizhong Luan , Zecui Jiang","doi":"10.1016/j.imavis.2024.105240","DOIUrl":"10.1016/j.imavis.2024.105240","url":null,"abstract":"<div><p>By defining effective supervision labels for network training, the performance of object detectors can be improved without incurring additional inference costs. Current label assignment strategies generally require two steps: first, constructing a positive sample candidate bag, and then designing labels for these samples. However, the construction of candidate bag of positive samples may result in some noisy samples being introduced into the label assignment process. We explore a single-step label assignment approach: directly generating a probability map as labels for all samples. We design the label assignment approach from the following perspectives: Firstly, it should be able to reduce the impact of noise samples. Secondly, each sample should be treated differently because each one matches the target to a different extent, which assists the network to learn more valuable information from high-quality samples. We propose a probability-based dynamic soft label assignment method. Instead of dividing the samples into positive and negative samples, a probability map, which is calculated based on prediction quality and prior knowledge, is used to supervise all anchor points of the classification branch. The weight of prior knowledge in the labels decreases as the network improves the quality of instance predictions, as a way to reduce noise samples introduced by prior knowledge. By using continuous probability values as labels to supervise the classification branch, the network is able to focus on high-quality samples. As demonstrated in the experiments on the MS COCO benchmark, our label assignment method achieves 40.9% AP in the ResNet-50 under 1x schedule, which improves FCOS performance by approximately 2.0% AP. The code has been available at <span><span><span>https://github.com/Liyi4578/PDSLA</span></span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105240"},"PeriodicalIF":4.2,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CRENet: Crowd region enhancement network for multi-person 3D pose estimation","authors":"Zhaokun Li, Qiong Liu","doi":"10.1016/j.imavis.2024.105243","DOIUrl":"10.1016/j.imavis.2024.105243","url":null,"abstract":"<div><p>Recovering multi-person 3D poses from a single image is a challenging problem due to inherent depth ambiguities, including root-relative depth and absolute root depth. Current bottom-up methods show promising potential to mitigate absolute root depth ambiguity through explicitly aggregating global contextual cues. However, these methods treat the entire image region equally during root depth regression, ignoring the negative impact of irrelevant regions. Moreover, they learn shared features for both depths, each of which focuses on different information. This sharing mechanism may result in negative transfer, thus diminishing root depth prediction accuracy. To address these challenges, we present a novel bottom-up method, Crowd Region Enhancement Network (CRENet), incorporating a Feature Decoupling Module (FDM) and a Global Attention Module (GAM). FDM explicitly learns the discriminative feature for each depth through adaptively recalibrating its channel-wise responses and fusing multi-level features, which makes the model focus on each depth prediction separately and thus avoids the adverse effect of negative transfer. GAM highlights crowd regions while suppressing irrelevant regions using the attention mechanism and further refines the attention regions based on the confidence measure about the attention, which is beneficial to learn depth-related cues from informative crowd regions and facilitate root depth estimation. Comprehensive experiments on benchmarks MuPoTS-3D and CMU Panoptic demonstrate that our method outperforms the state-of-the-art bottom-up methods in absolute 3D pose estimation and is applicable to in-the-wild images, which also indicates that learning depth-specific features and suppressing the noise signals can significantly benefit multi-person absolute 3D pose estimation.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105243"},"PeriodicalIF":4.2,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual subspace clustering for spectral-spatial hyperspectral image clustering","authors":"Shujun Liu","doi":"10.1016/j.imavis.2024.105235","DOIUrl":"10.1016/j.imavis.2024.105235","url":null,"abstract":"<div><p>Subspace clustering supposes that hyperspectral image (HSI) pixels lie in a union vector spaces of multiple sample subspaces without considering their dual space, i.e., spectral space. In this article, we propose a promising dual subspace clustering (DualSC) for improving spectral-spatial HSIs clustering by relaxing subspace clustering. To this end, DualSC simultaneously optimizes row and column subspace-representations of HSI superpixels to capture the intrinsic connection between spectral and spatial information. From the new perspective, the original subspace clustering can be treated as a special case of DualSC that has larger solution space, so tends to finding better sample representation matrix for applying spectral clustering. Besides, we provide theoretical proofs that show the proposed method relaxes the subspace space clustering with dual subspace, and can recover subspace-sparse representation of HSI samples. To the best of our knowledge, this work could be one of the first dual clustering method leveraging sample and spectral subspaces simultaneously. As a result, we conduct several clustering experiments on four canonical data sets, implying that our proposed method with strong interpretability reaches comparable performance and computing efficiency with other state-of-the-art methods.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105235"},"PeriodicalIF":4.2,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pro-ReID: Producing reliable pseudo labels for unsupervised person re-identification","authors":"Haiming Sun, Shiwei Ma","doi":"10.1016/j.imavis.2024.105244","DOIUrl":"10.1016/j.imavis.2024.105244","url":null,"abstract":"<div><p>Mainstream unsupervised person ReIDentification (ReID) is on the basis of the alternation of clustering and fine-tuning to promote the task performance, but the clustering process inevitably produces noisy pseudo labels, which seriously constrains the further advancement of the task performance. To conquer the above concerns, the novel Pro-ReID framework is proposed to produce reliable person samples from the pseudo-labeled dataset to learn feature representations in this work. It consists of two modules: Pseudo Labels Correction (PLC) and Pseudo Labels Selection (PLS). Specifically, we further leverage the temporal ensemble prior knowledge to promote task performance. The PLC module assigns corresponding soft pseudo labels to each sample with control of soft pseudo label participation to potentially correct for noisy pseudo labels generated during clustering; the PLS module associates the predictions of the temporal ensemble model with pseudo label annotations and it detects noisy pseudo labele examples as out-of-distribution examples through the Gaussian Mixture Model (GMM) to supply reliable pseudo labels for the unsupervised person ReID task in consideration of their loss data distribution. Experimental findings validated on three person (Market-1501, DukeMTMC-reID and MSMT17) and one vehicle (VeRi-776) ReID benchmark establish that the novel Pro-ReID framework achieves competitive performance, in particular the mAP on the ambitious MSMT17 that is 4.3% superior to the state-of-the-art methods.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105244"},"PeriodicalIF":4.2,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142129876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language conditioned multi-scale visual attention networks for visual grounding","authors":"Haibo Yao, Lipeng Wang, Chengtao Cai, Wei Wang, Zhi Zhang, Xiaobing Shang","doi":"10.1016/j.imavis.2024.105242","DOIUrl":"10.1016/j.imavis.2024.105242","url":null,"abstract":"<div><p>Visual grounding (VG) is a task that requires to locate a specific region in an image according to a natural language expression. Existing efforts on the VG task are divided into two-stage, one-stage and Transformer-based methods, which have achieved high performance. However, most of the previous methods extract visual information at a single spatial scale and ignore visual information at other spatial scales, which makes these models unable to fully utilize the visual information. Moreover, the insufficient utilization of linguistic information, especially failure to capture global linguistic information, may lead to failure to fully understand language expressions, thus limiting the performance of these models. To better address the task, we propose a language conditioned multi-scale visual attention network (LMSVA) for visual grounding, which can sufficiently utilize visual and linguistic information to perform multimodal reasoning, thus improving performance of model. Specifically, we design a visual feature extractor containing a multi-scale layer to get the required multi-scale visual features by expanding the original backbone. Moreover, we exploit pooling the output of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to extract sentence-level linguistic features, which can enable the model to capture global linguistic information. Inspired by the Transformer architecture, we present the Visual Attention Layer guided by Language and Multi-Scale Visual Features (VALMS), which is able to better learn the visual context guided by multi-scale visual and linguistic features, and facilitates further multimodal reasoning. Extensive experiments on four large benchmark datasets, including ReferItGame, RefCOCO, RefCOCO<!--> <!-->+ and RefCOCOg, demonstrate that our proposed model achieves the state-of-the-art performance.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105242"},"PeriodicalIF":4.2,"publicationDate":"2024-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning facial structural dependency in 3D aligned space for face alignment","authors":"Biying Li , Zhiwei Liu , Jinqiao Wang","doi":"10.1016/j.imavis.2024.105241","DOIUrl":"10.1016/j.imavis.2024.105241","url":null,"abstract":"<div><p>Facial structure's statistical characteristics offer pivotal prior information in facial landmark prediction, forming inter-dependencies among different landmarks. Such inter-dependencies ensure that predictions adhere to the shape distribution typical of natural faces. In challenging scenarios like occlusions or extreme facial poses, this structure becomes indispensable, which can help to predict elusive landmarks based on more discernible ones. While current deep learning methods do capture these landmark dependencies, it's often an implicit process heavily reliant on vast training datasets. We contest that such implicit modeling approaches fail to manage more challenging situations. In this paper, we propose a new method that harnesses the facial structure and explicitly explores inter-dependencies among facial landmarks in an end-to-end fashion. We propose a Structural Dependency Learning Module (SDLM). It uses 3D face information to map facial features into a canonical UV space, in which the facial structure is explicitly 3D semantically aligned. Besides, to explore the global relationships between facial landmarks, we take advantage of the self-attention mechanism in the image and UV spaces. We name the proposed method Facial Structure-based Face Alignment (FSFA). FSFA reinforces the landmark structure, especially under challenging conditions. Extensive experiments demonstrate that FSFA achieves state-of-the-art performance on the WFLW, 300W, AFLW, and COFW68 datasets.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105241"},"PeriodicalIF":4.2,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142083951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning accurate monocular 3D voxel representation via bilateral voxel transformer","authors":"Tianheng Cheng , Haoyi Jiang , Shaoyu Chen , Bencheng Liao , Qian Zhang , Wenyu Liu , Xinggang Wang","doi":"10.1016/j.imavis.2024.105237","DOIUrl":"10.1016/j.imavis.2024.105237","url":null,"abstract":"<div><p>Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images is still challenging owing to the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM) to project image features to dense 3D voxels severely suffer from the ambiguous projection problem. In this research, we present <strong>Bilateral Voxel Transformer</strong> (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT exploits a bilateral architecture composed of two branches for preserving the high-resolution 3D voxel representation while aggregating contexts through the proposed Tri-Axial Transformer simultaneously. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging Semantic KITTI dataset, considerably outperforming previous works for semantic scene completion with monocular images. The code and models of BVT will be available on <span><span>GitHub</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105237"},"PeriodicalIF":4.2,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simultaneous image patch attention and pruning for patch selective transformer","authors":"Sunpil Kim , Gang-Joon Yoon , Jinjoo Song , Sang Min Yoon","doi":"10.1016/j.imavis.2024.105239","DOIUrl":"10.1016/j.imavis.2024.105239","url":null,"abstract":"<div><p>Vision transformer models provide superior performance compared to convolutional neural networks for various computer vision tasks but require increased computational overhead with large datasets. This paper proposes a patch selective vision transformer that effectively selects patches to reduce computational costs while simultaneously extracting global and local self-representative patch information to maintain performance. The inter-patch attention in the transformer encoder emphasizes meaningful features by capturing the inter-patch relationships of features, and dynamic patch pruning is applied to the attentive patches using a learnable soft threshold that measures the maximum multi-head importance scores. The proposed patch attention and pruning method provides constraints to exploit dominant feature maps in conjunction with self-attention, thus avoiding the propagation of noisy or irrelevant information. The proposed patch-selective transformer also helps to address computer vision problems such as scale, background clutter, and partial occlusion, resulting in a lightweight and general-purpose vision transformer suitable for mobile devices.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105239"},"PeriodicalIF":4.2,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142083950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}