{"title":"Probability based dynamic soft label assignment for object detection","authors":"Yi Li , Sile Ma , Xiangyuan Jiang , Yizhong Luan , Zecui Jiang","doi":"10.1016/j.imavis.2024.105240","DOIUrl":"10.1016/j.imavis.2024.105240","url":null,"abstract":"<div><p>By defining effective supervision labels for network training, the performance of object detectors can be improved without incurring additional inference costs. Current label assignment strategies generally require two steps: first, constructing a candidate bag of positive samples, and then designing labels for these samples. However, constructing the candidate bag of positive samples may introduce noisy samples into the label assignment process. We explore a single-step label assignment approach: directly generating a probability map as labels for all samples. We design the approach from two perspectives: first, it should reduce the impact of noisy samples; second, each sample should be treated differently, because each matches the target to a different extent, which helps the network learn more valuable information from high-quality samples. We propose a probability-based dynamic soft label assignment method. Instead of dividing the samples into positives and negatives, a probability map, calculated from prediction quality and prior knowledge, is used to supervise all anchor points of the classification branch. The weight of prior knowledge in the labels decreases as the network improves the quality of its instance predictions, which reduces the noise introduced by prior knowledge. By using continuous probability values as labels to supervise the classification branch, the network is able to focus on high-quality samples. 
As demonstrated by experiments on the MS COCO benchmark, our label assignment method achieves 40.9% AP with a ResNet-50 backbone under the 1x schedule, improving FCOS by approximately 2.0% AP. The code is available at <span><span><span>https://github.com/Liyi4578/PDSLA</span></span></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105240"},"PeriodicalIF":4.2,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
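The abstract above does not give the exact label formula. As a minimal sketch of the idea, assuming a Gaussian prior on the anchor point's distance to the object center and IoU as the prediction-quality term (the function name, `sigma`, and the linear decay schedule are all hypothetical, not the paper's):

```python
import math

def soft_label(pred_iou, center_dist, epoch, max_epoch, sigma=2.5):
    # Prior term: Gaussian on the anchor point's distance to the object
    # center (a hypothetical choice of "prior knowledge").
    prior = math.exp(-center_dist ** 2 / (2 * sigma ** 2))
    # Prior weight decays as training progresses, so the label is
    # increasingly dominated by the predicted-box quality (IoU).
    w = 1.0 - epoch / max_epoch
    return w * prior + (1.0 - w) * pred_iou
```

At epoch 0 the label is pure prior; by the final epoch it equals the prediction quality, mirroring the abstract's claim that the weight of prior knowledge decreases as predictions improve.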
{"title":"CRENet: Crowd region enhancement network for multi-person 3D pose estimation","authors":"Zhaokun Li, Qiong Liu","doi":"10.1016/j.imavis.2024.105243","DOIUrl":"10.1016/j.imavis.2024.105243","url":null,"abstract":"<div><p>Recovering multi-person 3D poses from a single image is a challenging problem due to inherent depth ambiguities, including root-relative depth and absolute root depth. Current bottom-up methods show promising potential to mitigate absolute root depth ambiguity by explicitly aggregating global contextual cues. However, these methods treat the entire image region equally during root depth regression, ignoring the negative impact of irrelevant regions. Moreover, they learn shared features for both depths, even though each depends on different information. This sharing mechanism may cause negative transfer, diminishing root depth prediction accuracy. To address these challenges, we present a novel bottom-up method, the Crowd Region Enhancement Network (CRENet), incorporating a Feature Decoupling Module (FDM) and a Global Attention Module (GAM). FDM explicitly learns a discriminative feature for each depth by adaptively recalibrating its channel-wise responses and fusing multi-level features, which makes the model focus on each depth prediction separately and thus avoids the adverse effect of negative transfer. GAM highlights crowd regions while suppressing irrelevant regions using the attention mechanism, and further refines the attention regions based on a confidence measure of the attention, which helps the model learn depth-related cues from informative crowd regions and facilitates root depth estimation. 
Comprehensive experiments on the MuPoTS-3D and CMU Panoptic benchmarks demonstrate that our method outperforms state-of-the-art bottom-up methods in absolute 3D pose estimation and applies to in-the-wild images, indicating that learning depth-specific features and suppressing noise signals can significantly benefit multi-person absolute 3D pose estimation.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105243"},"PeriodicalIF":4.2,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
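The abstract describes FDM as "adaptively recalibrating channel-wise responses"; one plausible reading is a squeeze-and-excitation-style gate per depth branch. A minimal pure-Python sketch under that assumption (function name and weight shapes are illustrative, not the paper's):

```python
import math

def recalibrate(channels, w1, w2):
    """channels: list of C equal-size 2-D feature maps; w1 (C x H) and
    w2 (H x C) are the weights of a small bottleneck excitation MLP."""
    C = len(channels)
    # Squeeze: global average pooling per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in channels]
    # Excitation: ReLU bottleneck followed by sigmoid gates.
    h = [max(0.0, sum(z[i] * w1[i][j] for i in range(C)))
         for j in range(len(w1[0]))]
    g = [1.0 / (1.0 + math.exp(-sum(h[k] * w2[k][j] for k in range(len(h)))))
         for j in range(C)]
    # Scale: reweight each channel by its gate.
    return [[[v * g[c] for v in row] for row in channels[c]] for c in range(C)]
```

Running one such gate per depth head would give each branch its own channel weighting, which is the decoupling effect the abstract attributes to FDM.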
{"title":"Dual subspace clustering for spectral-spatial hyperspectral image clustering","authors":"Shujun Liu","doi":"10.1016/j.imavis.2024.105235","DOIUrl":"10.1016/j.imavis.2024.105235","url":null,"abstract":"<div><p>Subspace clustering assumes that hyperspectral image (HSI) pixels lie in a union of multiple sample subspaces without considering their dual space, i.e., the spectral space. In this article, we propose dual subspace clustering (DualSC) to improve spectral-spatial HSI clustering by relaxing subspace clustering. To this end, DualSC simultaneously optimizes row and column subspace representations of HSI superpixels to capture the intrinsic connection between spectral and spatial information. From this perspective, original subspace clustering can be treated as a special case of DualSC, which has a larger solution space and thus tends to find a better sample representation matrix for spectral clustering. Besides, we provide theoretical proofs showing that the proposed method relaxes subspace clustering with dual subspaces and can recover the subspace-sparse representation of HSI samples. To the best of our knowledge, this work is among the first dual clustering methods leveraging sample and spectral subspaces simultaneously. 
We conduct clustering experiments on four canonical data sets, showing that our method, which offers strong interpretability, reaches performance and computational efficiency comparable to other state-of-the-art methods.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105235"},"PeriodicalIF":4.2,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
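The abstract does not state DualSC's objective; one plausible ridge-regularized instantiation of joint column (sample) and row (spectral) self-representation, written purely as an illustration:

```latex
\min_{C,\,R}\; \|X - XC\|_F^2 \;+\; \|X - RX\|_F^2
  \;+\; \lambda\bigl(\|C\|_F^2 + \|R\|_F^2\bigr)
```

with closed-form minimizers $C = (X^\top X + \lambda I)^{-1} X^\top X$ and $R = X X^\top (X X^\top + \lambda I)^{-1}$; spectral clustering would then run on an affinity such as $|C| + |C^\top|$. Dropping the $R$ term recovers ordinary subspace clustering, consistent with the abstract's special-case claim.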
{"title":"Pro-ReID: Producing reliable pseudo labels for unsupervised person re-identification","authors":"Haiming Sun, Shiwei Ma","doi":"10.1016/j.imavis.2024.105244","DOIUrl":"10.1016/j.imavis.2024.105244","url":null,"abstract":"<div><p>Mainstream unsupervised person ReIDentification (ReID) alternates between clustering and fine-tuning to improve task performance, but the clustering process inevitably produces noisy pseudo labels, which seriously constrain further performance gains. To address these concerns, we propose the novel Pro-ReID framework, which produces reliable person samples from the pseudo-labeled dataset for learning feature representations. It consists of two modules: Pseudo Labels Correction (PLC) and Pseudo Labels Selection (PLS). Specifically, we leverage temporal ensemble prior knowledge to promote task performance. The PLC module assigns a soft pseudo label to each sample, with control over the degree of soft pseudo label participation, to potentially correct noisy pseudo labels generated during clustering; the PLS module associates the predictions of the temporal ensemble model with pseudo label annotations and detects noisy pseudo-label examples as out-of-distribution examples by fitting a Gaussian Mixture Model (GMM) to their loss distribution, supplying reliable pseudo labels for the unsupervised person ReID task. 
Experimental results on three person (Market-1501, DukeMTMC-reID and MSMT17) and one vehicle (VeRi-776) ReID benchmarks establish that the Pro-ReID framework achieves competitive performance; in particular, its mAP on the challenging MSMT17 is 4.3% higher than that of state-of-the-art methods.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105244"},"PeriodicalIF":4.2,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142129876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
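The abstract says PLS fits a GMM to per-sample losses to flag noisy pseudo labels. A self-contained sketch of that step, using a hand-rolled two-component 1-D EM in place of a library GMM (all names are hypothetical; a real implementation would likely use scikit-learn's GaussianMixture):

```python
import math

def split_by_loss(losses, iters=50):
    """Fit a 2-component 1-D Gaussian mixture to per-sample losses by EM
    and flag samples assigned to the high-loss component as noisy."""
    mu = [min(losses), max(losses)]       # crude initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities of each component for each loss.
        resp = []
        for x in losses:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300   # guard against underflow
            resp.append([p[0] / s, p[1] / s])
        # M-step: update means, variances, and mixing weights.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            mu[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var[k] = max(1e-6, sum(r[k] * (x - mu[k]) ** 2
                                   for r, x in zip(resp, losses)) / nk)
            pi[k] = nk / len(losses)
    hi = int(mu[1] > mu[0])               # component with the larger mean loss
    return [r[hi] > 0.5 for r in resp]    # True -> likely noisy pseudo label
```

Samples flagged True would be treated as out-of-distribution and excluded from (or down-weighted in) fine-tuning, as the abstract describes.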
{"title":"Language conditioned multi-scale visual attention networks for visual grounding","authors":"Haibo Yao, Lipeng Wang, Chengtao Cai, Wei Wang, Zhi Zhang, Xiaobing Shang","doi":"10.1016/j.imavis.2024.105242","DOIUrl":"10.1016/j.imavis.2024.105242","url":null,"abstract":"<div><p>Visual grounding (VG) is a task that requires locating a specific region in an image according to a natural language expression. Existing efforts on the VG task are divided into two-stage, one-stage and Transformer-based methods, which have achieved high performance. However, most previous methods extract visual information at a single spatial scale and ignore visual information at other spatial scales, leaving these models unable to fully utilize the visual information. Moreover, insufficient utilization of linguistic information, especially failure to capture global linguistic information, may lead to incomplete understanding of language expressions, thus limiting the performance of these models. To better address the task, we propose a language conditioned multi-scale visual attention network (LMSVA) for visual grounding, which can sufficiently utilize visual and linguistic information to perform multimodal reasoning, thus improving model performance. Specifically, we design a visual feature extractor containing a multi-scale layer to obtain the required multi-scale visual features by expanding the original backbone. Moreover, we pool the output of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to extract sentence-level linguistic features, enabling the model to capture global linguistic information. Inspired by the Transformer architecture, we present the Visual Attention Layer guided by Language and Multi-Scale Visual Features (VALMS), which better learns the visual context under the guidance of multi-scale visual and linguistic features and facilitates further multimodal reasoning. 
Extensive experiments on four large benchmark datasets, including ReferItGame, RefCOCO, RefCOCO+ and RefCOCOg, demonstrate that our proposed model achieves state-of-the-art performance.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105242"},"PeriodicalIF":4.2,"publicationDate":"2024-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning facial structural dependency in 3D aligned space for face alignment","authors":"Biying Li , Zhiwei Liu , Jinqiao Wang","doi":"10.1016/j.imavis.2024.105241","DOIUrl":"10.1016/j.imavis.2024.105241","url":null,"abstract":"<div><p>The statistical characteristics of facial structure offer pivotal prior information for facial landmark prediction, forming inter-dependencies among different landmarks. Such inter-dependencies ensure that predictions adhere to the shape distribution typical of natural faces. In challenging scenarios like occlusions or extreme facial poses, this structure becomes indispensable, as it helps predict elusive landmarks from more discernible ones. While current deep learning methods do capture these landmark dependencies, this is often an implicit process heavily reliant on vast training datasets. We contend that such implicit modeling approaches fail to manage more challenging situations. In this paper, we propose a new method that harnesses the facial structure and explicitly explores inter-dependencies among facial landmarks in an end-to-end fashion. We propose a Structural Dependency Learning Module (SDLM) that uses 3D face information to map facial features into a canonical UV space, in which the facial structure is explicitly 3D semantically aligned. Besides, to explore the global relationships between facial landmarks, we take advantage of the self-attention mechanism in the image and UV spaces. We name the proposed method Facial Structure-based Face Alignment (FSFA). FSFA reinforces the landmark structure, especially under challenging conditions. 
Extensive experiments demonstrate that FSFA achieves state-of-the-art performance on the WFLW, 300W, AFLW, and COFW68 datasets.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105241"},"PeriodicalIF":4.2,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142083951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning accurate monocular 3D voxel representation via bilateral voxel transformer","authors":"Tianheng Cheng , Haoyi Jiang , Shaoyu Chen , Bencheng Liao , Qian Zhang , Wenyu Liu , Xinggang Wang","doi":"10.1016/j.imavis.2024.105237","DOIUrl":"10.1016/j.imavis.2024.105237","url":null,"abstract":"<div><p>Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images is still challenging owing to the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM) to project image features to dense 3D voxels severely suffer from the ambiguous projection problem. In this research, we present <strong>Bilateral Voxel Transformer</strong> (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT exploits a bilateral architecture composed of two branches for preserving the high-resolution 3D voxel representation while aggregating contexts through the proposed Tri-Axial Transformer simultaneously. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging Semantic KITTI dataset, considerably outperforming previous works for semantic scene completion with monocular images. 
The code and models of BVT will be available on <span><span>GitHub</span></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105237"},"PeriodicalIF":4.2,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simultaneous image patch attention and pruning for patch selective transformer","authors":"Sunpil Kim , Gang-Joon Yoon , Jinjoo Song , Sang Min Yoon","doi":"10.1016/j.imavis.2024.105239","DOIUrl":"10.1016/j.imavis.2024.105239","url":null,"abstract":"<div><p>Vision transformer models provide superior performance compared to convolutional neural networks for various computer vision tasks but require increased computational overhead with large datasets. This paper proposes a patch selective vision transformer that effectively selects patches to reduce computational costs while simultaneously extracting global and local self-representative patch information to maintain performance. The inter-patch attention in the transformer encoder emphasizes meaningful features by capturing the inter-patch relationships of features, and dynamic patch pruning is applied to the attentive patches using a learnable soft threshold that measures the maximum multi-head importance scores. The proposed patch attention and pruning method provides constraints to exploit dominant feature maps in conjunction with self-attention, thus avoiding the propagation of noisy or irrelevant information. 
The proposed patch-selective transformer also helps address computer vision challenges such as scale variation, background clutter, and partial occlusion, resulting in a lightweight, general-purpose vision transformer suitable for mobile devices.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105239"},"PeriodicalIF":4.2,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142083950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
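The pruning rule described in this record — keep a patch when its maximum importance score over attention heads clears a threshold — can be sketched as follows (in the paper the threshold is learnable; here it is simply a given float, and all names are illustrative):

```python
def prune_patches(scores, threshold):
    """scores[h][p]: importance of patch p in attention head h.
    Returns indices of patches whose maximum score over heads exceeds
    the threshold; the remaining patches are pruned."""
    num_patches = len(scores[0])
    keep = []
    for p in range(num_patches):
        if max(scores[h][p] for h in range(len(scores))) > threshold:
            keep.append(p)
    return keep
```

Taking the max over heads means a patch survives if any head finds it important, which is a conservative pruning criterion.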
{"title":"HEDehazeNet: Unpaired image dehazing via enhanced haze generation","authors":"Wentao Li , Deming Fan , Qi Zhu, Zhanjiang Gao, Hao Sun","doi":"10.1016/j.imavis.2024.105236","DOIUrl":"10.1016/j.imavis.2024.105236","url":null,"abstract":"<div><p>Unpaired image dehazing models based on Cycle-Consistent Adversarial Networks (CycleGAN) typically consist of two cycle branches: a dehazing-rehazing branch and a hazing-dehazing branch. Between these two branches there is an information asymmetry in the mutual transformation between haze and haze-free images. Previous models tended to focus more on the transformation from haze images to haze-free images within the dehazing-rehazing branch, overlooking the provision of effective information for the formation of haze images in the hazing-dehazing branch. This oversight results in haze patterns that are monotonous and simplistic, ultimately impeding the overall performance and generalization of dehazing networks. This paper therefore proposes a novel model called HEDehazeNet (Dehazing Network based on Haze Generation Enhancement), which provides crucial information for the haze image generation process through a dedicated haze generation enhancement module. This module produces three modes of transmission map: random, simulated, and a mixture of the two. Employing these transmission maps to generate hazy images of varying density and pattern provides the dehazing network with a more diverse and dynamically complex set of training samples, enhancing its capacity to handle intricate scenes. Additionally, we make minor modifications to the U-Net, replacing residual blocks with multi-scale parallel convolutional blocks and channel self-attention, to further enhance the network's performance. 
Experiments on both synthetic and real-world datasets substantiate the superiority of HEDehazeNet over current state-of-the-art unpaired dehazing models.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105236"},"PeriodicalIF":4.2,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S026288562400341X/pdfft?md5=fed6ad904a3f88e450cfdc7c4feb5004&pid=1-s2.0-S026288562400341X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
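The random and simulated transmission maps this record describes can be illustrated with the standard atmospheric scattering model I = J·t + A·(1 − t). The sketch below (grayscale, per-pixel, with hypothetical parameter ranges) shows the two modes such a haze generation module might emit:

```python
import math
import random

def synthesize_haze(J, A=0.9, mode="random", beta=1.0, depth=None):
    """J: 2-D list of clear-image intensities in [0, 1].  'random' draws a
    per-pixel transmission t; 'simulated' derives t from a depth map via
    t = exp(-beta * depth), per the atmospheric scattering model."""
    hazy = []
    for y in range(len(J)):
        row = []
        for x in range(len(J[0])):
            if mode == "random":
                t = random.uniform(0.3, 1.0)
            else:  # 'simulated'
                t = math.exp(-beta * depth[y][x])
            row.append(J[y][x] * t + A * (1.0 - t))  # I = J*t + A*(1-t)
        hazy.append(row)
    return hazy
```

A "mixed" mode would blend the two transmission maps; it is omitted here for brevity.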
{"title":"Detecting adversarial samples by noise injection and denoising","authors":"Han Zhang , Xin Zhang , Yuan Sun , Lixia Ji","doi":"10.1016/j.imavis.2024.105238","DOIUrl":"10.1016/j.imavis.2024.105238","url":null,"abstract":"<div><p>Deep learning models are highly vulnerable to adversarial examples, which has drawn significant attention to techniques for detecting them. However, current methods primarily rely on detecting image features to identify adversarial examples, often failing to address the diverse types and intensities of such examples. We propose a novel adversarial example detection method based on perturbation estimation and denoising to overcome this limitation. We develop an autoencoder to predict the latent adversarial perturbation of a sample and select appropriately sized noise based on this prediction to cover the perturbation. Subsequently, we employ a non-blind denoising autoencoder to remove the noise and residual perturbation effectively. This approach eliminates adversarial perturbations while preserving the original information, altering the prediction results of adversarial examples without affecting predictions on benign samples. An input is flagged as adversarial when the model's predictions before and after this processing disagree. 
Our experiments on datasets such as MNIST, CIFAR-10, and ImageNet demonstrate that our method surpasses other advanced detection methods in accuracy.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105238"},"PeriodicalIF":4.2,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142129875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
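The detection criterion above — compare the model's prediction on the raw input with its prediction after noise injection and denoising — reduces to a small wrapper. Everything below is a toy illustration: `classify`, `add_noise`, and `denoise` stand in for the classifier and the two autoencoders the record describes:

```python
def is_adversarial(classify, denoise, add_noise, x):
    """Flag x as adversarial when the class prediction flips after the
    inject-noise-then-denoise pipeline."""
    return classify(x) != classify(denoise(add_noise(x)))
```

For benign inputs the pipeline should be label-preserving, so only adversarial inputs trigger a prediction flip.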