DetailCaptureYOLO: Accurately Detecting Small Targets in UAV Aerial Images
Fengxi Sun, Ning He, Runjie Li, Hongfei Liu, Yuxiang Zou
Journal of Visual Communication and Image Representation, Volume 106, Article 104349 (published 2024-11-28). DOI: 10.1016/j.jvcir.2024.104349

Abstract: Unmanned aerial vehicle (UAV) aerial imagery is dominated by small objects, so obtaining feature maps with more detailed information is crucial for target detection. This paper therefore presents an improved algorithm based on YOLOv9, named DetailCaptureYOLO, which has a strong ability to capture detailed features. First, a dynamic fusion path aggregation network is proposed to dynamically fuse multi-level and multi-scale feature maps, effectively preserving information integrity and producing richer features. Additionally, more flexible dynamic upsampling and wavelet transform-based downsampling operators are used to optimize the sampling operations. Finally, Inner-IoU is incorporated into Powerful-IoU, effectively enhancing the network's ability to detect small targets. The neck improvement proposed in this paper can be transferred to mainstream object detection algorithms. When applied to YOLOv9, AP50, mAP and AP-small improved by 8.5%, 5.5% and 7.2% on the VisDrone dataset; when applied to other algorithms, the improvements in AP50 were 5.1%–6.5%. Experimental results demonstrate that the proposed method excels in detecting small targets and exhibits strong transferability. Code is available at https://github.com/SFXSunFengXi/DetailCaptureYOLO.

Global–local prompts guided image-text embedding, alignment and aggregation for multi-label zero-shot learning
Tiecheng Song, Yu Huang, Feng Yang, Anyong Qin, Yue Zhao, Chenqiang Gao
Journal of Visual Communication and Image Representation, Volume 106, Article 104347 (published 2024-11-28). DOI: 10.1016/j.jvcir.2024.104347

Abstract: Multi-label zero-shot learning (MLZSL) aims to classify images into multiple unseen label classes, which is a practical yet challenging task. Recent methods have used vision-language models (VLMs) for MLZSL, but they do not adequately consider the global and local semantic relationships when aligning images and texts, yielding limited classification performance. In this paper, we propose a novel MLZSL approach, named global–local prompts guided image-text embedding, alignment and aggregation (GLP-EAA), to alleviate this problem. Specifically, based on a parameter-frozen VLM, we divide the image into patches and explore a simple adapter to obtain global and local image embeddings. Meanwhile, we design global-local prompts to obtain text embeddings of different granularities. Then, we introduce global–local alignment losses to establish image-text consistencies at different granularity levels. Finally, we aggregate global and local scores to compute the multi-label classification loss; the aggregated scores are also used for inference. As such, our approach integrates prompt learning, image-text alignment and classification score aggregation into a unified learning framework. Experimental results on the NUS-WIDE and MS-COCO datasets demonstrate the superiority of our approach over state-of-the-art methods on both ZSL and generalized ZSL tasks.

FormerPose: An efficient multi-scale fusion Transformer network based on RGB-D for 6D pose estimation
Pihong Hou, Yongfang Zhang, Yi Wu, Pengyu Yan, Fuqiang Zhang
Journal of Visual Communication and Image Representation, Volume 106, Article 104346 (published 2024-11-28). DOI: 10.1016/j.jvcir.2024.104346

Abstract: 6D pose estimation based on RGB-D data plays a crucial role in object localization and is widely used in robotics. However, traditional CNN-based methods often face limitations, particularly in scenes with complex visuals characterized by minimal features or occlusions. To address these limitations, we propose a novel holistic 6D pose estimation method called FormerPose. It leverages an efficient multi-scale fusion Transformer network based on RGB-D data to directly regress the object's pose. FormerPose can efficiently extract the color and geometric features of objects at different scales and fuse them using self-attention and a dense fusion method, making it suitable for more restricted scenes. The proposed network realizes an improved trade-off between computational efficiency and model performance, achieving superior results on benchmark datasets, including LineMOD, LineMOD-Occlusion, and YCB-Video. In addition, the robustness and practicability of the method are further verified by a series of robot grasping experiments.

Contour-based object forecasting for autonomous driving
Jaeseok Jang, Dahyun Kim, Dongkwon Jin, Chang-Su Kim
Journal of Visual Communication and Image Representation, Volume 106, Article 104343 (published 2024-11-28). DOI: 10.1016/j.jvcir.2024.104343

Abstract: This paper proposes a novel algorithm, called contour-based object forecasting (COF), to simultaneously perform contour-based segmentation and depth estimation of objects in future frames for autonomous driving systems. The proposed algorithm consists of encoding, future forecasting, decoding, and 3D rendering stages. First, we extract the features of observed frames, including past and current frames. Second, from these causal features, we predict the features of future frames using the future forecast module. Third, we decode the predicted features into contour and depth estimates, obtaining object depth maps aligned with segmentation masks via depth completion guided by the predicted contours. Finally, from the prediction results, we render the forecasted objects in 3D space. Experimental results demonstrate that the proposed algorithm reliably forecasts the contours and depths of objects in future frames and that the 3D rendering results intuitively visualize the future locations of the objects.

Person re-identification transformer with patch attention and pruning
Fabrice Ndayishimiye, Gang-Joon Yoon, Joonjae Lee, Sang Min Yoon
Journal of Visual Communication and Image Representation, Volume 106, Article 104348 (published 2024-11-26). DOI: 10.1016/j.jvcir.2024.104348

Abstract: Person re-identification (Re-ID), which is widely used in surveillance and tracking systems, aims to match individuals across different camera views as they move, maintaining identity throughout. Recent advancements have introduced convolutional neural networks (CNNs) and vision transformers (ViTs) as promising solutions. While CNN-based methods excel at local feature extraction, ViTs have emerged as effective alternatives for person Re-ID, offering the ability to capture long-range dependencies through multi-head self-attention without relying on convolution and downsampling. However, Re-ID still faces challenges such as changes in illumination, viewpoint, and pose, low resolution, and partial occlusion. To address the limitations of widely used person Re-ID datasets and improve generalization, we present a novel person Re-ID method that enhances global and local information interactions using self-attention modules within a ViT network. It leverages dynamic pruning to extract and prioritize essential image patches effectively. The designed patch selection and pruning yield a feature extractor that remains robust even under partial occlusion, background clutter, and illumination variation. Empirical validation demonstrates its superior performance compared to previous approaches and its adaptability across various domains.

Illumination-guided dual-branch fusion network for partition-based image exposure correction
Jianming Zhang, Jia Jiang, Mingshuang Wu, Zhijian Feng, Xiangnan Shi
Journal of Visual Communication and Image Representation, Volume 106, Article 104342 (published 2024-11-22). DOI: 10.1016/j.jvcir.2024.104342

Abstract: Images captured in the wild often suffer from under-exposure, over-exposure, or sometimes a combination of both, and tend to lose details and texture due to uneven exposure. Most current image enhancement methods focus on correcting either under-exposure or over-exposure; only a few can effectively handle both problems simultaneously. To address these issues, a novel partition-based exposure correction method is proposed. Firstly, our method computes an illumination map to generate a partition mask that divides the original image into under-exposed and over-exposed areas. Then, we propose a Transformer-based parameter estimation module to estimate the dual gamma values for partition-based exposure correction. Finally, we introduce a dual-branch fusion module that merges the original image with the exposure-corrected image to obtain the final result. It is worth noting that the illumination map plays a guiding role in both the dual gamma parameter estimation and the dual-branch fusion. Extensive experiments demonstrate that the proposed method consistently outperforms state-of-the-art (SOTA) methods on 9 datasets with paired or unpaired samples. Our code is available at https://github.com/csust7zhangjm/ExposureCorrectionWMS.
{"title":"Enhanced soft domain adaptation for object detection in the dark","authors":"Yunfei Bai , Chang Liu , Rui Yang , Xiaomao Li","doi":"10.1016/j.jvcir.2024.104337","DOIUrl":"10.1016/j.jvcir.2024.104337","url":null,"abstract":"<div><div>Unlike foggy conditions, domain adaptation is rarely facilitated in dark detection tasks due to the lack of dark datasets. We generate target low-light images via swapping the ring-shaped frequency spectrum of Exdark with Cityscapes, and surprisingly find the promotion is less satisfactory. The root lies in non-transferable alignment that excessively highlights dark backgrounds. To tackle this issue, we propose an Enhanced Soft Domain Adaptation (ESDA) framework to focus on background misalignment. Specifically, Soft Domain Adaptation (SDA) compensates for over-alignment of backgrounds by providing different soft labels for foreground and background samples. The Highlight Foreground (HF), by introducing center sampling, increases the number of high-quality background samples for training. Suppress Background (SB) weakens non-transferable background alignment by replacing foreground scores with backgrounds. Experimental results show SDA combined with HF and SB is sufficiently strengthened and achieves state-of-the-art performance using multiple cross-domain benchmarks. Note that ESDA yields 11.8% relative improvement on the real-world ExDark dataset.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"106 ","pages":"Article 104337"},"PeriodicalIF":2.6,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142743784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HRGUNet: A novel high-resolution generative adversarial network combined with an improved UNet method for brain tumor segmentation","authors":"Dongmei Zhou, Hao Luo, Xingyang Li, Shengbing Chen","doi":"10.1016/j.jvcir.2024.104345","DOIUrl":"10.1016/j.jvcir.2024.104345","url":null,"abstract":"<div><div>Brain tumor segmentation in MRI images is challenging due to variability in tumor characteristics and low contrast. We propose HRGUNet, which combines a high-resolution generative adversarial network with an improved UNet architecture to enhance segmentation accuracy. Our proposed GAN model uses an innovative discriminator design that is able to process complete tumor labels as input. This approach can better ensure that the generator produces realistic tumor labels compared to some existing GAN models that only use local features. Additionally, we introduce a Multi-Scale Pyramid Fusion (MSPF) module to improve fine-grained feature extraction and a Refined Channel Attention (RCA) module to enhance focus on tumor regions. In comparative experiments, our method was verified on the BraTS2020 and BraTS2019 data sets, and the average Dice coefficient increased by 1.5% and 1.2% respectively, and the Hausdorff distance decreased by 23.9% and 15.2% respectively, showing its robustness and generalization for segmenting complex tumor structures.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104345"},"PeriodicalIF":2.6,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142706865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Panoramic Arbitrary Style Transfer with Deformable Distortion Constraints
Wujian Ye, Yue Wang, Yijun Liu, Wenjie Lin, Xin Xiang
Journal of Visual Communication and Image Representation, Volume 106, Article 104344 (published 2024-11-19). DOI: 10.1016/j.jvcir.2024.104344

Abstract: Neural style transfer is a prominent AI technique for creating captivating visual effects and enhancing user experiences. However, most current methods handle panoramic images inadequately, losing the original visual semantics and emotions because structural features are insufficiently considered. To address this, a novel panoramic arbitrary style transfer method named PAST-Renderer is proposed, integrating deformable convolutions and distortion constraints. The proposed method can dynamically adjust the positions of the convolutional kernels according to the geometric structure of the input image, thereby better adapting to the spatial distortions and deformations in panoramic images. Deformable convolutions enable adaptive transformations on the two-dimensional plane, enhancing content and style feature extraction and fusion in panoramic images. Distortion constraints adjust the content and style losses, ensuring semantic consistency in saliency, edges, and depth of field with the original image. Experimental results show significant improvements: the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) of the stylized panoramic images' semantic maps increase by approximately 2–4 dB and 0.1–0.3, respectively. PAST-Renderer performs well in both artistic and realistic style transfer, preserving semantic integrity with natural colors, realistic edge details, and rich thematic content.

Underwater image enhancement method via extreme enhancement and ultimate weakening
Yang Zhou, Qinghua Su, Zhongbo Hu, Shaojie Jiang
Journal of Visual Communication and Image Representation, Volume 105, Article 104341 (published 2024-11-16). DOI: 10.1016/j.jvcir.2024.104341

Abstract: Existing histogram-based methods for underwater image enhancement are prone to over-enhancement, which hampers the analysis of the enhanced images. Balancing contrast by both enhancing and weakening the contrast of an image can address this problem. Therefore, an underwater image enhancement method based on extreme enhancement and ultimate weakening (EEUW) is proposed in this paper. The approach comprises two main steps. First, an image with extreme contrast is obtained by applying the grey prediction evolution algorithm (GPE); this is the first time GPE has been introduced into a dual-histogram thresholding method to find the optimal segmentation threshold for accurate segmentation. Second, a pure gray image is obtained through a fusion strategy based on the gray world assumption to achieve the ultimate weakening. Experiments conducted on three standard underwater image benchmark datasets validate that EEUW outperforms 10 state-of-the-art methods in improving the contrast of underwater images.