Object detection-based deep autoencoder hashing image retrieval
Uğur Erkan, Ahmet Yilmaz, Abdurrahim Toktas, Qiang Lai, Suo Gao
Signal Processing: Image Communication, Volume 138, Article 117384. DOI: 10.1016/j.image.2025.117384. Published 2025-07-18.

Abstract: Image Retrieval (IR), which returns similar images from a large image database, has become an important task as multimedia data grows. Existing studies use hash codes representing image features generated from the whole image, including redundant semantics from the background. In this study, a novel Object Detection-based Hashing IR (ODH-IR) scheme using You Only Look Once (YOLO) and an autoencoder is presented to ignore clutter in the images. Integrating YOLO with the autoencoder yields the most representative hash code, based on the meaningful objects in the image. The autoencoder is exploited to compress the detected-object vector to the desired hash-code bit length. The ODH-IR scheme is validated against the state of the art on three well-known datasets in terms of precision metrics. ODH-IR achieves the best result in 35 of 36 metric measurements and the best average mean rank of 1.03. Moreover, three illustrative IR examples show that it retrieves the most relevant semantics. The results demonstrate that ODH-IR is an effective scheme thanks to its hashing method built on object detection with YOLO and the autoencoder.

A dual-level part distillation network for fine-grained visual categorization
Xiangfen Zhang, Shitao Hong, Haixia Luo, Zhen Jiang, Feiniu Yuan
Signal Processing: Image Communication, Volume 138, Article 117383. DOI: 10.1016/j.image.2025.117383. Published 2025-07-16.

Abstract: Fine-Grained Visual Categorization (FGVC) remains a formidable challenge because of large intra-class variation and small inter-class variation, so that classes can only be distinguished by local details. Existing methods adopt part-detection modules to localize discriminative regions and extract part-level features, which offer crucial supplementary information for FGVC. However, these methods suffer from high computational complexity stemming from part detection and part-level feature extraction, and they lack connectivity between different parts. To solve these problems, we propose a Dual-level Part Distillation Network (DPD-Net) for FGVC. DPD-Net extracts features at both the object and part levels. At the object level, we first use residual networks to extract middle- and high-level features, generate middle and high object-level predictions from them, and concatenate these two predictions to produce the final output. At the part level, we use a part-detection module to localize discriminative parts for extracting part-level features, add the features of different parts point-wise to generate an averaged part-level prediction, and concatenate the part features to produce a concatenated part-level prediction. We use knowledge distillation to transfer information from the averaged and concatenated part-level predictions to the middle and high object-level predictions, respectively. To supervise training, we design five losses: the pair-wise consistency of detected parts, the concatenated final prediction, the averaged part-level prediction, the cosine-embedding loss, and the concatenated part-level prediction. Experimental results show that DPD-Net achieves state-of-the-art performance on three fine-grained visual recognition benchmarks. In addition, DPD-Net can be trained end-to-end without extra annotations.

Mining the Salient Spatio-Temporal Feature with S²TF-Net for action recognition
Xiaoxi Liu, Ju Liu, Lingchen Gu, Yafeng Li, Xiaojun Chang, Feiping Nie
Signal Processing: Image Communication, Volume 138, Article 117381. DOI: 10.1016/j.image.2025.117381. Published 2025-07-15.

Abstract: Recently, 3D Convolutional Neural Networks (3D ConvNets) have been widely exploited for action recognition and have achieved satisfying performance. However, the most useful action features are often drowned in large amounts of irrelevant information, which greatly increases the difficulty of video representation. To find a generic, cost-efficient way to balance parameters and performance, we present a novel network that mines the Salient Spatio-Temporal Feature on a 3D ConvNets backbone for action recognition, termed S²TF-Net. First, we extract the salient features of each 3D residual block by constructing a multi-scale module for Salient Semantic Feature mining (SSF-Module). Then, to preserve the salient features through pooling operations, we establish a Two-branch Salient Feature Preserving Module (TSFP-Module). With a suitable loss function, these two modules can be combined in an "easy-to-concat" fashion with most 3D ResNet backbones to classify more accurately, even with a shallower network. Finally, we conduct experiments on three popular action recognition datasets, where S²TF-Net is competitive with deeper 3D backbones and current state-of-the-art results. Taking P3D, 3D ResNet, Non-local I3D and X3D as baselines, the proposed method improves each of them to varying degrees. In particular, for Non-local I3D ResNet, S²TF-Net improves accuracy by 4.1%, 3.0% and 4.6% on the Kinetics-400, UCF101 and HMDB51 datasets, reaching 74.8%, 95.1% and 80.9%, respectively. We hope this study will provide useful inspiration and experience for future research on more cost-effective methods. Code is released at: https://github.com/xiaoxiAries/S2TFNet.

360-degree video super resolution and quality enhancement challenge: Methods and results
Ahmed Telili, Wassim Hamidouche, Ibrahim Farhat, Hadi Amirpour, Christian Timmerer, Ibrahim Khadraoui, Jiajie Lu, The Van Le, Jeonneung Baek, Jin Young Lee, Yiying Wei, Xiaopeng Sun, Yu Gao, JianCheng Huang, Yujie Zhong
Signal Processing: Image Communication, Volume 138, Article 117376. DOI: 10.1016/j.image.2025.117376. Published 2025-07-11.

Abstract: Omnidirectional (360-degree) video is rapidly gaining popularity due to advances in immersive technologies like virtual reality (VR) and extended reality (XR). However, real-time streaming of such videos, particularly in live mobile scenarios such as unmanned aerial vehicles (UAVs), is hindered by limited bandwidth and strict latency constraints. While traditional methods such as compression and adaptive resolution are helpful, they often compromise video quality and introduce artifacts that diminish the viewer's experience. Additionally, the unique spherical geometry of 360-degree video, with its wide field of view, presents challenges not encountered in traditional 2D video. To address these challenges, we initiated the 360-degree Video Super Resolution and Quality Enhancement challenge. This competition encourages participants to develop efficient machine learning (ML)-powered solutions to enhance the quality of low-bitrate compressed 360-degree videos, under two tracks focusing on 2× and 4× super-resolution (SR). In this paper, we outline the challenge framework, detailing the two competition tracks and highlighting the SR solutions proposed by the top-performing models. We assess these models within a unified framework, considering (i) quality enhancement, (ii) bitrate gain, and (iii) computational efficiency. Our findings show that lightweight single-frame models can effectively balance visual quality and runtime performance under constrained conditions, setting strong baselines for future research. These insights offer practical guidance for advancing real-time 360-degree video streaming, particularly in bandwidth-limited immersive applications.

{"title":"CNN-augmented SAR image despeckling using modified speckle reducing anisotropic diffusion and discrete wavelet transform","authors":"Satyakam Baraha , Buddepu Santhosh Kumar , Abhijit Mishra , Monalisa Ghosh","doi":"10.1016/j.image.2025.117380","DOIUrl":"10.1016/j.image.2025.117380","url":null,"abstract":"<div><div>Speckle, a multiplicative granular noise, inherently appears in coherent imaging techniques such as synthetic aperture radar (SAR). It deteriorates the visual quality of images, which leads to difficulty in image interpretation for further analysis. Hence, speckle filtering is essential to recover the image details for applications like segmentation and classification. Several despeckling techniques have been developed in the literature, among which anisotropic diffusion (AD) and discrete wavelet transform (DWT) based methods have achieved state-of-the-art despeckling performance. However, AD cannot be employed indefinitely owing to blurring and detail loss. Similarly, DWT produces spurious noise around edges. This paper proposes a high-performance despeckling technique that uses modified speckle reducing anisotropic diffusion as the preprocessing step in a homomorphic architecture. The architecture uses discrete wavelet transform, dynamic weighted adaptive thresholding (DWAT), weighted least squares, and guided filtering to recover the clean image. In addition, to enhance the performance of the despeckling process, a convolutional neural network (CNN) is used as a subsequent processing module to remove residual speckle while preserving the edges. The CNN uses a supervised learning paradigm trained on simulated speckled and clean image pairs to fine-tune the despeckled output. Subjective (visual) and objective evaluations on both simulated and real SAR datasets demonstrate that the proposed hybrid approach achieves robust despeckling performance, particularly excelling in edge preservation, radiometric consistency, and detail reconstruction across varied scene types as compared to the existing methods.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"138 ","pages":"Article 117380"},"PeriodicalIF":3.4,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144572693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Edge-preserving smoothing using the truncated ℓp minimization
Biao Fang, Xiaoguang Lv, Guoliang Zhu, Jiaqi Mei, Le Jiang
Signal Processing: Image Communication, Volume 138, Article 117378. DOI: 10.1016/j.image.2025.117378. Published 2025-06-25.

Abstract: Edge-preserving smoothing is a fundamental task in visual processing and computational photography. This paper presents a nonconvex variational optimization model for edge-preserving smoothing, using a truncated ℓp function as the regularization term and the weighted ℓ2 norm as the fidelity term. The truncated ℓp function penalizes gradients below a given threshold, while the weighted ℓ2 norm is preferred over the ℓ1 and ℓ2 norms. The proposed model can preserve the salient edges of the input image and eliminate insignificant details. To solve the proposed nonconvex model, we design an effective algorithm based on the alternating direction method of multipliers (ADMM). The effectiveness of the proposed method is demonstrated by a variety of applications, including texture smoothing, clip-art compression artifact removal, image abstraction, image denoising, high dynamic range (HDR) tone mapping, detail enhancement, and flash and no-flash image restoration.

Stacked and decorrelated hashing with AdapTanh for large-scale fine-grained image retrieval
Xianxian Zeng, Jie Zhou, Canqing Ye, Jun Yuan, Jiawen Li, Jianjian Jiang, Rongjun Chen, Shun Liu
Signal Processing: Image Communication, Volume 138, Article 117374. DOI: 10.1016/j.image.2025.117374. Published 2025-06-25.

Abstract: Large-scale fine-grained image retrieval is a challenging task in computer vision, often addressed through learning to hash. Current methods typically use deep neural networks to create compact hash functions, but feature extraction through fusion or cascading can introduce coupling, limiting model generalization. To overcome this, we propose a multi-model stacked and decorrelated hashing approach, utilizing parallel backbone networks as feature extractors. A decorrelation objective, based on diagonal matrices, minimizes feature correlation, ensuring diverse hashing features. We also introduce a relaxation strategy to enhance the sensitivity of the output layer to fine-grained features. Experiments on various datasets demonstrate our model's superior retrieval performance over state-of-the-art deep hashing methods.

{"title":"Light field video streaming on GPU","authors":"Tomáš Chlubna , Tomáš Milet , Pavel Zemčík","doi":"10.1016/j.image.2025.117377","DOIUrl":"10.1016/j.image.2025.117377","url":null,"abstract":"<div><div>This paper proposes an efficient encoding method for light field video rendering in real time. Each frame of the light field video consists of a grid of images capturing the scene from different camera positions. The images are encoded by a video compression algorithm. The positions of the keyframes on the grid are automatically determined. The proposed compression uses GPU-accelerated HW video decoders. Data transfer between the host and the GPU memory is minimal. Only the packets necessary for the novel view synthesis are transferred. Standard video compression methods need to decode all packets between keyframes, and other existing light field compression methods focus solely on the best compression ratio. The proposed method outperforms them in the quality/decoding time ratio, which is the most important metric for the real-time rendering. The results presented show that currently existing alternatives cannot be used efficiently for 4K light field video streaming. A proof-of-concept light field player was implemented and is available to use. The proposal solves the memory and streaming requirements that are the most crucial issues in light field rendering. The paper additionally outlines enhancements to a current light field rendering technique, which has been modified to integrate effectively with the newly proposed encoding method.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"138 ","pages":"Article 117377"},"PeriodicalIF":3.4,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144482420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
YOLO-DC: Integrating deformable convolution and contextual fusion for high-performance object detection
Dengyong Zhang, Chuanzhen Xu, Jiaxin Chen, Lei Wang, Bin Deng
Signal Processing: Image Communication, Volume 138, Article 117373. DOI: 10.1016/j.image.2025.117373. Published 2025-06-21.

Abstract: Object detection is a fundamental task in computer vision, but existing methods often concentrate on optimizing model architectures, loss functions, and data preprocessing techniques, while frequently neglecting the potential improvements that advanced convolutional mechanisms can provide. Additionally, increasing the depth of deep learning networks can lead to the loss of essential feature information, highlighting the need for strategies that can further improve model accuracy. This paper introduces YOLO-DC, an algorithm that enhances object detection by incorporating deformable convolution and contextual mechanisms. YOLO-DC integrates a Deformable Convolutional Module (DCM) and a Contextual Information Fusion Downsampling Module (CFD). The DCM employs deformable convolution with multi-scale spatial-channel attention to effectively expand the receptive field and enhance feature extraction. In parallel, the CFD module leverages both contextual and local features during downsampling and incorporates global features to enhance joint learning and reduce information loss. Compared to YOLOv8-N, YOLO-DC-N achieves a significant improvement in Average Precision (AP), increasing by 3.5% to reach 40.8% on the Microsoft COCO 2017 dataset, while maintaining a comparable inference time. The model outperforms other state-of-the-art detection algorithms across various datasets, including the RUOD underwater dataset and the PASCAL VOC dataset (VOC2007 + VOC2012). The source code is available at https://github.com/Object-Detection-01/YOLO-DC.git.

HFINet: Heteroscale feature integration network for real-time semantic segmentation
Jixiang Shi, Jin Liu, Wen Lu, Ruisen Liu, Jiajun Wang
Signal Processing: Image Communication, Volume 138, Article 117375. DOI: 10.1016/j.image.2025.117375. Published 2025-06-21.

Abstract: Effectively segmenting visual images from a semantic perspective remains an under-explored research issue. The absence of heteroscale recognition leads to persistent challenges in accurately delineating boundaries, particularly for small and slender objects next to larger ones. Existing semantic segmentation methods suffer from spatial resolution loss during downsampling, which smooths out high-frequency features and blurs object boundaries, resulting in the missegmentation of smaller objects. To address this, a novel boundary branch is proposed in our multilateral network. It incorporates spatial integration and channel significance to integrate heteroscale features, mitigating missegmentation and using a boundary loss to enhance the learning process, thereby improving the model's robustness in complex scenes. Additionally, an aggregation pyramid pooling module fuses contextual information from low-resolution feature maps to enlarge the receptive field, achieving greater semantic labeling accuracy. Experimental results for the proposed HFINet demonstrate that integrating boundary features significantly improves segmentation accuracy, particularly for precise object boundary delineation. This work offers a promising direction for enhancing the robustness of semantic segmentation models in challenging scenarios involving complex object boundaries.
