{"title":"LWU-YOLO: A lightweight algorithm for small object detection in UAV applications","authors":"Yapeng Li , Ting Wang , Tao Li , Xin Yang","doi":"10.1016/j.jvcir.2026.104791","DOIUrl":"10.1016/j.jvcir.2026.104791","url":null,"abstract":"<div><div>Since detecting small objects in UAV imagery is challenging due to complex backgrounds and limited pixels, this paper proposes a new lightweight model based on YOLOv8s called LWU-YOLO. Initially, a task-oriented head restructuring strategy is introduced to enhance detailed feature representation, while reducing model parameters. Subsequently, an efficient multi-scale downsampling feature fusion (MDFF) module is designed to minimize the information loss during the upsampling process. Moreover, a mixed local channel attention (MLCA) mechanism is integrated into the C2f module to improve focus on critical features. Additionally, a novel Inner-PIoUv2 loss function is devised for faster convergence and higher accuracy in small object regression. Finally, experiments on the VisDrone2019 dataset show that the LWU-YOLO increases mAP@50 and mAP@50:95 by 7.3% and 4.7%, respectively, while using 55.3% fewer parameters than YOLOv8s, demonstrating an excellent balance of performance and efficiency for UAV applications.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"117 ","pages":"Article 104791"},"PeriodicalIF":3.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147600945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GGCN: Gait Recognition with Generate Network and Convolutional Neural Network","authors":"Hao Qin , Zhenxue Chen , Qingqiang Guo , Q.M. Jonathan Wu , Mengxu Lu","doi":"10.1016/j.jvcir.2026.104790","DOIUrl":"10.1016/j.jvcir.2026.104790","url":null,"abstract":"<div><div>Gait recognition is a biometric technology with wide application prospects, but it is easily affected by various covariates, which requires the gait recognition model is robust. In this paper, we design a robust gait recognition model named GGCN (Gait recognition with Generate network and Convolutional neural Network), which uses multi-type gait sequences as input and eliminates the effects of various covariates through a supervised mapping module. The GGCN processes the gait sequence in three steps. First, the generate network is used to extract low-level features and remove the features generated by interference. Then, the low-level features are input into the encoder network to obtain high-level features. Finally, the high-level features are input into the feature mapping network to acquire more recognizable features. The experimental results on the CASIA-B, OULP, and OUMVLP datasets demonstrate that our model outperforms current state-of-the-art methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"117 ","pages":"Article 104790"},"PeriodicalIF":3.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147600949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond bounding boxes: Segmentation supervision for robust object detection in fisheye images","authors":"Arda Oztuner, Mehmet Kilicarslan","doi":"10.1016/j.jvcir.2026.104798","DOIUrl":"10.1016/j.jvcir.2026.104798","url":null,"abstract":"<div><div>Fisheye cameras pose significant object detection challenges due to severe radial distortion, rendering traditional axis-aligned bounding boxes suboptimal for warped object shapes. We propose a pipeline that transforms bounding box annotations into instance segmentation masks using the Segment Anything Model (SAM) and validate mask fidelity against expert ground truth in both rectilinear and distorted domains. We benchmark various models on the Fisheye8K dataset, demonstrating the architectural generalizability of our approach across YOLOv8, YOLOv11, and YOLOv12. Results show that segmentation-based supervision yields substantial performance gains, improving the mean average precision (mAP@[0.5:0.95]) by up to 10 absolute points over models trained with bounding boxes, and up to 12 points in distorted outer regions. Furthermore, our approach outperforms state-of-the-art methods and establishes a new benchmark for fisheye object detection. This work highlights the specific theoretical and empirical benefits of automated segmentation-based annotation within complex, distorted imaging domains.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"117 ","pages":"Article 104798"},"PeriodicalIF":3.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147656944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-view recursive gated convolutions for 3D object recognition and retrieval","authors":"Jiangzhong Cao, Yue Cai, Huan Zhang","doi":"10.1016/j.jvcir.2026.104792","DOIUrl":"10.1016/j.jvcir.2026.104792","url":null,"abstract":"<div><div>Multi-view-based 3D shape recognition methods perform 3D object recognition and retrieval by processing series of images from various angles to generate a compact 3D descriptor. However, existing approaches often focus on integrating information from multiple views without addressing spatial interactions and redundancy when similar views are used. To overcome these challenges, we propose a novel framework, Multi-view Recursive Gated Convolutions (MVRGC). Our method first extracts features from multiple views at different scales, allowing for initial interaction of information across these views. Recursive gated convolutions are then applied to capture deeper spatial reciprocity and fine-tune feature interactions among views. Additionally, a preferred view module is introduced to reduce view redundancy by favoring distinctive and representative views. This module selects a subset of views that best describe the object while minimizing overlap. Experimental results on shape benchmark datasets demonstrate that MVRGC outperforms existing methods in 3D object recognition and retrieval tasks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"117 ","pages":"Article 104792"},"PeriodicalIF":3.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147600946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Densely aggregated U-net with spatial-spectral interaction transformer for hyperspectral compressed imaging reconstruction","authors":"Yun-Hui Li","doi":"10.1016/j.jvcir.2026.104795","DOIUrl":"10.1016/j.jvcir.2026.104795","url":null,"abstract":"<div><div>Hyperspectral imaging offers critical spectral information for applications such as material analysis and camouflage recognition. However, the acquisition of hyperspectral data cubes is inherently constrained by the Nyquist sampling theorem. While compressed sensing theory enables snapshot imaging by compressing the data cube into a 2D measurement, the ill-posed reconstruction remains a significant challenge. Recent deep learning methods, particularly vision transformers, have advanced the state-of-the-art (SOTA). Despite this, existing networks typically employ spectral or spatial self-attentions in isolation, blindly pursuing a global receptive field at the cost of computational efficiency and representational flexibility. Additionally, the vanilla skip connection in U-Nets is insufficient for effective multi-scale information transmission between encoder and decoder. To address these issues, we propose a Densely aggregated U-Net with a Spatial-Spectral Interaction Transformer (DSST). DSST parallelizes patch-based spectral self-attention and window-based spatial self-attention, complemented by an interaction mechanism. Furthermore, it introduces a densely aggregated skip connection to collect multi-scale features and bridge the semantic gap. Experimental results on both simulated and real-world scenes demonstrate that DSST achieves competitive performance with lower computational and memory costs compared to other end-to-end networks. Moreover, it offers faster inference speeds than deep unfolding networks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"117 ","pages":"Article 104795"},"PeriodicalIF":3.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147600948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HierarchicalGeoCount: Hierarchical scale perception for zero-shot object counting in remote sensing","authors":"Binyuan Huang, Jiayi Wang, Zhenzhong Chen","doi":"10.1016/j.jvcir.2026.104793","DOIUrl":"10.1016/j.jvcir.2026.104793","url":null,"abstract":"<div><div>Recent advances in vision–language models have enabled zero-shot object counting with improved scalability. However, existing methods typically operate at fixed granularity and struggle with objects of varying scales, particularly in remote sensing imagery where scale variation is extreme. To address these challenges, we propose HierarchicalGeoCount, a zero-shot counting approach that leverages hierarchical scale perception for remote sensing imagery. The method consists of two components: a Hierarchical Scale Perception (HSP) module that estimates object scale distributions based on global context analysis and guides scale-aware image partitioning; and a RemotePrior-Guided Refinement (RGR) module that refines detection results using remote sensing-specific vision–language models and dense visual features. Experiments on NWPU-MOC and RSOC datasets demonstrate competitive zero-shot performance, showing the potential of scale-aware processing for zero-shot remote sensing object counting applications.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"117 ","pages":"Article 104793"},"PeriodicalIF":3.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147600947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stability optimization in action imitation for humanoid robot","authors":"Yi Lu, Shenghao Ren, Zhiyu Jin, Qiu Shen","doi":"10.1016/j.jvcir.2026.104738","DOIUrl":"10.1016/j.jvcir.2026.104738","url":null,"abstract":"<div><div>Imitation of human actions is crucial for humanoid robots to enhance motion capabilities and understand human action mechanisms. Current methods focus on capturing human motion and imposing these parameters to robots, but differences in size, structure, and mechanics often result in unstable and distorted robot actions. To address these issues, we propose improving the stability and adaptability of motion data and conducting motion retargeting across multiple spaces. Specifically, we utilize mode adaptive motion smoothing (MAMS) for lower and upper body joints, adapting to different support modes. To balance similarity and stability, we propose a multi-objective motion optimization (MOMO) model under kinematic stability constraints, which takes into account the robot’s stable trajectories and the fundamental poses of the human body. Experiments demonstrate that our approach enhances the reliability and stability of robot motion while maintaining a high degree of similarity to human movements, significantly advancing the field of humanoid robot imitation.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104738"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pedestrian trajectory prediction using multi-cue transformer","authors":"Yanlong Tian , Rui Zhai , Xiaoting Fan , Qi Xue , Zhong Zhang , Xinshan Zhu","doi":"10.1016/j.jvcir.2026.104723","DOIUrl":"10.1016/j.jvcir.2026.104723","url":null,"abstract":"<div><div>Pedestrian trajectory prediction is a challenging issue because the future trajectories are influenced by the surrounding environment and constrained by the common sense rules. The existing trajectory prediction methods typically consider one kind of cues, i.e., social-aware cue, environment-aware cue, and goal-conditioned cue to model the interactions with the trajectory information, which results in insufficient interactions. In this article, we propose an innovative Transformer network named Multi-cue Transformer (McTrans) aimed at pedestrian trajectory prediction, where we design the Hierarchical Cross-Attention (HCA) module to learn the goal–social–environment interactions between the trajectory information of pedestrians and three kinds of cues from the perspectives of temporal and spatial dependencies. Furthermore, in order to reasonably utilize the guidance of the goal information, we propose the Gradual Goal-guided Loss (GGLoss) which gradually increases the weights of the coordinate difference between the predicted goal and the ground-truth goal as the time steps increase. We conduct extensive experiments on three public datasets, i.e., SDD, inD, and ETH/UCY. The experimental results demonstrate that the proposed McTrans is superior to other state-of-the-art methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104723"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LaDeL: Lane detection via multimodal large language model with visual instruction tuning","authors":"Yun Zhang , Xin Cheng , Zhou Zhou , Jingmei Zhou , Tong Yang","doi":"10.1016/j.jvcir.2025.104704","DOIUrl":"10.1016/j.jvcir.2025.104704","url":null,"abstract":"<div><div>Lane detection plays a fundamental role in autonomous driving by providing geometric and semantic guidance for robust localization and planning. Empirical studies have shown that reliable lane perception can reduce vehicle localization error by up to 15% and improve trajectory stability by more than 10%, underscoring its critical importance in safety-critical navigation systems. Visual degradations such as occlusions, worn paint, and illumination shifts result in missing or ambiguous lane boundaries, reducing the reliability of appearance-only methods and motivating scene-aware reasoning. Inspired by the human ability to jointly interpret scene context and road structure, this work presents LaDeL (Lane Detection with Large Language Models), which, to our knowledge, is the first framework to leverage multimodal large language models for lane detection through visual-instruction reasoning. LaDeL reformulates lane perception as a multimodal question-answering task that performs lane localization, lane counting, and scene captioning in a unified manner. We introduce lane-specific tokens to enable precise numerical coordinate prediction and construct a diverse instruction-tuning corpus combining lane queries, lane-count prompts, and scene descriptions. Experiments demonstrate that LaDeL achieves state-of-the-art performance, including an F1-score of 82.35% on CULane and 98.23% on TuSimple, outperforming previous methods. Although LaDeL requires greater computational resources than conventional lane detection networks, it provides new insight into integrating geometric perception with high-level reasoning. Beyond lane detection, this formulation opens opportunities for language-guided perception and reasoning in autonomous driving, including road-scene analysis, interactive driving assistants, and language-aware perception.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104704"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic Response GAN (SR-GAN) for embroidery pattern generation","authors":"Shaofan Chen","doi":"10.1016/j.jvcir.2026.104707","DOIUrl":"10.1016/j.jvcir.2026.104707","url":null,"abstract":"<div><div>High-resolution, detail-rich image generation models are essential for text-driven embroidery pattern synthesis. In this paper, the Semantic Response Generative Adversarial Network (SR-GAN) is used for embroidery image synthesis. It generates higher-quality images and improves text-image alignment. The model integrates word-level text embeddings into the image latent space through a cross-attention mechanism and a confidence-aware fusion scheme. In this way, word-level semantic features are effectively injected into hidden image features. The Semantic Perception Module is also refined by replacing standard convolutions with depthwise separable convolutions, which reduces the number of model parameters. In addition, the Deep Attention Multimodal Similarity Model directly scores word-pixel correspondences to compute fine-grained matching loss. It injects embroidery-domain word embeddings into the text encoder for joint training and further tightens the alignment between generated images and text. Experimental results show that the proposed method achieves an FID of 13.84 and an IS of 5.51.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104707"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}