{"title":"Multi-task visual food recognition by integrating an ontology supported with LLM","authors":"Daniel Ponte , Eduardo Aguilar , Mireia Ribera , Petia Radeva","doi":"10.1016/j.jvcir.2025.104484","DOIUrl":"10.1016/j.jvcir.2025.104484","url":null,"abstract":"<div><div>Food image analysis is a crucial task with far-reaching implications across various domains, including culinary arts, nutrition, and food technology. This paper presents a novel approach to multi-task visual food analysis, using large language models to obtain recipes and support the creation of a comprehensive food ontology. The approach integrates the food ontology into an end-to-end model, with prior knowledge on the relationships of food concepts at different semantic levels, within a multi-task deep learning visual food analysis approach, to generate better and more consistent class predictions. Evaluated on two benchmark datasets, MAFood-121 and VireoFood-172, this method demonstrates its effectiveness in single-label food recognition and multi-label food group classification. The ontology enhances accuracy, consistency, and generalization by effectively transferring knowledge to the learning model. This study underscores the potential of ontology-based methods to address food image classification complexities, with implications for broad applications, including automated recipe generation and nutritional assessment.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104484"},"PeriodicalIF":2.6,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An unsupervised fine-tuning strategy for low-light image enhancement","authors":"Shaoping Xu , Qiyu Chen, Hanyang Hu, Liang Peng, Wuyong Tao","doi":"10.1016/j.jvcir.2025.104480","DOIUrl":"10.1016/j.jvcir.2025.104480","url":null,"abstract":"<div><div>The primary goal of low-light image enhancement (LLIE) algorithms is to improve the visibility of images taken in poor lighting conditions, thereby enhancing the performance of subsequent tasks. However, relying on a single LLIE algorithm often fails to consistently address aspects like color restoration, noise reduction, brightness adjustment, and detail preservation due to varying implementation strategies. To overcome this limitation, we propose an unsupervised fine-tuning strategy that integrates multiple LLIE methods for better and more comprehensive results. Our approach consists of two phases: in the preprocessing phase, we select two complementary LLIE algorithms, Retinexformer and RQ-LLIE, to process the input low-light image independently. The enhanced outputs are designated as preprocessed images. In the unsupervised fusion fine-tuning phase, a lightweight UNet network extracts features from these preprocessed images to produce a fused image, constrained by a hybrid loss function. This function ensures consistency in image content and adjusts quality based on color, spatial consistency, and exposure. We also employ an image quality screening mechanism to select the optimal final enhanced image from the iterative outputs. Extensive experiments on benchmark datasets confirm that our algorithm outperforms existing individual LLIE methods in both qualitative and quantitative evaluations. Moreover, our approach is highly extensible, allowing for the integration of future LLIE algorithms to achieve even better results.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104480"},"PeriodicalIF":2.6,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144134003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MDLPCC: Misalignment-aware dynamic LiDAR point cloud compression","authors":"Ao Luo , Linxin Song , Keisuke Nonaka , Jinming Liu , Kyohei Unno , Kohei Matsuzaki , Heming Sun , Jiro Katto","doi":"10.1016/j.jvcir.2025.104481","DOIUrl":"10.1016/j.jvcir.2025.104481","url":null,"abstract":"<div><div>LiDAR point cloud plays an important role in various real-world areas. It is usually generated as sequences by LiDAR on moving vehicles. Regarding the large data size of LiDAR point clouds, Dynamic Point Cloud Compression (DPCC) methods are developed to reduce transmission and storage data costs. However, most existing DPCC methods neglect the intrinsic misalignment in LiDAR point cloud sequences, limiting the rate–distortion (RD) performance. This paper proposes a Misalignment-aware Dynamic LiDAR Point Cloud Compression method (MDLPCC), which alleviates the misalignment problem in both macroscope and microscope. MDLPCC exploits a global transformation (GlobTrans) method to eliminate the macroscopic misalignment problem, which is the obvious gap between two continuous point cloud frames. MDLPCC also uses a spatial–temporal mixed structure to alleviate the microscopic misalignment, which still exists in the detailed parts of two point clouds after GlobTrans. The experiments on our MDLPCC show superior performance over existing point cloud compression methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104481"},"PeriodicalIF":2.6,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144134004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Developing lightweight object detection models for USV with enhanced maritime surface visible imaging","authors":"Longhui Niu, Yunsheng Fan, Ting Liu, Qi Han","doi":"10.1016/j.jvcir.2025.104477","DOIUrl":"10.1016/j.jvcir.2025.104477","url":null,"abstract":"<div><div>Maritime surface object detection is a key technology for the autonomous navigation of unmanned surface vehicles (USVs). However, Maritime surface object detectors often face challenges such as large parameter sizes, object size variations, and image degradation caused by complex sea environments, severely affecting the deployment and detection accuracy on USVs. To address these challenges, this paper proposes the LightV7-enhancer object detection framework. This framework is based on the CPA-Enhancer image enhancement module and an improved YOLOv7 detection module for joint optimal learning. First, a new lightweight backbone network, GhostOECANet, was designed based on Ghost modules and improved coordinate attention. Second, by integrating ELAN and Efficient Multi-scale attention, an ELAN-EMA module is constructed to enhance the network’s perception and multi-scale feature extraction capabilities. Additionally, to improve the detection accuracy of small objects, multi-scale object detection layers are added based on the YOLOv5 detection head. The paper also introduces CPA-Enhancer in conjunction with the improved YOLOv7 detection module for joint training to adaptively restore degraded Maritime surface images, thereby improving detection accuracy in complex maritime backgrounds. Finally, the SeaShips dataset and Singapore Maritime Dataset are used to evaluate and compare LightV7-enhancer with other mainstream detectors. The results show that LightV7-enhancer supports object detection in various degraded maritime scenarios, achieving a balance between accuracy and computational complexity compared to other mainstream models. Compared to the baseline YOLOv7, LightV7-enhancer improves mAP by 2.7% and 7.5% on the two datasets, respectively, and has only half the number of parameters of YOLOv7, demonstrating robustness in degraded sea surface scenarios.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104477"},"PeriodicalIF":2.6,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QLight-Net: Quaternion based low light image enhancement network","authors":"Sudeep Kumar Acharjee, Kavinder Singh, Anil Singh Parihar","doi":"10.1016/j.jvcir.2025.104478","DOIUrl":"10.1016/j.jvcir.2025.104478","url":null,"abstract":"<div><div>Images captured at night suffer from various degradations such as color distortion, low contrast, and noise. Many existing methods improve low-light images may sometimes amplify noise, cause color distortion, and lack finer details. The existing methods require larger number of parameters, which limits the adoption of these methods in vision-based applications. In this paper, we proposed a QLight-Net method to achieve a better enhancement with a comparably lower number of parameters. We proposed depth-wise quaternion convolution, and quaternion cross attention to develop the two-branch architecture for low-light image enhancement. The proposed model leverages gradient branch to extract color-aware gradient features. Further, It uses color branch to extract gradient-aware color features. The proposed method achieves an LPIPS score of 0.047, which surpasses the previous best results with lesser parameters, and achieves 0.88 and 29.05 scores of SSIM and PSNR, respectively. Our approach achieves a balance between computational efficiency and better enhancement.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104478"},"PeriodicalIF":2.6,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AttenScribble: Attention-enhanced scribble supervision for medical image segmentation","authors":"Mu Tian , Qinzhu Yang , Yi Gao","doi":"10.1016/j.jvcir.2025.104476","DOIUrl":"10.1016/j.jvcir.2025.104476","url":null,"abstract":"<div><div>The success of deep networks in medical image segmentation relies heavily on massive labeled training data. However, acquiring dense annotations is a time-consuming process. Weakly supervised methods normally employ less expensive forms of supervision, among which scribbles started to gain popularity lately thanks to their flexibility. However, due to the lack of shape and boundary information, it is extremely challenging to train a deep network on scribbles that generalize on unlabeled pixels. In this paper, we present a straightforward yet effective scribble-supervised learning framework. Inspired by recent advances in transformer-based segmentation, we create a pluggable spatial self-attention module that could be attached on top of any internal feature layers of arbitrary fully convolutional network (FCN) backbone. The module infuses global interaction while keeping the efficiency of convolutions. Descended from this module, we construct a similarity metric based on normalized and symmetrized attention. This attentive similarity leads to a novel regularization loss that imposes consistency between segmentation prediction and visual affinity. This attentive similarity loss optimizes the alignment of FCN encoders, attention mapping and model prediction. Ultimately, the proposed FCN+Attention architecture can be trained end-to-end guided by a combination of three learning objectives: partial segmentation loss, customized masked conditional random fields, and the proposed attentive similarity loss. Extensive experiments on public datasets (ACDC and CHAOS) showed that our framework not only outperforms existing state-of-the-art but also delivers close performance to fully-supervised benchmarks. The code is available at <span><span>https://github.com/YangQinzhu/AttenScribble.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104476"},"PeriodicalIF":2.6,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144147786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image steganography based on wavelet transform and Generative Adversarial Networks","authors":"Yan Zhao, Pei Yao, Liang Xue","doi":"10.1016/j.jvcir.2025.104474","DOIUrl":"10.1016/j.jvcir.2025.104474","url":null,"abstract":"<div><div>For most steganography based on GANs, repeated encoding and decoding operations can easily lead to information loss, making it hampers the generator’s ability to effectively capture essential image features. To address the limitations in the current work, we propose a new generator with U-Net architecture. Introducing the graph network part to process the information of graph structure, and introducing a feature transfer module designed to preserve and transfer critical feature information. In addition, a new generator loss structure is proposed, it contains three parts: the adversarial loss <span><math><msubsup><mrow><mi>l</mi></mrow><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></msubsup></math></span>, which significantly enhances resistance to detection, the entropy loss <span><math><msubsup><mrow><mi>l</mi></mrow><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msubsup></math></span>, which ensures the embedding capability of steganographic images, and the low-frequency wavelet loss <span><math><msub><mrow><mi>l</mi></mrow><mrow><mi>f</mi></mrow></msub></math></span>, which optimizes the overall steganographic performance of the images. Through a large number of experiments and comparisons, our proposed method effectively improves the steganography detection ability, and verifies the reasonableness of the proposed method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104474"},"PeriodicalIF":2.6,"publicationDate":"2025-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144114854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DII-FRSA: Diverse image inpainting with multi-scale feature representation and separable attention","authors":"Jixiang Cheng, Yuan Wu, Zhidan Li, Yiluo Zhang","doi":"10.1016/j.jvcir.2025.104472","DOIUrl":"10.1016/j.jvcir.2025.104472","url":null,"abstract":"<div><div>Diverse image inpainting is the process of generating multiple visually realistic completion results. Although previous methods in this area have seen success, they still exhibit some limitations. First, one-stage approaches must make a trade-off between diversity and consistency. Second, while two-stage approaches can overcome such problems, they require autoregressive models to estimate the probability distribution of the structural priors, which has a significant impact on inference speed. This paper introduces DII-FRSA, a method for diverse image inpainting utilizing multi-scale feature representation and separable attention. In the first stage, we build a Gaussian distribution from the dataset to sample multiple coarse results. To enhance the modeling capability of the Variational Auto-Encoder, we propose a multi-scale feature representation module for the encoder and decoder. In the second stage, the coarse results are refined while maintaining overall consistency of appearance. Additionally, we design a refinement network based on the proposed separable attention to further improve the quality of the coarse results and maintain consistency in the appearance of the visible and masked regions. Our method was tested on well-established datasets-Places2, CelebA-HQ, and Paris Street View, and outperformed modern techniques. Our network not only enhances the diversity of the completed results but also enhances their visual realism.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104472"},"PeriodicalIF":2.6,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144105599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D surface reconstruction with enhanced high-frequency details","authors":"Shikun Zhang , Yiqun Wang , Cunjian Chen , Yong Li , Qiuhong Ke","doi":"10.1016/j.jvcir.2025.104475","DOIUrl":"10.1016/j.jvcir.2025.104475","url":null,"abstract":"<div><div>Neural implicit 3D reconstruction can reproduce shapes without the need for 3D supervision, making it a significant advancement in computer vision and graphics. This technique leverages volume rendering methods and neural implicit representations to learn and reconstruct 3D scenes directly from 2D images, enabling the generation of complex geometries and detailed structures with minimal data. The field has gained significant traction in recent years, due to advancements in deep learning, 3D vision, and rendering techniques that allow for more efficient and realistic reconstructions. Current neural surface reconstruction methods tend to randomly sample the entire image, making it difficult to learn high-frequency details on the surface, and thus the reconstruction results tend to be too smooth. We designed a method, termed FreNeuS (Frequency-guided Neural Surface Reconstruction), which leverages high-frequency information to address the problem of insufficient surface detail. Specifically, FreNeuS uses pixel gradient changes to easily acquire high-frequency regions in an image and uses the obtained high-frequency information to guide surface detail reconstruction. High-frequency information is first used to guide the dynamic sampling of rays, applying different sampling strategies according to variations in high-frequency regions. To further enhance the focus on surface details, we have designed a high-frequency weighting method that constrains the representation of high-frequency details during the reconstruction process. Compared to the baseline method, Neus, our approach reduces the reconstruction error by 13% on the DTU dataset. Additionally, on the NeRF-synthetic dataset, our method demonstrates a significant advantage in visualization, producing clearer texture details. In addition, our method is more applicable and can be generalized to any reconstruction method based on NeuS.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104475"},"PeriodicalIF":2.6,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatio temporal 3D skeleton kinematic joint point classification model for human activity recognition","authors":"S. Karthika , Y. Nancy Jane , H. Khanna Nehemiah","doi":"10.1016/j.jvcir.2025.104471","DOIUrl":"10.1016/j.jvcir.2025.104471","url":null,"abstract":"<div><div>Human activity recognition in video data is challenging due to factors like cluttered backgrounds and complex movements. This work introduces the Stacked Ensemble 3D Skeletal Human Activity Recognition (SES-HAR) framework to tackle these issues. The framework utilizes MoveNet Lightning Pose Estimation to generate 2D skeletal kinematic joint points, which are then mapped to 3D using a Gaussian Radial Basis Function Kernel. SES-HAR employs a stacking ensemble approach with two layers: level-0 base learners and a level-1 meta-learner. Base learners include Convolutional Two-Part Long Short-Term Memory Network (Conv2P-LSTM), Spatial Bidirectional Gated Temporal Graph Convolutional Network (SBGTGCN) with attention, and Convolutional eXtreme Gradient Boosting (ConvXGB). Their outputs are pooled and processed by a Logistic Regression (LR) meta-learner in the level-1 layer to generate final predictions. Experimental results show that SES-HAR achieves significant performance improvements on NTU-RGB + D 60, NTU-RGB + D 120, Kinetics-700–2020, and Micro-Action-52 datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104471"},"PeriodicalIF":2.6,"publicationDate":"2025-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143947095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}