{"title":"Joint reference frame synthesis and post filter enhancement for Versatile Video Coding","authors":"Weijie Bao , Yuantong Zhang , Jianghao Jia , Zhenzhong Chen , Shan Liu","doi":"10.1016/j.jvcir.2025.104433","DOIUrl":"10.1016/j.jvcir.2025.104433","url":null,"abstract":"<div><div>This paper presents the joint reference frame synthesis (RFS) and post-processing filter enhancement (PFE) for Versatile Video Coding (VVC), aiming to explore the combination of different neural network-based video coding (NNVC) tools to better utilize the hierarchical bi-directional coding structure of VVC. Both RFS and PFE utilize the Space–Time Enhancement Network (STENet), which receives two input frames with artifacts and produces two enhanced frames with suppressed artifacts, along with an intermediate synthesized frame. STENet comprises two pipelines, the synthesis pipeline and the enhancement pipeline, tailored for different purposes. During RFS, two reconstructed frames are sent into STENet’s synthesis pipeline to synthesize a virtual reference frame, similar to the current to-be-coded frame. The synthesized frame serves as an additional reference frame inserted into the reference picture list (RPL). During PFE, two reconstructed frames are fed into STENet’s enhancement pipeline to alleviate their artifacts and distortions, resulting in enhanced frames with reduced artifacts and distortions. To reduce inference complexity, we propose joint inference of RFS and PFE (JISE), achieved through a single execution of STENet. Integrated into the VVC reference software VTM-15.0, RFS, PFE, and JISE are coordinated within a novel Space–Time Enhancement Window (STEW) under Random Access (RA) configuration. The proposed method could achieve –7.34%/–17.21%/–16.65% BD-rate (PSNR) on average for three components under RA configuration.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104433"},"PeriodicalIF":2.6,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143631730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-tiered Spatio-temporal Feature Extraction for Micro-expression Classification","authors":"Ankita Jain , Dhananjoy Bhakta , Prasenjit Dey","doi":"10.1016/j.jvcir.2025.104436","DOIUrl":"10.1016/j.jvcir.2025.104436","url":null,"abstract":"<div><div>This paper proposed a framework called DAuLiLSTM (<strong>DAu</strong>Vi + <strong>LiLSTM</strong>) for Micro-expression (ME) classification. It extracts spatio-temporal (ST) features through two novel components: dynamic image of augmented video (DAuVi) and Lightnet with LSTM (LiLSTM). The first component presents a unique strategy to generate multiple dynamic images of each original ME video that contain the relevant ST features. It proposes an algorithm that works as a sliding window and ensures the incorporation of the apex frame in each dynamic image. The second component further processes those images to extract additional ST features. The LiLSTM consists of two deep networks: Lightnet and LSTM. The Lightnet extracts the spatial information and LSTM learns the temporal sequences. A combination of both components extracts ST features sequentially twice and ensures that the model captures all ST features. We found that our model outperforms 14 state-of-the-art techniques in accuracy and F1-score on three ME datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104436"},"PeriodicalIF":2.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143637408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A robust and adaptive framework with space–time memory networks for Visual Object Tracking","authors":"Yu Zheng, Yong Liu, Xun Che","doi":"10.1016/j.jvcir.2025.104431","DOIUrl":"10.1016/j.jvcir.2025.104431","url":null,"abstract":"<div><div>These trackers based on the space–time memory network locate the target object in the search image employing contextual information from multiple memory frames and their corresponding foreground–background features. It is conceivable that these trackers are susceptible to the memory frame quality as well as the accuracy of the corresponding foreground labels. In the previous works, experienced methods are employed to obtain memory frames from historical frames, which hinders the improvement of generalization and performance. To address the above limitations, we propose a robust and adaptive extraction strategy for memory frames to ensure that the most representative historical frames are selected into the set of memory frames to increase the accuracy of localization and reduce failures due to error accumulation. Specifically, we propose an extraction network to evaluate historical frames, where historical frames with the highest score are labeled as the memory frame and conversely dropped. Qualitative and quantitative analyses were implemented on multiple datasets (OTB100, LaSOT and GOT-10K), and the proposed method obtains significant gain in performance over the previous works, especially for challenging scenarios. while bringing only a negligible inference speed degradation, otherwise, the proposed method obtains competitive results compared to other state-of-the-art (SOTA) methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104431"},"PeriodicalIF":2.6,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143619607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge-guided quantization-aware training for EEG-based emotion recognition","authors":"Sheng-hua Zhong , Jiahao Shi , Yi Wang","doi":"10.1016/j.jvcir.2025.104415","DOIUrl":"10.1016/j.jvcir.2025.104415","url":null,"abstract":"<div><div>Emotion recognition is of paramount importance in various domains. In recent years, the use of models that employ electroencephalogram data as input has seen substantial achievements. However, the increasing complexity of these EEG models presents substantial challenges that hinder their deployment in resource-limited environments. This situation emphasizes the critical need for effective model compression. However, extreme compression often leads to significant degradation in model performance. To address this issue, we propose a novel Knowledge-Guided Quantization-Aware Training method for EEG-based emotion recognition task. This method integrates knowledge from emotional neuroscience into the quantization process, emphasizing the importance of the prefrontal cortex part in the EEG sample selection process to construct the calibration set and successfully enhance the performance of Quantization-Aware Training techniques. Experimental results demonstrate that our proposed framework achieves quantization to 8 bits, which leads to surpassing SOTAs in EEG-based emotion recognition. The source code is made available at: <span><span>https://github.com/Stewen24/KGCC</span><svg><path></path></svg></span> .</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104415"},"PeriodicalIF":2.6,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143578396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DUWS Net: Wavelet-based dual U-shaped spatial-frequency fusion transformer network for medical image segmentation","authors":"Liang Zhu , Kuan Shen , Guangwen Wang , Yujie Hao , Lijun Zheng , Yanping Lu","doi":"10.1016/j.jvcir.2025.104428","DOIUrl":"10.1016/j.jvcir.2025.104428","url":null,"abstract":"<div><div>Medical image segmentation is crucial for disease monitoring, diagnosis, and treatment planning. However, due to the complexity of medical images and their rich frequency information, networks face challenges in segmenting regions of interest using single-domain information. This study proposes a wavelet-transform-based dual U-Net fusion Transformer network for medical image segmentation, aiming to address the shortcomings of current methods. The network supplements spatial information through an external U-Net encoder-decoder structure, enabling deeper extraction of spatial features from the images. The internal U-shaped structure utilizes wavelet transform to capture low-frequency and high-frequency components of feature maps, performing linear self-attention interactions between these frequencies. This allows the network to learn global structures from low frequencies while capturing detailed features from high frequencies. Finally, spatial and frequency domain features are fused through alternating weighting based on spatial and channel dimensions. Experimental results show that the proposed method outperforms traditional single-domain segmentation models.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104428"},"PeriodicalIF":2.6,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143550990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delicate image segmentation based on cosine kernel graph cut","authors":"Mehrnaz Niazi , Kambiz Rahbar , Fatemeh Taheri , Mansour Sheikhan , Maryam Khademi","doi":"10.1016/j.jvcir.2025.104430","DOIUrl":"10.1016/j.jvcir.2025.104430","url":null,"abstract":"<div><div>The kernel graph cut approach is effective but highly dependent on the choice of kernel used to map data into a new feature space. This study introduces an enhanced kernel-based graph cut method specifically designed for segmenting complex images. The proposed method extends the RBF kernel by incorporating a unique mapping function that includes two components from the MacLaurin cosine kernel series, known for its ability to decorrelate regions and compress energy. This enhanced feature space enables the objective function to include a data fidelity term, which preserves the standard deviation of each region’s data in the segmented image, along with a regularization term that maintains smooth boundaries. The proposed method retains the computational efficiency typical of graph-based techniques while enhancing segmentation accuracy for intricate images. Experimental evaluations on widely-used datasets with complex shapes and fine boundaries demonstrate the effectiveness of this kernel-based approach compared to existing methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104430"},"PeriodicalIF":2.6,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143534755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying usability assessment method for surveillance video anomaly detection with multiple distortion","authors":"Nengxin Li , Xichen Yang , Tianhai Chen , Tianshu Wang , Genlin Ji","doi":"10.1016/j.jvcir.2025.104417","DOIUrl":"10.1016/j.jvcir.2025.104417","url":null,"abstract":"<div><div>With the extensive deployment of surveillance cameras, video anomaly detection (VAD) is commonly employed to various practical scenarios such as subway stations, parks, and roads. However, the surveillance camera can be easily influenced by weather and hardware degradation during data collection, resulting in information loss. Insufficient information will lead to a decrease in accuracy and credibility for anomaly detection. Accurately measuring the impact of information loss on anomaly detection can be helpful in practical application, and provide reliable application scheme of surveillance data. Therefore, we construct a dataset which contains surveillance data with multiple distortions. Based on the dataset, sufficient reliable data can be provided to measure the impact of data quality for anomaly detection methods. On the basis of the impact of data quality on anomaly detection, thresholds have been designed for data screening to improve the performance of anomaly detection. Finally, an image usability assessment (IUA) method was proposed to accurately screen surveillance data via the designed thresholds. Experimental results demonstrate that the constructed dataset was reasonable and reliable. The proposed IUA method can accurately screen the data to improve the performance of VAD methods, and meet the requirements of practical application scenarios on surveillance data. The dataset has been open-sourced at <span><span>https://github.com/dart-into/MultipleDistortionDataset</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104417"},"PeriodicalIF":2.6,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143534756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-supervised monocular depth estimation with large kernel attention and dynamic scene perception","authors":"Xuezhi Xiang , Yao Wang , Xiaoheng Li , Lei Zhang , Xiantong Zhen","doi":"10.1016/j.jvcir.2025.104413","DOIUrl":"10.1016/j.jvcir.2025.104413","url":null,"abstract":"<div><div>Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformer to model long-distance dependencies to estimate depth accurately. However, Transformer treats 2D image features as 1D sequences, and positional encoding somewhat mitigates the loss of spatial information between different feature blocks, tending to overlook channel features, which limit the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network to get finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies without compromising the two-dimension structure of features while maintaining feature channel adaptivity. In addition, we introduce a dynamic scene perception (DSP) module, which dynamically adjusts the receptive fields to capture more accurate depth discontinuities context information, thereby enhancing the network’s ability to process complex scenes. Besides, we introduce an up-sampling module to accurately recover the fine details in the depth map. Our method achieves highly competitive results on the KITTI dataset (AbsRel = 0.095, SqRel = 0.613, RMSElog = 0.169, <span><math><mi>δ</mi></math></span>1 = 0.907), and shows great generalization performance on the challenging indoor dataset NYUv2 dataset.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104413"},"PeriodicalIF":2.6,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143520306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advancing white balance correction through deep feature statistics and feature distribution matching","authors":"Furkan Kınlı , Barış Özcan , Furkan Kıraç","doi":"10.1016/j.jvcir.2025.104412","DOIUrl":"10.1016/j.jvcir.2025.104412","url":null,"abstract":"<div><div>Auto-white balance (AWB) correction is a crucial process in digital imaging, ensuring accurate and consistent color correction across varying lighting conditions. This study presents an innovative AWB correction method that conceptualizes lighting conditions as the style factor, allowing for more adaptable and precise color correction. Previous studies predominantly relied on Gaussian distribution assumptions for feature distribution alignment, which can limit the ability to fully exploit the style information as a modifying factor. To address this limitation, we propose a U-shaped Transformer-based architecture, where the learning objective of style factor enforces matching deep feature statistics using the Exact Feature Distribution Matching algorithm. Our proposed method consistently outperforms existing AWB correction techniques, as evidenced by both extensive quantitative and qualitative analyses conducted on the Cube+ and a synthetic mixed-illuminant dataset. Furthermore, a systematic component-wise analysis provides deeper insights into the contributions of each element, further validating the robustness of the proposed approach.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104412"},"PeriodicalIF":2.6,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143520307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SiamTP: A Transformer tracker based on target perception","authors":"Ying Ren , Zhenhai Wang , YiJun Jing , Hui Chen , Lutao Yuan , Hongyu Tian , Xing Wang","doi":"10.1016/j.jvcir.2025.104426","DOIUrl":"10.1016/j.jvcir.2025.104426","url":null,"abstract":"<div><div>Previous trackers based on Siamese network and transformer do not interact with the feature extraction stage during the feature fusion, excessive weight of the target features in the template area when the target deformation is large during feature fusion, causing target loss. This paper proposes a target tracking framework with target perception based on Siamese network and transformer. First, feature extraction was performed on the template area and search area and the extracted features were enhanced. A concatenation operation is used to combine them. Second, we used the feature perception obtained during the final stage of attention enhancement by searching for images to rank them and extracted the features with higher scores to enhance the feature fusion effect. Experimental results showed that the proposed tracker achieves good results on four common and challenging datasets while running at real-time speed with a speed of approximately 50 fps on a GPU.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104426"},"PeriodicalIF":2.6,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143550992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}