{"title":"X-CDNet: A real-time crosswalk detector based on YOLOX","authors":"Xingyuan Lu, Yanbing Xue, Zhigang Wang, Haixia Xu, Xianbin Wen","doi":"10.1016/j.jvcir.2024.104206","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104206","url":null,"abstract":"<div><p>As urban traffic safety becomes increasingly important, real-time crosswalk detection plays a critical role in the transportation field. However, existing crosswalk detection algorithms still fall short in both accuracy and speed. This study proposes a real-time crosswalk detector called X-CDNet based on YOLOX. Building on the ConvNeXt basic module, we designed a new basic module called <strong>Rep</strong>arameterizable <strong>S</strong>parse <strong>L</strong>arge-<strong>K</strong>ernel (RepSLK) convolution that expands the model’s receptive field without adding extra inference time. In addition, we created a new crosswalk dataset called CD9K, which is based on realistic driving scenes augmented by techniques such as synthetic rain and fog. The experimental results demonstrate that X-CDNet outperforms YOLOX in both detection accuracy and speed, achieving an AP50 of 93.3 and a real-time detection speed of 123 FPS.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141483922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shift-insensitive perceptual feature of quadratic sum of gradient magnitude and LoG signals for image quality assessment and image classification","authors":"Congmin Chen, Xuanqin Mou","doi":"10.1016/j.jvcir.2024.104215","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104215","url":null,"abstract":"<div><p>Most existing full-reference (FR) image quality assessment (IQA) models work on the premise that the two images are well registered. Shifting an image leads to an inaccurate evaluation of image quality, because small spatial shifts are far less noticeable to human observers than structural distortion. In this regard, we propose to study an IQA feature that is shift-insensitive with respect to the basic primitive structure of images, i.e., image edges. According to previous studies, the image gradient magnitude (GM) and the Laplacian of Gaussian (LoG) operator, which depict the edge profiles of natural images, are highly efficient structural features in IQA tasks. In this paper, we find that the quadratic sum of the normalized GM and LoG signals (QGL) has an excellent shift-insensitive property in representing image edges, after theoretically solving the selection of a ratio parameter that balances the GM and LoG signals. Based on the proposed QGL feature, two FR-IQA models can be built directly by measuring the similarity map with mean and standard deviation pooling strategies, named mQGL and sQGL, respectively. Experimental results show that the proposed sQGL and mQGL work robustly on four benchmark IQA databases, and QGL-based models remain highly insensitive to spatial translation and image rotation while judging image quality. In addition, we explore the feasibility of combining the QGL feature with deep neural networks, and verify that it helps promote image pattern recognition in texture classification tasks.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141483920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MCT-VHD: Multi-modal contrastive transformer for video highlight detection","authors":"Yinhui Jiang, Sihui Luo, Lijun Guo, Rong Zhang","doi":"10.1016/j.jvcir.2024.104162","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104162","url":null,"abstract":"<div><p>Autonomous highlight detection aims to identify the most captivating moments in a video, which is crucial for enhancing the efficiency of video editing and browsing on social media platforms. However, current efforts primarily focus on visual elements and often overlook other modalities, such as text information, that could provide valuable semantic signals. To overcome this limitation, we propose a Multi-modal Contrastive Transformer for Video Highlight Detection (MCT-VHD). This transformer-based network mainly utilizes video and audio modalities, along with auxiliary text features (when available), for video highlight detection. Specifically, we enhance the temporal connections within the video by integrating a convolution-based local enhancement module into the transformer blocks. Furthermore, we explore three multi-modal fusion strategies to improve highlight inference performance and employ a contrastive objective to facilitate interactions between different modalities. Comprehensive experiments conducted on three benchmark datasets validate the effectiveness of MCT-VHD, and our ablation studies provide valuable insights into its essential components.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140843865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reversible data hiding with automatic contrast enhancement for color images","authors":"Libo Han , Yanzhao Ren , Sha Tao , Xinfeng Zhang , Wanlin Gao","doi":"10.1016/j.jvcir.2024.104181","DOIUrl":"10.1016/j.jvcir.2024.104181","url":null,"abstract":"<div><p>Automatic contrast enhancement (ACE) is a technique that automatically enhances image contrast. Reversible data hiding (RDH) with ACE (ACERDH) can achieve ACE while hiding data. However, some methods that perform well on color images suffer from insufficient enhancement. Therefore, an ACERDH method based on enhancement of the R, G, B, and V channels is proposed. First, histogram shifting with contrast control is proposed to enhance the R, G, and B channels; it prevents contrast degradation and keeps histogram shifting from stopping prematurely. Then, the V channel is enhanced. Since some non-ACE RDH methods that enhance the V channel well offer only a low level of automation, histogram shifting with brightness control, which realizes ACE effectively, is proposed; it avoids over-enhancement by controlling the brightness. Experimental results verify that the proposed method achieves better image quality and embedding capacity than some state-of-the-art methods.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141055379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A self-supervised image aesthetic assessment combining masked image modeling and contrastive learning","authors":"Shuai Yang , Zibei Wang , Guangao Wang , Yongzhen Ke , Fan Qin , Jing Guo , Liming Chen","doi":"10.1016/j.jvcir.2024.104184","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104184","url":null,"abstract":"<div><p>Learning richer image features helps improve performance on the image aesthetic assessment task. Masked Image Modeling (MIM), implemented on the Vision Transformer (ViT), learns pixel-level features while reconstructing images. Contrastive learning pulls features of the same image together while pushing features of different images apart in the feature space, learning high-level semantic features. Since contrastive learning and MIM capture different levels of image features, combining these two methods can learn richer feature representations and thus promote aesthetic assessment performance. Therefore, we propose a pretext task that combines contrastive learning and MIM to learn richer image features. In this approach, the original image is randomly masked and reconstructed on the online network. The reconstructed and original images form a positive pair used to calculate the contrastive loss on the target network. In experiments on the AVA dataset, our method obtained better performance than the baseline.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141090670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-guided representation matching for unsupervised video anomaly detection","authors":"Yiran Tao , Yaosi Hu , Zhenzhong Chen","doi":"10.1016/j.jvcir.2024.104185","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104185","url":null,"abstract":"<div><p>Recent works on Video Anomaly Detection (VAD) have made advancements in the unsupervised setting, known as Unsupervised VAD (UVAD), which brings it closer to practical applications. Unlike the classic VAD task that requires a clean training set containing only normal events, UVAD aims to identify abnormal frames without any labeled normal/abnormal training data. Many existing UVAD methods employ handcrafted surrogate tasks, such as frame reconstruction, to address this challenge. However, we argue that these surrogate tasks are sub-optimal solutions, inconsistent with the essence of anomaly detection. In this paper, we propose a novel approach for UVAD that directly detects anomalies based on similarities between events in videos. Our method generates representations for events while simultaneously capturing prototypical normality patterns, and detects anomalies based on whether an event’s representation matches the captured patterns. The proposed model comprises a memory module to capture normality patterns, and a representation learning network to obtain representations matching the memory module for normal events. A pseudo-label generation module and an anomalous event generation module for negative learning are further designed to help the model work under the strictly unsupervised setting. Experimental results demonstrate that the proposed method outperforms existing UVAD methods and achieves competitive performance compared with classic VAD methods.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141095324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Few-shot defect classification via feature aggregation based on graph neural network","authors":"Pengcheng Zhang, Peixiao Zheng, Xin Guo, Enqing Chen","doi":"10.1016/j.jvcir.2024.104172","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104172","url":null,"abstract":"<div><p>The effectiveness of deep learning models is greatly dependent on the availability of a vast amount of labeled data. However, in the realm of surface defect classification, acquiring and annotating defect samples proves to be quite challenging. Consequently, accurately predicting defect types with only a limited number of labeled samples has emerged as a prominent research focus in recent years. Few-shot learning, which leverages a restricted sample set in the support set, can effectively predict the categories of unlabeled samples in the query set. This approach is particularly well-suited for defect classification scenarios. In this article, we propose a transductive few-shot surface defect classification method that uses both instance-level and distribution-level relations in each few-shot learning task. Furthermore, we calculate class center features in a transductive manner and incorporate them into the feature aggregation operation to rectify the positioning of edge samples in the mapping space. This adjustment minimizes the distance between samples of the same category, thereby mitigating the influence of unlabeled samples at category boundaries on classification accuracy. Experimental results on a public dataset show the outstanding performance of our proposed approach compared to state-of-the-art methods in few-shot learning settings. Our code is available at <span>https://github.com/Harry10459/CIDnet</span>.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140950969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FSRDiff: A fast diffusion-based super-resolution method using GAN","authors":"Ni Tang , Dongxiao Zhang , Juhao Gao , Yanyun Qu","doi":"10.1016/j.jvcir.2024.104164","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104164","url":null,"abstract":"<div><p>Single image super-resolution with diffusion probabilistic models (SRDiff) is a successful diffusion model for image super-resolution that produces high-quality images and is stable during training. However, due to its long sampling time, it is slower in the testing phase than other deep learning-based algorithms. Reducing the total number of diffusion steps can accelerate sampling, but it also causes the inverse diffusion process to deviate from the Gaussian distribution and exhibit a multimodal distribution, which violates the diffusion assumption and degrades the results. To overcome this limitation, we propose a fast SRDiff (FSRDiff) algorithm that integrates a generative adversarial network (GAN) with a diffusion model to speed up SRDiff. FSRDiff employs a conditional GAN to approximate the multimodal distribution in the inverse diffusion process of the diffusion model, thus enhancing sampling efficiency when the total number of diffusion steps is reduced. The experimental results show that FSRDiff is nearly 20 times faster than SRDiff in reconstruction while maintaining comparable performance on the DIV2K test set.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140824328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive HEVC video steganography based on PU partition modes","authors":"Shanshan Wang , Dawen Xu , Songhan He","doi":"10.1016/j.jvcir.2024.104176","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104176","url":null,"abstract":"<div><p>High Efficiency Video Coding (HEVC)-based steganography has gained attention as a prominent research focus. In particular, block structure-based HEVC video steganography has received increasing attention due to its commendable performance. However, current block structure-based steganography algorithms face challenges such as reduced coding efficiency and limited capacity. To avoid these problems, an adaptive video steganography algorithm based on the Prediction Unit (PU) partition mode in I-frames is proposed, developed through an analysis of the block division process and of the visual distortion resulting from modification of the PU partition mode in HEVC. The PU block structure is utilized as the steganographic cover, and the Rate Distortion Optimization (RDO) technique is introduced to establish an adaptive distortion function for Syndrome-trellis code (STC). Further comparison between the proposed method and state-of-the-art steganography algorithms confirms its advantages in embedding capacity, compression efficiency, visual quality, and resistance to video steganalysis.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140906073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"6-DoF grasp estimation method that fuses RGB-D data based on external attention","authors":"Haosong Ran , Diansheng Chen , Qinshu Chen , Yifei Li , Yazhe Luo , Xiaoyu Zhang , Jiting Li , Xiaochuan Zhang","doi":"10.1016/j.jvcir.2024.104173","DOIUrl":"10.1016/j.jvcir.2024.104173","url":null,"abstract":"<div><p>6-DoF grasp estimation methods based on point clouds have long been a challenge in robotics, because a single input modality limits the robot’s perception of real-world scenarios, thus reducing robustness. In this work, we propose a 6-DoF grasp pose estimation method based on RGB-D data, which leverages ResNet to extract color image features, utilizes the PointNet++ network to extract geometric features, and employs an external attention mechanism to fuse the two. Our method is an end-to-end design, and we validate its performance through benchmark tests on a large-scale dataset and evaluations in a simulated robot environment. Our method outperforms previous state-of-the-art methods on public datasets, achieving 47.75 mAP and 40.08 mAP for seen and unseen objects, respectively. We also test our grasp pose estimation method on multiple objects in a simulated robot environment, demonstrating that our approach exhibits higher grasp accuracy and robustness than previous methods.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141042765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}