{"title":"STSI: Efficiently Mine Spatio- Temporal Semantic Information between Different Multimodal for Video Captioning","authors":"Huiyu Xiong, Lanxiao Wang","doi":"10.1109/VCIP56404.2022.10008808","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008808","url":null,"abstract":"As one of the challenging tasks in computer vision, video captioning needs to use natural language to describe the content of video. Video contains complex information, such as semantic information, time information and so on. How to synthesize sentences effectively from rich and different kinds of information is very significant. The existing methods often cannot well integrate the multimodal feature to predict the association between different objects in video. In this paper, we improve the existing encoder-decoder structure and propose a network deeply mining the spatio-temporal correlation between multimodal features. Through the analysis of sentence components, we use spatio-temporal semantic information mining module to fuse the object, 2D and 3D features in both time and space. It is worth mentioning that the word output at the previous time is added as the prediction branch of auxiliary conjunctions. After that, a dynamic gumbel scorer is used to output caption sentences that are more consistent with the facts. The experimental results on two benchmark datasets show that our STSI is superior to the state-of-the-art methods while generating more reasonable and semantic-logical sentences.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123002008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"No-reference Stereoscopic Image Quality Assessment Based on Parallel Multi-scale Perception","authors":"Ziyi Zhang, Sumei Li","doi":"10.1109/VCIP56404.2022.10008875","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008875","url":null,"abstract":"With the rapid development of 3D technologies, effective no-reference stereoscopic image quality assessment (NR-SIQA) methods are in great demand. In this paper, we propose a parallel multi-scale feature extraction convolution neural network (CNN) model combined with novel binocular feature interaction consistent with human visual system (HVS). In order to simulate the characteristics of HVS sensing multi-scale information at the same time, parallel multi-scale feature extraction module (PMSFM) followed by compensation information is proposed. And modified convolutional block attention module (MCBAM) with less computational complexity is designed to generate visual attention maps for the multi-scale features extracted by the PMSFM. In addition, we employ cross-stacked strategy for multi-level binocular fusion maps and binocular disparity maps to simulate the hierarchical perception characteristics of HVS. Experimental results show that our method is superior to the state-of-the-art metrics and achieves an excellent performance.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131446027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Learning-based Approach for Martian Image Compression","authors":"Qing Ding, Mai Xu, Shengxi Li, Xin Deng, Qiu Shen, Xin Zou","doi":"10.1109/VCIP56404.2022.10008891","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008891","url":null,"abstract":"For the scientific exploration and research on Mars, it is an indispensable step to transmit high-quality Martian images from distant Mars to Earth. Image compression is the key technique given the extremely limited Mars-Earth bandwidth. Recently, deep learning has demonstrated remarkable performance in natural image compression, which provides a possibility for efficient Martian image compression. However, deep learning usually requires large training data. In this paper, we establish the first large-scale high-resolution Martian image compression (MIC) dataset. Through analyzing this dataset, we observe an important non-local self-similarity prior for Marian images. Benefiting from this prior, we propose a deep Martian image compression network with the non-local block to explore both local and non-local dependencies among Martian image patches. Experimental results verify the effectiveness of the proposed network in Martian image compression, which outperforms both the deep learning based compression methods and HEVC codec.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122093323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fast Motion Estimation Method With Hamming Distance for LiDAR Point Cloud Compression","authors":"Yuhao An, Yiting Shao, Ge Li, Wei Gao, Shan Liu","doi":"10.1109/VCIP56404.2022.10008842","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008842","url":null,"abstract":"With more three-dimensional space information, Light detection and ranging (LiDAR) point clouds, which are promising to play more roles in the future, have an urgent need to be efficiently compressed. There are lots of compression methods based on spatial correlations, whereas few studies consider exploiting temporal correlations. In this paper, we propose a different perspective for the motion estimation. In most previous works, geometric distance between matching points was used as the criterion, which has an expensive computational cost and is not accurate. We first propose the Hamming distance between the octree's nodes, instead of the geometric distance between per point which is a more direct criterion. We have implemented our method in the MPEG (Moving Picture Expert Group) Geometry-based PCC (Point Cloud Compression) inter-exploration (G-PCC Inter-EM). Experimental results show our method can provide the average 3.5 % bitrate savings and 92.5 % encoding speed increase in lossless geometric coding, compared to the G-PCC Inter-EM.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115316715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frequency-aware Learned Image Compression for Quality Scalability","authors":"Hyomin Choi, Fabien Racapé, Shahab Hamidi-Rad, Mateen Ulhaq, Simon Feltman","doi":"10.1109/VCIP56404.2022.10008818","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008818","url":null,"abstract":"Spatial frequency analysis and transforms serve a central role in most engineered image and video lossy codecs, but are rarely employed in neural network (NN)-based approaches. We propose a novel NN-based image coding framework that utilizes forward wavelet transforms to decompose the input signal by spatial frequency. Our encoder generates separate bitstreams for each latent representation of low and high frequencies. This enables our decoder to selectively decode bitstreams in a quality-scalable manner. Hence, the decoder can produce an enhanced image by using an enhancement bitstream in addition to the base bitstream. Furthermore, our method is able to enhance only a specific region of interest (ROI) by using a corresponding part of the enhancement latent representation. Our experiments demonstrate that the proposed method shows competitive rate-distortion performance compared to several non-scalable image codecs. We also showcase the effectiveness of our two-level quality scalability, as well as its practicality in ROI quality enhancement.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124901426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CFNet: A Coarse-to-Fine Network for Few Shot Semantic Segmentation","authors":"Jiade Liu, Cheolkon Jung","doi":"10.1109/VCIP56404.2022.10008845","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008845","url":null,"abstract":"Since a huge amount of datasets is required for semantic segmentation, few shot semantic segmentation has attracted more and more attention of researchers. It aims to achieve semantic segmentation for unknown categories from only a small number of annotated training samples. Existing models for few shot semantic segmentation directly generate segmentation results and concentrate on learning the relationship between pixels, thus ignoring the spatial structure of features and decreasing the model learning ability. In this paper, we propose a coarse-to-fine network for few shot semantic segmentation, named CFNet. Firstly, we design a region selection module based on prototype learning to select the approximate region corresponding to the unknown category of the query image. Secondly, we elaborately combine the attention mechanism with the convolution module to learn the spatial structure of features and optimize the selected region. For the attention mechanism, we combine channel attention with self-attention to enhance the model ability of exploring the spatial structure of features and the pixel-wise relationship between support and query images. Experimental results show that CFNet achieves 65.2% and 70.1% in mean-IoU (mIoU) on PASCAL-5i for 1-shot and 5-shot settings, respectively, and outperforms state-of-the-art methods by 1.0%.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124444725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Gaussian Deep Denoiser Network using Multi-Scale Pixel Attention","authors":"Ramesh Kumar Thakur, S. K. Maji","doi":"10.1109/VCIP56404.2022.10008856","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008856","url":null,"abstract":"Many deep learning networks focus on the task of Gaussian denoising by processing images on a fixed scale or multiple scales using convolution and deconvolution. In certain cases, excessive scaling applied in the network results in the loss of image details. Sometimes, the usage of deeper convolutional networks results in the loss of network gradient. In this paper, to overcome both the problems, we propose a multi-scale pixel attention-based blind Gaussian denoiser network that utilizes a combination of important features at five different scales. The proposed network performs blind Gaussian denoising in the sense that it does not need any prior information about noise. It comprises a central multi-scale pixel attention block together with dilated convolutional layers and skip connections that help in utilizing the full receptive field of the first convolutional layer to the last convolutional layer and is based on residual architecture for propagating high-level information easily in the network. We have provided the code of the proposed technique at https://github.com/RTSIR/MSPABDN.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131564397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"No Reference Stereoscopic Video Quality Assessment based on Human Vision System","authors":"Xiaofang Zhang, Sumei Li","doi":"10.1109/VCIP56404.2022.10008866","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008866","url":null,"abstract":"In this paper, we propose a no-reference stereoscopic video quality assessment (NR-SVQA) based on human vision system (HVS). Firstly, we build a frequency transform module (FTM), which maps spatial domain to frequency domain by cosine discrete transform (DCT), and selects important frequency components through channel attention mechanism. Secondly, we use dynamic convolution to regionally process the same input. Thirdly, we use convolutional long short term memory (Conv-LSTM) to extract spatio-temporal information rather than just temporal information. Finally, in order to better simulate the visual characteristics of human eyes, we build a optic chiasm module. The experiment results show that our method outperforms any other methods.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131028001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Gaussian Distribution Based Distortion Model for the H.266/VVC Video Coder","authors":"Hongkui Wang, Junhui Liang, Li Yu, Y. Gu, Haibing Yin","doi":"10.1109/VCIP56404.2022.10008905","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008905","url":null,"abstract":"In versatile video coding (VVC), superior coding performance is achieved with incorporating many advanced coding tools. In this paper, a frame-level coding distortion model is proposed for VVC video coders for the first time. In comparison with the transform coefficient distribution (TCD) of High Effective Video Coding (HEVC), the TCD of VVC has a sharper peak. According to this observation, the TCDs of I, B and P frames are modeled by the probability density function (PDF) of generalized Gaussian distribution (GGD) with three fixed shape parameters. The GGD-based distortion model is then derived with a sliding window-based strategy, i.e., the frame-level coding distortion is formulated as the function of the distribution parameter of frame-level TCD and the quantization step. The experimental results show that the proposed model achieves accurate results of distortion estimation for VVC coders.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115410518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CdCLR: Clip-Driven Contrastive Learning for Skeleton-Based Action Recognition","authors":"Rong Gao, Xin Liu, Jingyu Yang, Huanjing Yue","doi":"10.1109/VCIP56404.2022.10008837","DOIUrl":"https://doi.org/10.1109/VCIP56404.2022.10008837","url":null,"abstract":"In this study, we propose a Clip-Driven Contrastive Learning for Skeleton-Based Action Recognition (CdCLR). In-stead of considering sequences as instances, CdCLR extracts clips from the sequences as new instances. Aim to implement inherent supervision-guided contrastive learning through joint optimal training of sequences discrimination, clips discrimination, and order verification. Mining abundant positive/negative pairs inside sequence while learning inter-and intra-sequence semantic repre-sentations. Extensive experiments on the NTU RGB+D 60, UCLA and iMiGUE datasets present that CdCLR exhibits superior performance under various evaluation protocols and reaches state-of-the-art. Our code is available at https://github.com/Erich-G/CdCLRI.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"60 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114025618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}