VR+HD: Video Semantic Reconstruction From Spatio-Temporal Scene Graphs

IF 8.7 · Zone 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal of Selected Topics in Signal Processing Pub Date : 2023-09-01 DOI:10.1109/JSTSP.2023.3323654

Chenxing Li;Yiping Duan;Qiyuan Du;Shiqi Sun;Xin Deng;Xiaoming Tao

Volume 17, Issue 5, pp. 935-948. Publication type: Journal Article. https://ieeexplore.ieee.org/document/10278415/

Citations: 0

Abstract

With the development of computer science and deep learning networks, AI generation technology is becoming increasingly mature. Video has become one of the most important information carriers in daily life because of the large amount of data and information it carries. However, because of this large amount of information and its complex semantics, video generation, especially High Definition (HD) video generation, has been a difficult problem in the field of deep learning. Video semantic representation and semantic reconstruction are difficult tasks. Because video content is changeable and its information is highly correlated, we propose an HD video generation model driven by spatio-temporal scene graphs: the spatio-temporal scene graph to video (StSg2vid) model. First, we take a sequence of spatio-temporal scene graphs as input; it serves as the semantic representation of the information in each frame of the video. The scene graph that describes each frame contains the motion progress of the objects in the video at that moment, which acts as a clock. Next, a graph convolutional neural network propagates the relationship information between objects in the spatio-temporal scene graph and predicts the scene layout at that moment. Lastly, the image generation model predicts the frame image of the current moment. The frame at each moment depends on the scene layout at the current moment and on the frame and scene layout at the previous moment. We introduce a flow network, a warping prediction model, and a spatially-adaptive normalization (SPADE) network to generate each predicted frame. We use the Action Genome dataset. Compared with current state-of-the-art algorithms, the videos generated by our model achieve better results in both quantitative metrics and user evaluations. In addition, we generalize the StSg2vid model to virtual reality (VR) videos of indoor scenes as a preliminary exploration of VR video generation, and achieve good results.
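To make the scene-graph step in the abstract more concrete, the following is a minimal sketch of how one frame's scene graph, tagged with a motion-progress "clock" value, might pass relationship information between objects. It is not the StSg2vid implementation: the names `Triple`, `frame_features`, and `message_passing_step` are hypothetical, the one-hot object features are placeholders, and an unweighted mean over neighbors stands in for the learned graph convolution described in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triple:
    """One scene-graph edge: (subject index, predicate, object index). Hypothetical type."""
    subject: int
    predicate: str
    obj: int


def frame_features(num_objects: int, progress: float) -> List[List[float]]:
    """Placeholder node features: a one-hot object id plus the frame's
    motion-progress scalar (the 'clock' from the abstract, assumed in [0, 1])."""
    feats = []
    for i in range(num_objects):
        one_hot = [1.0 if j == i else 0.0 for j in range(num_objects)]
        feats.append(one_hot + [progress])
    return feats


def message_passing_step(features: List[List[float]],
                         triples: List[Triple]) -> List[List[float]]:
    """One round of mean aggregation along scene-graph edges.

    Each node averages its own feature with those of every node it shares
    a triple with. The predicate label is ignored here; a learned graph
    convolution would use it and apply trainable weights.
    """
    dim = len(features[0])
    sums = [list(f) for f in features]  # copy, so the input is not mutated
    counts = [1] * len(features)        # each node counts itself once
    for t in triples:
        # Messages flow in both directions along each edge.
        for src, dst in ((t.subject, t.obj), (t.obj, t.subject)):
            for k in range(dim):
                sums[dst][k] += features[src][k]
            counts[dst] += 1
    return [[v / c for v in row] for row, c in zip(sums, counts)]


if __name__ == "__main__":
    # Frame at 25% motion progress: person (0) holding cup (1), cup on table (2).
    feats = frame_features(3, progress=0.25)
    graph = [Triple(0, "holding", 1), Triple(1, "on", 2)]
    for row in message_passing_step(feats, graph):
        print(row)
```

In a full pipeline, features like these would feed a layout predictor for the frame; that stage, along with the flow, warping, and SPADE components, is outside the scope of this sketch.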


Source journal

IEEE Journal of Selected Topics in Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

19.00

Self-citation rate

1.30%

Articles published

135

Review time

3 months

About the journal: The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others. The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.