VR+HD: Video Semantic Reconstruction From Spatio-Temporal Scene Graphs

IF 8.7 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC
Chenxing Li;Yiping Duan;Qiyuan Du;Shiqi Sun;Xin Deng;Xiaoming Tao
DOI: 10.1109/JSTSP.2023.3323654
Journal: IEEE Journal of Selected Topics in Signal Processing, vol. 17, no. 5, pp. 935-948
Publication date: 2023-09-01
Publication type: Journal Article
Open access: no
Source: https://ieeexplore.ieee.org/document/10278415/
Citations: 0

Abstract

With advances in computer science and deep learning, AI generation technology is becoming increasingly mature. Video has become one of the most important information carriers in daily life because it conveys a large amount of data and information. However, that same information volume, together with complex semantics, has made video generation, especially High Definition (HD) video generation, a difficult problem in deep learning, and video semantic representation and semantic reconstruction remain hard tasks. Because video content is changeable and its information is highly correlated, we propose an HD video generation model driven by a spatio-temporal scene graph: the spatio-temporal scene graph to video (StSg2vid) model. First, the model takes a sequence of spatio-temporal scene graphs as input, which serves as the semantic representation of each video frame. The scene graph for each frame also encodes the motion progress of the objects at that moment, which acts as a clock. Next, a graph convolutional neural network propagates relationship information between objects through the spatio-temporal scene graph and predicts the scene layout at that moment. Lastly, the image generation model predicts the frame at the current moment; each frame depends on the current scene layout and on the frame and scene layout of the previous moment. To generate each predicted frame, we introduce a flow network, a warping prediction model, and a spatially-adaptive normalization (SPADE) network. We evaluate on the Action Genome dataset. Compared with current state-of-the-art algorithms, the videos generated by our model achieve better results in both quantitative metrics and user evaluations. In addition, we generalize the StSg2vid model to virtual reality (VR) videos of indoor scenes as a preliminary exploration of VR video generation, and achieve good results.
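Two of the steps named in the abstract can be illustrated with a minimal NumPy sketch: passing relationship information between objects along scene-graph triples, and reusing the previous frame by warping it with a flow field. This is not the authors' implementation. The function names, the single linear layer with average pooling, and the nearest-neighbour sampling are all assumptions chosen for brevity; the abstract does not specify any of them.

```python
import numpy as np


def graph_conv_step(obj_vecs, pred_vecs, triples, weight):
    """One round of message passing over scene-graph triples (illustrative only).

    obj_vecs:  (num_objects, d) object embeddings
    pred_vecs: (num_triples, d) predicate embeddings, one per triple
    triples:   list of (subject_index, object_index) pairs, one per triple
    weight:    (3 * d, 2 * d) hypothetical linear layer
    """
    num_objects, d = obj_vecs.shape
    pooled = np.zeros((num_objects, d))
    counts = np.zeros(num_objects)
    for k, (s, o) in enumerate(triples):
        # Concatenate subject, predicate, and object, then apply linear + ReLU.
        joint = np.concatenate([obj_vecs[s], pred_vecs[k], obj_vecs[o]])
        out = np.maximum(joint @ weight, 0.0)
        # First half is a candidate update for the subject, second half for the object.
        pooled[s] += out[:d]
        pooled[o] += out[d:]
        counts[s] += 1
        counts[o] += 1
    # Average the candidates; objects that appear in no triple keep their old vector.
    has_edges = counts > 0
    new_vecs = obj_vecs.astype(float)  # astype returns a copy
    new_vecs[has_edges] = pooled[has_edges] / counts[has_edges, None]
    return new_vecs


def warp_previous_frame(prev_frame, flow):
    """Backward-warp prev_frame (H, W, C) with a flow field (H, W, 2).

    flow[y, x] = (dx, dy) says that output pixel (y, x) is read from
    prev_frame at (y + dy, x + dx). Sampling is nearest-neighbour and
    coordinates outside the image are clamped to the border.
    """
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_frame[src_y, src_x]
```

In a full pipeline of the kind the abstract describes, the updated object vectors would feed a layout predictor, and the warped previous frame would be combined with a newly synthesized image conditioned on the current layout. Both of those components are learned networks and are left out of this sketch.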
Source journal
IEEE Journal of Selected Topics in Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 19.00
Self-citation rate: 1.30%
Articles published: 135
Review time: 3 months
About the journal: The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others. The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.