Deep Learning-based Extended Reality: Making Humans and Machines Speak the Same Visual Language

F. Pereira
Proceedings of the 1st Workshop on Interactive eXtended Reality, 2022-10-10
DOI: 10.1145/3552483.3555366
Citations: 0

Abstract

The key goal of Extended Reality (XR) is to offer human users immersive and interactive experiences, notably the sense of being in a virtual or augmented environment and interacting with virtual beings or objects. A fundamental element of this goal is the visual content: its realism, level of interactivity, and immersion. Recent advances in visual data acquisition and consumption have led to the emergence of so-called plenoptic visual models, in which light fields and point clouds play an increasingly important role, offering 6DoF experiences beyond the more common and limited 2D image- and video-based experiences. This increased immersion is critical for emerging applications and services, notably virtual and augmented reality, personal communications and meetings, education and medical applications, and virtual museum tours. For effective remote experiences across the globe, it is critical that all types of visual information are efficiently compressed to fit the available bandwidth resources. In this context, deep learning (DL)-based technologies have recently come to play a central role, already surpassing the compression performance of the best previous hand-crafted coding solutions. However, this breakthrough goes well beyond coding, since DL-based tools are nowadays also the most effective for computer vision tasks such as classification, recognition, detection, and segmentation. This double win opens the door, for the first time, to a common visual representation language associated with the novel DL-based latents/coefficients, which may simultaneously serve human and machine consumption. While humans will use the DL-based coded streams to decode immersive visual content, machines will use the very same streams for computer vision tasks, thus 'speaking' a common visual language.
This is not possible with conventional visual representations, where machine vision processors deal with decoded content, thus suffering from compression artifacts, often at the cost of additional complexity. This visual representation approach will offer a more powerful and immersive augmented Extended Reality in which humans and machines may participate more seamlessly, at lower complexity. In this context, the main objective of this keynote talk is to discuss this DL-based dual-consumption paradigm, how it is being fulfilled, and what its impacts are. Special attention will be dedicated to the ongoing standardization projects in this domain, notably in JPEG and MPEG.
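To make the dual-consumption paradigm concrete, the following is a minimal, hypothetical sketch (not the speaker's system): a single coded latent is produced once, then consumed twice — decoded to pixels for human viewing, and fed directly to a vision-task head for machines, with no pixel-domain decoding on the machine path. The "weights" here are random stand-ins for trained DL transforms; all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": an 8x8 grayscale patch, flattened to a 64-dim vector.
image = rng.random(64)

# Hypothetical learned transforms (random stand-ins for trained DL weights).
W_enc = rng.standard_normal((16, 64)) / 8.0  # encoder: pixels -> 16-dim latent
W_dec = np.linalg.pinv(W_enc)                # decoder: latent -> pixel reconstruction
W_cls = rng.standard_normal((3, 16))         # vision-task head operating on the latent

# One shared coded representation (stand-in for the transmitted bitstream).
latent = W_enc @ image

# Human path: decode the latent back to pixels for display.
reconstruction = W_dec @ latent

# Machine path: run the vision task directly on the same latent,
# skipping pixel-domain decoding and its compression-artifact penalty.
logits = W_cls @ latent
predicted_class = int(np.argmax(logits))
```

The point of the sketch is that `latent` is computed once and shared by both paths; in a real DL codec the encoder/decoder would be learned nonlinear networks and the latent would be quantized and entropy-coded before transmission.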