Jointly Optimizing Sensing Pipelines for Multimodal Mixed Reality Interaction

Darshana Rathnayake, Ashen de Silva, Dasun Puwakdandawa, L. Meegahapola, Archan Misra, I. Perera
{"title":"Jointly Optimizing Sensing Pipelines for Multimodal Mixed Reality Interaction","authors":"Darshana Rathnayake, Ashen de Silva, Dasun Puwakdandawa, L. Meegahapola, Archan Misra, I. Perera","doi":"10.1109/MASS50613.2020.00046","DOIUrl":null,"url":null,"abstract":"Natural human interactions for Mixed Reality Applications are overwhelmingly multimodal: humans communicate intent and instructions via a combination of visual, aural and gestural cues. However, supporting low-latency and accurate comprehension of such multimodal instructions (MMI), on resource-constrained wearable devices, remains an open challenge, especially as the state-of-the-art comprehension techniques for each individual modality increasingly utilize complex Deep Neural Network models. We demonstrate the possibility of overcoming the core limitation of latency–vs.–accuracy tradeoff by exploiting cross-modal dependencies–i.e., by compensating for the inferior performance of one model with an increased accuracy of more complex model of a different modality. We present a sensor fusion architecture that performs MMI comprehension in a quasi-synchronous fashion, by fusing visual, speech and gestural input. The architecture is reconfigurable and supports dynamic modification of the complexity of the data processing pipeline for each individual modality in response to contextual changes. Using a representative “classroom” context and a set of four common interaction primitives, we then demonstrate how the choices between low and high complexity models for each individual modality are coupled. In particular, we show that (a) a judicious combination of low and high complexity models across modalities can offer a dramatic 3-fold decrease in comprehension latency together with an increase $\\sim$10-15% in accuracy, and (b) the right collective choice of models is context dependent, with the performance of some model combinations being significantly more sensitive to changes in scene context or choice of interaction.","PeriodicalId":105795,"journal":{"name":"2020 IEEE 17th International Conference on Mobile Ad Hoc and Sensor Systems (MASS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 17th International Conference on Mobile Ad Hoc and Sensor Systems (MASS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MASS50613.2020.00046","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Natural human interactions for Mixed Reality applications are overwhelmingly multimodal: humans communicate intent and instructions via a combination of visual, aural and gestural cues. However, supporting low-latency and accurate comprehension of such multimodal instructions (MMI) on resource-constrained wearable devices remains an open challenge, especially as the state-of-the-art comprehension techniques for each individual modality increasingly utilize complex Deep Neural Network models. We demonstrate the possibility of overcoming the core latency-vs.-accuracy tradeoff by exploiting cross-modal dependencies, i.e., by compensating for the inferior performance of one model with the increased accuracy of a more complex model of a different modality. We present a sensor fusion architecture that performs MMI comprehension in a quasi-synchronous fashion by fusing visual, speech and gestural input. The architecture is reconfigurable and supports dynamic modification of the complexity of the data processing pipeline for each individual modality in response to contextual changes. Using a representative "classroom" context and a set of four common interaction primitives, we then demonstrate how the choices between low- and high-complexity models for each individual modality are coupled. In particular, we show that (a) a judicious combination of low- and high-complexity models across modalities can offer a dramatic 3-fold decrease in comprehension latency together with a ~10-15% increase in accuracy, and (b) the right collective choice of models is context dependent, with the performance of some model combinations being significantly more sensitive to changes in scene context or choice of interaction.
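To make the abstract's idea of a reconfigurable, context-driven pipeline concrete, the sketch below shows one minimal way per-modality model selection and quasi-synchronous fusion could be structured. This is an illustrative assumption based only on the abstract, not the authors' implementation: the class names (ModelVariant, ReconfigurablePipeline), the context-score threshold, and the fusion step are all hypothetical.

```python
# Illustrative sketch (not the paper's code): pick a low- or high-complexity
# model per modality from a context signal, then fuse the latest buffered
# inputs quasi-synchronously. All names and thresholds are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ModelVariant:
    name: str
    latency_ms: float                # expected inference latency of this variant
    infer: Callable[[object], str]   # raw modality input -> predicted label

class ReconfigurablePipeline:
    def __init__(self, variants: Dict[str, Dict[str, ModelVariant]]):
        # e.g. variants["vision"]["low"], variants["speech"]["high"], ...
        self.variants = variants
        self.active = {m: "low" for m in variants}   # start cheap everywhere

    def reconfigure(self, context_scores: Dict[str, float], threshold: float = 0.8):
        # If the cheap model is expected to do well in the current context
        # (score >= threshold), keep it; otherwise switch that modality to
        # its high-complexity variant. Other modalities can stay cheap.
        for modality, score in context_scores.items():
            self.active[modality] = "low" if score >= threshold else "high"

    def comprehend(self, inputs: Dict[str, object]) -> Dict[str, str]:
        # Quasi-synchronous fusion: run the currently selected model for each
        # modality on its latest buffered input and combine the labels.
        return {
            modality: self.variants[modality][self.active[modality]].infer(x)
            for modality, x in inputs.items()
        }
```

The design choice this sketch tries to capture is the coupling the abstract describes: model complexity need not be raised uniformly across modalities. A context monitor can keep cheap models wherever the scene permits and spend the latency budget only on the modality whose accuracy currently limits overall comprehension.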