Athanasios Katsamanis, Vassilis Pitsikalis, Stavros Theodorakis, P. Maragos
{"title":"Multimodal gesture recognition","authors":"Athanasios Katsamanis, Vassilis Pitsikalis, Stavros Theodorakis, P. Maragos","doi":"10.1145/3015783.3015796","DOIUrl":null,"url":null,"abstract":"Starting from the famous \"Put That There!\" demonstration prototype, developed by the Architecture Machine Group at MIT in the late 1970s, the growing potential of multimodal gesture interfaces in natural human-machine communication setups has stimulated people's imagination and motivated significant research efforts in the fields of computer vision, speech recognition, multimodal sensing, fusion, and human-computer interaction (HCI). In the words of Bolt [1980, p. 1]: \"Because voice can be augmented with simultaneous pointing, the free usage of pronouns becomes possible, with a corresponding gain in naturalness and economy of expression. Conversely, gesture aided by voice gains precision in its power to reference.\" \n \nMultimodal gesture recognition lies at the heart of such interfaces. As also defined in the Glossary, the term refers to the complex computational task comprising three main modules: (a) tracking of human movements, primarily of the hands and arms, and recognition of characteristic such motion patterns; (b) detection of accompanying speech activity and recognition of what is spoken; and (c) combination of the available audio-visual information streams to identify the multimodally communicated message. \n \nTo successfully perform such tasks, the original \"Put That There!\" system of Bolt [1980] imposed certain limitations on the interaction. Specifically, it required that the user be tethered by wearing a position sensing device on the wrist to capture gesturing and a headset microphone to record speech, and it allowed multimodal manipulation via speech and gestures of a small only set of shapes on a rather large screen (see also Figure 11.1). Since then, however, research efforts in the field of multimodal gesture recognition have moved beyond such limited scenarios, capturing and processing the multimodal data streams by employing distant audio and visual sensors that are unobtrusive to humans. In particular, in recent years, the introduction of affordable and compact multimodal sensors like the Microsoft Kinect has enabled robust capturing of human activity. This is due to the wealth of raw and metadata streams provided by the device, in addition to the traditional planar RGB video, such as depth scene information, multiple audio channels, and human skeleton and facial tracking, among others [Kinect 2016]. Such advancements have led to intensified efforts to integrate multimodal gesture interfaces in real-life applications. \n \nIndeed, the field of multimodal gesture recognition has been attracting increasing interest, being driven by novel HCI paradigms on a continuously expanding range of devices equipped with multimodal sensors and ever-increasing computational power, for example smartphones and smart television sets. Nevertheless, the capabilities of modern multimodal gesture systems remain limited. In particular, the set of gestures accounted for in typical setups is mostly constrained to pointing gestures, a number of emblematic ones like an open palm, and gestures corresponding to some sort of interaction with a physical object, e.g., pinching for zooming. At the same time, fusion with speech remains in most cases just an experimental feature. 
When compared to the abundance and variety of gestures and their interaction with speech in natural human communication, it clearly seems that there is still a long way to go for the corresponding HCI research and development [Kopp 2013]. \n \nMultimodal gesture recognition constitutes a wide multi-disciplinary field. This chapter makes an effort to provide a comprehensive overview of it, both in theoretical and application terms. More specifically, basic concepts related to gesturing, the multifaceted interplay of gestures and speech, and the importance of gestures in HCI are discussed in Section 11.2. An overview of the current trends in the field of multimodal gesture recognition is provided in Section 11.3, separately focusing on gestures, speech, and multimodal fusion. Furthermore, a state-of-the-art recognition setup developed by the authors is described in detail in Section 11.4, in order to facilitate a better understanding of all practical considerations involved in such a system. In closing, the future of multimodal gesture recognition and related challenges are discussed in Section 11.5. Finally, a set of Focus Questions to aid comprehension of the material is also provided.","PeriodicalId":222911,"journal":{"name":"The Handbook of Multimodal-Multisensor Interfaces, Volume 1","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Handbook of Multimodal-Multisensor Interfaces, Volume 1","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3015783.3015796","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9
Abstract
Starting from the famous "Put That There!" demonstration prototype, developed by the Architecture Machine Group at MIT in the late 1970s, the growing potential of multimodal gesture interfaces in natural human-machine communication setups has stimulated people's imagination and motivated significant research efforts in the fields of computer vision, speech recognition, multimodal sensing, fusion, and human-computer interaction (HCI). In the words of Bolt [1980, p. 1]: "Because voice can be augmented with simultaneous pointing, the free usage of pronouns becomes possible, with a corresponding gain in naturalness and economy of expression. Conversely, gesture aided by voice gains precision in its power to reference."
Multimodal gesture recognition lies at the heart of such interfaces. As also defined in the Glossary, the term refers to the complex computational task comprising three main modules: (a) tracking of human movements, primarily of the hands and arms, and recognition of such characteristic motion patterns; (b) detection of accompanying speech activity and recognition of what is spoken; and (c) combination of the available audio-visual information streams to identify the multimodally communicated message.
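To make the three-module decomposition above concrete, the following is a minimal, purely illustrative sketch of such a pipeline: a gesture module, a speech module, and a late-fusion step over a shared label set. The label names, score values, and fusion weight are hypothetical placeholders, not the system described in the chapter.

```python
# Illustrative sketch of (a) gesture recognition, (b) speech recognition,
# and (c) audio-visual fusion over a shared set of command labels.
from dataclasses import dataclass
from typing import Dict


@dataclass
class ModalityScores:
    """Per-class scores produced by one modality."""
    scores: Dict[str, float]


def recognize_gesture(skeleton_frames) -> ModalityScores:
    # (a) In a real system: track hands/arms and classify the motion pattern.
    # Here fixed scores stand in purely to illustrate the data flow.
    return ModalityScores({"POINT": 0.6, "OPEN_PALM": 0.3, "PINCH": 0.1})


def recognize_speech(audio_samples) -> ModalityScores:
    # (b) In a real system: detect speech activity, decode the utterance,
    # and map recognized keywords onto the same label set.
    return ModalityScores({"POINT": 0.2, "OPEN_PALM": 0.7, "PINCH": 0.1})


def fuse(gesture: ModalityScores, speech: ModalityScores, w: float = 0.5) -> str:
    # (c) Late fusion: weighted combination of the two score streams,
    # followed by a max decision over the shared labels.
    labels = gesture.scores.keys() & speech.scores.keys()
    fused = {l: w * gesture.scores[l] + (1 - w) * speech.scores[l] for l in labels}
    return max(fused, key=fused.get)


if __name__ == "__main__":
    decision = fuse(recognize_gesture(skeleton_frames=None),
                    recognize_speech(audio_samples=None))
    print("Fused decision:", decision)
```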
To successfully perform such tasks, the original "Put That There!" system of Bolt [1980] imposed certain limitations on the interaction. Specifically, it required that the user be tethered, wearing a position-sensing device on the wrist to capture gesturing and a headset microphone to record speech, and it allowed multimodal manipulation via speech and gestures of only a small set of shapes on a rather large screen (see also Figure 11.1). Since then, however, research efforts in the field of multimodal gesture recognition have moved beyond such limited scenarios, capturing and processing the multimodal data streams with distant audio and visual sensors that are unobtrusive to humans. In particular, in recent years, the introduction of affordable and compact multimodal sensors like the Microsoft Kinect has enabled robust capturing of human activity. This is due to the wealth of raw data and metadata streams the device provides in addition to traditional planar RGB video, such as scene depth information, multiple audio channels, and human skeleton and facial tracking, among others [Kinect 2016]. Such advancements have led to intensified efforts to integrate multimodal gesture interfaces into real-life applications.
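As a rough illustration of the kinds of synchronized streams such a sensor exposes per captured frame, consider the container sketched below. The array shapes, joint count, and field names are assumptions chosen for clarity, not the actual Kinect SDK API.

```python
# Hypothetical per-frame container for Kinect-like multimodal streams:
# planar RGB video, scene depth, multi-channel audio, skeleton and face tracking.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class MultimodalFrame:
    rgb: np.ndarray          # color video frame, e.g. (480, 640, 3) uint8
    depth: np.ndarray        # scene depth map, e.g. (480, 640) uint16 millimeters
    audio: np.ndarray        # multi-channel audio chunk, e.g. (4, 1024) float32
    skeleton: List[Tuple[float, float, float]]  # tracked 3D joint positions
    face_landmarks: List[Tuple[float, float]]   # tracked 2D facial points


def make_dummy_frame() -> MultimodalFrame:
    """Build a placeholder frame with zeroed streams (illustration only)."""
    return MultimodalFrame(
        rgb=np.zeros((480, 640, 3), dtype=np.uint8),
        depth=np.zeros((480, 640), dtype=np.uint16),
        audio=np.zeros((4, 1024), dtype=np.float32),
        skeleton=[(0.0, 0.0, 0.0)] * 25,
        face_landmarks=[(0.0, 0.0)] * 5,
    )
```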
Indeed, the field of multimodal gesture recognition has been attracting increasing interest, driven by novel HCI paradigms on a continuously expanding range of devices equipped with multimodal sensors and ever-increasing computational power, for example, smartphones and smart television sets. Nevertheless, the capabilities of modern multimodal gesture systems remain limited. In particular, the set of gestures accounted for in typical setups is mostly constrained to pointing gestures, a number of emblematic ones like an open palm, and gestures corresponding to some sort of interaction with a physical object, e.g., pinching for zooming. At the same time, fusion with speech remains in most cases just an experimental feature. When compared to the abundance and variety of gestures and their interaction with speech in natural human communication, it is clear that there is still a long way to go for the corresponding HCI research and development [Kopp 2013].
Multimodal gesture recognition constitutes a wide multi-disciplinary field. This chapter makes an effort to provide a comprehensive overview of it, both in theoretical and application terms. More specifically, basic concepts related to gesturing, the multifaceted interplay of gestures and speech, and the importance of gestures in HCI are discussed in Section 11.2. An overview of the current trends in the field of multimodal gesture recognition is provided in Section 11.3, separately focusing on gestures, speech, and multimodal fusion. Furthermore, a state-of-the-art recognition setup developed by the authors is described in detail in Section 11.4, in order to facilitate a better understanding of all practical considerations involved in such a system. In closing, the future of multimodal gesture recognition and related challenges are discussed in Section 11.5. Finally, a set of Focus Questions to aid comprehension of the material is also provided.