Audition for multimedia computing

G. Friedland, P. Smaragdis, Josh H. McDermott, B. Raj
{"title":"Audition for multimedia computing","authors":"G. Friedland, P. Smaragdis, Josh H. McDermott, B. Raj","doi":"10.1145/3122865.3122868","DOIUrl":null,"url":null,"abstract":"What do the fields of robotics, human-computer interaction, AI, video retrieval, privacy, cybersecurity, Internet of Things, and big data all have in common? They all work with various sources of data: visual, textual, time stamps, links, records. But there is one source of data that has been almost completely ignored by the academic community---sound. \n \nOur comprehension of the world relies critically on audition---the ability to perceive and interpret the sounds we hear. Sound is ubiquitous, and is a unique source of information about our environment and the events occurring in it. Just by listening, we can determine whether our child's laughter originated inside or outside our house, how far away they were when they laughed, and whether the window through which the sound passed was open or shut. The ability to derive information about the world from sound is a core aspect of perceptual intelligence. \n \nAuditory inferences are often complex and sophisticated despite their routine occurrence. The number of possible inferences is typically not enumerable, and the final interpretation is not merely one of selection from a fixed set. And yet humans perform such inferences effortlessly, based only on sounds captured using two sensors, our ears. \n \nElectronic devices can also \"perceive\" sound. Every phone and tablet has at least one microphone, as do most cameras. Any device or space can be equipped with microphones at minimal expense. Indeed, machines can not only \"listen\"; they have potential advantages over humans as listening devices, in that they can communicate and coordinate their experiences in ways that biological systems simply cannot. Collections of devices that can sense sound and communicate with each other could instantiate a single electronic entity that far surpasses humans in its ability to record and process information from sound. \n \nAnd yet machines at present cannot truly hear. Apart from well-developed efforts to recover structure in speech and music, the state of the art in machine hearing is limited to relatively impoverished descriptions of recorded sounds: detecting occurrences of a limited pre-specified set of sound types, and their locations. Although researchers typically envision artificially intelligent agents such as robots to have human-like hearing abilities, at present the rich descriptions and inferences humans can make about sound are entirely beyond the capability of machine systems. \n \nIn this chapter, we suggest establishing the field of Computer Audition to develop the theory behind artificial systems that extract information from sound. Our objective is to enable computer systems to replicate and exceed human abilities. 
This chapter describes the challenges of this field.","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers of Multimedia Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3122865.3122868","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

What do the fields of robotics, human-computer interaction, AI, video retrieval, privacy, cybersecurity, Internet of Things, and big data all have in common? They all work with various sources of data: visual, textual, time stamps, links, records. But there is one source of data that has been almost completely ignored by the academic community---sound.

Our comprehension of the world relies critically on audition---the ability to perceive and interpret the sounds we hear. Sound is ubiquitous, and is a unique source of information about our environment and the events occurring in it. Just by listening, we can determine whether our child's laughter originated inside or outside our house, how far away they were when they laughed, and whether the window through which the sound passed was open or shut. The ability to derive information about the world from sound is a core aspect of perceptual intelligence.

Auditory inferences are often complex and sophisticated despite their routine occurrence. The number of possible inferences is typically not enumerable, and the final interpretation is not merely one of selection from a fixed set. And yet humans perform such inferences effortlessly, based only on sounds captured using two sensors, our ears.

Electronic devices can also "perceive" sound. Every phone and tablet has at least one microphone, as do most cameras. Any device or space can be equipped with microphones at minimal expense. Indeed, machines can not only "listen"; they have potential advantages over humans as listening devices, in that they can communicate and coordinate their experiences in ways that biological systems simply cannot. Collections of devices that can sense sound and communicate with each other could instantiate a single electronic entity that far surpasses humans in its ability to record and process information from sound.

And yet machines at present cannot truly hear. Apart from well-developed efforts to recover structure in speech and music, the state of the art in machine hearing is limited to relatively impoverished descriptions of recorded sounds: detecting occurrences of a limited, pre-specified set of sound types, and their locations. Although researchers typically envision artificially intelligent agents such as robots as having human-like hearing abilities, at present the rich descriptions and inferences humans can make about sound are entirely beyond the capability of machine systems.

In this chapter, we suggest establishing the field of Computer Audition to develop the theory behind artificial systems that extract information from sound. Our objective is to enable computer systems to replicate and exceed human abilities. This chapter describes the challenges of this field.
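To make the abstract's characterization of the current state of the art concrete, the sketch below shows what closed-set sound-type detection typically amounts to in practice. It is not taken from the chapter: the label set, file paths, and the choice of MFCC features with a support vector classifier are illustrative assumptions, using librosa and scikit-learn as stand-in tooling.

```python
# A minimal sketch of closed-set sound-type classification (an assumption, not
# the chapter's method): the system can only ever answer with one of the labels
# it was trained on, which is exactly the limitation the authors point out.
import numpy as np
import librosa                # audio loading and feature extraction
from sklearn.svm import SVC   # simple off-the-shelf classifier

# Fixed, pre-specified vocabulary of sound types (hypothetical labels).
SOUND_TYPES = ["dog_bark", "car_horn", "glass_break"]

def clip_features(path: str) -> np.ndarray:
    """Summarize one audio clip as the mean of its MFCC frames."""
    signal, rate = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=20)
    return mfcc.mean(axis=1)

def train_detector(paths: list[str], labels: list[str]) -> SVC:
    """Fit a closed-set classifier over the pre-specified sound types."""
    features = np.stack([clip_features(p) for p in paths])
    clf = SVC(kernel="rbf")
    clf.fit(features, labels)
    return clf

def detect(clf: SVC, path: str) -> str:
    """Return one of SOUND_TYPES for a new clip; nothing outside that set."""
    return clf.predict(clip_features(path)[None, :])[0]
```

The design makes the contrast with human audition explicit: every answer is a selection from a fixed set, whereas the inferences the abstract describes (distance, indoors versus outdoors, an open or shut window) are open-ended and not enumerable in advance.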