Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices

Rajat Hebbar, Pavlos Papadopoulos, Ramon Reyes, Alexander F. Danvers, Angelina J. Polsinelli, Suzanne A. Moseley, David A. Sbarra, Matthias R. Mehl, Shrikanth Narayanan

EURASIP Journal on Audio, Speech, and Music Processing, vol. 2021, no. 1, p. 7. Published 2021-01-01 (Epub 2021-02-03). DOI: 10.1186/s13636-020-00194-0

Citations: 9
Abstract
In recent years, machine learning techniques have produced state-of-the-art results in several audio-related tasks. The success of these approaches is largely due to access to large open-source datasets and increased computational resources. However, these methods often fail to generalize well to real-life tasks because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Interfering factors such as dynamically varying environmental conditions, including background speakers and TV or radio audio, make foreground speech detection a challenging task. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models using annotations available at a lower time resolution (coarse labels). We show how MIL can be applied to localize foreground speech in coarsely labeled audio, and we report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to densely distributed events such as those observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground detection.
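The abstract's core idea, pooling per-frame (instance-level) scores into a single score for a coarsely labeled segment (the bag), can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, score values, and pooling variants are illustrative assumptions chosen to show why mean-like pooling suits densely distributed events better than pure max pooling.

```python
import numpy as np

def pool_instance_scores(scores: np.ndarray, method: str = "max") -> float:
    """Pool per-instance foreground-speech scores into one bag-level score,
    as in multiple instance learning (MIL). `scores` holds one probability
    per short audio frame inside a coarsely labeled segment (hypothetical)."""
    if method == "max":
        # Max pooling: the bag is positive if its strongest instance is.
        # Sensitive to a single spurious high-scoring frame.
        return float(scores.max())
    if method == "mean":
        # Mean pooling: better suited to densely distributed events, where
        # many frames in a positive bag are themselves positive.
        return float(scores.mean())
    if method == "softmax":
        # Softmax-weighted pooling: a smooth compromise between max and mean.
        w = np.exp(scores) / np.exp(scores).sum()
        return float((w * scores).sum())
    raise ValueError(f"unknown pooling method: {method}")

# Hypothetical per-frame scores for one coarsely labeled segment.
frame_scores = np.array([0.1, 0.2, 0.9, 0.8, 0.15])
bag_max = pool_instance_scores(frame_scores, "max")    # 0.9
bag_mean = pool_instance_scores(frame_scores, "mean")  # 0.43
```

At inference time, the per-frame scores themselves give the instance-level localization, while the pooled score supports the bag-level decision against the coarse label.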
About the journal
The aim of the EURASIP Journal on Audio, Speech, and Music Processing is to bring together researchers, scientists, and engineers working on the theory and applications of processing various audio signals, with a specific focus on speech and music. The journal is an interdisciplinary venue for the dissemination of all basic and applied aspects of speech communication and audio processing.