Why talk to people when you can talk to robots? Far-field speaker identification in the wild

2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN)
Pub Date: 2021-08-08
DOI: 10.1109/RO-MAN50785.2021.9515482

Galadrielle Humblot-Renaux, Chen Li, D. Chrysostomou

Pages: 272-278

Citations: 2

Abstract


Equipping robots with the ability to identify who is talking to them is an important step towards natural and effective verbal interaction. However, speaker identification for voice control remains largely unexplored compared to recent progress in natural language instruction and speech recognition. This motivates us to tackle text-independent speaker identification for human-robot interaction applications in industrial environments. By representing audio segments as time-frequency spectrograms, this can be formulated as an image classification task, allowing us to apply state-of-the-art convolutional neural network (CNN) architectures. To achieve robust prediction in unconstrained, challenging acoustic conditions, we take a data-driven approach and collect a custom dataset with a far-field microphone array, featuring over 3 hours of "in the wild" audio recordings from six speakers, which are then encoded into spectral images for CNN-based classification. We propose a shallow 3-layer CNN, which we compare with the widely used ResNet-18 architecture: in addition to benchmarking these models in terms of accuracy, we visualize the features used by these two models to discriminate between classes, and investigate their reliability in unseen acoustic scenes. Although ResNet-18 reaches the highest raw accuracy, we are able to achieve remarkable online speaker recognition performance with a much more lightweight model which learns lower-level vocal features and produces more reliable confidence scores. The proposed method is successfully integrated into a robotic dialogue system and showcased in a mock user localization and authentication scenario in a realistic industrial environment: https://youtu.be/IVtZ8LKJZ7A.
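
The abstract's central step, turning an audio segment into a time-frequency image that a CNN can classify, can be illustrated with a minimal sketch. The abstract does not state the authors' window type, frame length, hop size, or sample rate, so every parameter below (`frame_len`, `hop`, `sample_rate`, the Hann window) is an assumption chosen for illustration, and the function names are hypothetical. The input is a synthetic tone standing in for a recording, not data from the paper.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-10):
    """Short-time Fourier transform magnitudes in dB.

    Returns a list of frames (time axis); each frame is a list of
    log-magnitudes for frequency bins 0..frame_len // 2.
    All default parameters are illustrative assumptions.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    # Precompute the DFT basis once so every frame can reuse it.
    basis = [
        [cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len)]
        for k in range(n_bins)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Window the current frame to reduce spectral leakage.
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            coeff = sum(x * b for x, b in zip(chunk, basis[k]))
            row.append(20 * math.log10(abs(coeff) + eps))
        frames.append(row)
    return frames


# Synthetic stand-in for a recording: a pure tone centred on bin 8.
sample_rate = 16000  # assumed sample rate in Hz
frame_len = 64
tone_bin = 8
tone_hz = tone_bin * sample_rate / frame_len
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = log_spectrogram(signal, frame_len=frame_len, hop=32)
# Index of the strongest frequency bin in each time frame.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

The resulting `spec` is a two-dimensional array of time frames by frequency bins, which is the kind of "spectral image" an image classifier such as a shallow CNN or ResNet-18 could take as input. A practical pipeline would more likely use an FFT routine from a numerical library than this direct DFT, which is written out only to keep the sketch dependency-free.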
