Speaker Recognition Using Convolutional Neural Network with Minimal Training Data for Smart Home Solutions

2018 11th International Conference on Human System Interaction (HSI). Pub Date: 2018-07-01. DOI: 10.1109/HSI.2018.8431363

Mingshan Wang, Tejaswini Sirlapu, A. Kwaśniewska, Maciej Szankin, Marko Bartscherer, Rey Nicolas


Citations: 9

Abstract

With technological advances in the smart home sector, voice control and automation have become key components that can make a real difference in people's lives. The voice recognition technology market continues to evolve rapidly, and almost all smart home devices now provide speaker recognition capability. However, most of them rely on cloud-based solutions or on very deep neural networks for the speaker recognition task, and such models are not suitable for running on smart home devices. In this paper, we compare relatively small Convolutional Neural Networks (CNNs) and evaluate the effectiveness of speaker recognition using these models on edge devices. We also apply transfer learning to deal with the problem of limited training data. By developing a solution suitable for running inference locally on edge devices, we eliminate well-known cloud computing issues such as data privacy and network latency. Preliminary results show that the chosen model carries over the benefits of computer vision techniques, using a CNN and spectrograms to perform speaker classification with precision and recall of ~84% in less than 60 ms on a mobile device with an Atom Cherry Trail processor.
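The abstract's central idea, treating audio as an image so that a CNN can classify it, depends on first turning a waveform into a spectrogram. The following is a minimal sketch of that preprocessing step only, not the authors' pipeline: the abstract does not give the frame length, hop size, window, or sample rate, so the values here (25 ms frames, 10 ms hop, Hann window, 16 kHz) and the function name `log_spectrogram` are assumptions chosen for illustration.

```python
import numpy as np


def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (n_frames, frame_len // 2 + 1).

    frame_len and hop are in samples; the defaults are assumed values
    (25 ms and 10 ms at 16 kHz), not taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Pad very short inputs so that at least one full frame exists.
    if signal.size < frame_len:
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    # Index matrix: row i selects samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # Apply a Hann window to each frame to reduce spectral leakage.
    frames = signal[idx] * np.hanning(frame_len)
    # Magnitude of the one-sided FFT of every frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression; eps avoids log(0) on silent bins.
    return np.log(magnitude + eps)


# Synthetic stand-in for a recording: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
spec = log_spectrogram(tone)
```

The resulting 2-D array can be fed to a small image-style CNN as a single-channel input. Any resizing or normalization needed to match a pretrained network for transfer learning would be a further assumption and is left out of the sketch.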
