Robust speech command recognition in challenging industrial environments

IF 4.5 · Zone 3 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Computer Communications · Pub Date: 2024-09-02 · DOI: 10.1016/j.comcom.2024.107938

Stefano Bini, Vincenzo Carletti, Alessia Saggese, Mario Vento

Computer Communications, vol. 228, Article 107938. Full text: https://www.sciencedirect.com/science/article/pii/S0140366424002858

Citations: 0

Abstract


Robust speech command recognition in challenging industrial environments

Speech is among the main forms of communication between humans and robots in industrial settings, being the most natural way for a human worker to issue commands. However, pervasive and loud environmental noise poses significant challenges to the adoption of Speech-Command Recognition systems onboard manufacturing robots: such systems are expected to run in real time on hardware with limited computational capabilities while remaining robust and accurate in these complex environments. In this paper, we propose a system based on an End-to-End architecture with a Conformer backbone. The system is specifically designed to achieve high accuracy in noisy industrial environments while keeping the computational burden low enough to meet stringent real-time requirements on computing devices embedded in robots. To increase the generalization capability of the system, the training procedure is driven by a Curriculum Learning strategy combined with dynamic data augmentation techniques that progressively increase the complexity of the input samples by increasing the noise during the training phase. We have conducted extensive experimentation to assess the effectiveness of our system, using a dataset composed of more than 50,000 samples, of which about 2,000 were acquired during the daily operations of a Stellantis factory in Italy. The results confirm that the proposed approach is suitable for adoption in a real industrial environment: on both English and Italian commands it achieves an accuracy higher than 90%, while maintaining a compact model size (the network is 1.81 MB) and running in real time on an industrial embedded device (namely 41 ms on an NVIDIA Xavier NX).
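The curriculum idea in the abstract, adding more noise as training progresses, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`snr_for_epoch`, `mix_at_snr`, `augment`), the linear schedule, and the 40 dB to 0 dB signal-to-noise range are all hypothetical, and the authors' actual augmentation pipeline may differ.

```python
import math
import random


def snr_for_epoch(epoch, total_epochs, start_snr_db=40.0, end_snr_db=0.0):
    """Hypothetical linear curriculum: the target SNR falls (noise rises) with the epoch."""
    if total_epochs <= 1:
        return end_snr_db
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start_snr_db + frac * (end_snr_db - start_snr_db)


def mean_power(signal):
    """Average squared amplitude of a non-empty sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that speech power / noise power matches snr_db."""
    if not speech or not noise:
        return list(speech)
    # Repeat or truncate the noise clip so it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        return list(speech)
    scale = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]


def augment(speech, noise, epoch, total_epochs):
    """Dynamic augmentation: applied on the fly, harder (noisier) in later epochs."""
    return mix_at_snr(speech, noise, snr_for_epoch(epoch, total_epochs))


if __name__ == "__main__":
    rng = random.Random(0)
    speech = [math.sin(0.1 * i) for i in range(1000)]   # stand-in for a spoken command
    noise = [rng.uniform(-1.0, 1.0) for _ in range(300)]  # stand-in for factory noise
    for epoch in (0, 5, 10):
        noisy = augment(speech, noise, epoch, total_epochs=11)
        print(epoch, snr_for_epoch(epoch, 11), len(noisy))
```

Because the noise is mixed in at load time rather than stored, the same clean command can be presented at a different noise level in every epoch, which is the sense in which the augmentation is "dynamic".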


Source journal: Computer Communications (Engineering & Technology: Telecommunications)
CiteScore: 14.10
Self-citation rate: 5.00%
Articles published: 397
Review time: 66 days

About the journal: Computer and communication networks are key infrastructures of the information society, with high socio-economic value because they contribute to the correct operation of many critical services (from healthcare to finance and transportation). The Internet is the core of today's computer-communication infrastructures. This has transformed the Internet from a robust network for data transfer between computers into a global, content-rich communication and information system, where content is increasingly generated by users and distributed according to human social relations. Next-generation network technologies, architectures and protocols are therefore required to overcome the limitations of the legacy Internet and to add new capabilities and services. The future Internet should be ubiquitous, secure, resilient, and closer to human communication paradigms. Computer Communications is a peer-reviewed international journal that publishes high-quality scientific articles (both theory and practice) and survey papers covering all aspects of future computer communication networks (on all layers except the physical layer), with special attention to the evolution of the Internet architecture, protocols, services, and applications.