Optimized Deep Neural Networks Audio Tagging Framework for Virtual Business Assistant

IF 0.9 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Advances in Information Technology. Pub Date: 2023-01-01. DOI: 10.12720/jait.14.3.550-558

Fatma Sh. El-metwally, Ali I. Eldesouky, Nahla B. Abdel-Hamid, Sally M. Elghamrawy


Citations: 0

Abstract

A virtual assistant has a major impact on business and on an organization's development. It can be used to manage customer relations, deal with received queries, and automatically reply to e-mails and phone calls. Audio signal processing has become increasingly popular since the development of virtual assistants. Advances in deep learning and audio signal processing have dramatically improved audio tagging. Audio Tagging (AT) is the task of eliciting descriptive labels from audio clips. This study proposes an Optimized Deep Neural Networks Audio Tagging Framework for Virtual Business Assistant to categorize and analyze audio tags. Various audio tagging features are extracted from each input signal. The extracted features are fed into a neural network that performs multi-label classification to predict the tags. Optimization techniques are used to improve the quality of the neural network's model fit. To test the efficiency of the framework, four experiments compared it with other approaches. From these results, it was concluded that the framework is more efficient than the others. When the neural network was trained, Mel-Frequency Cepstral Coefficient (MFCC) features with the Adamax optimizer achieved the best results, with 93% accuracy and a 0.17% loss. When the model was evaluated on seven labels, it achieved, on average, a precision of 0.952, a recall of 0.952, an F-score of 0.951, an accuracy of 0.983, and an equal error rate of 0.015 on the evaluation set, compared to the provided Detection and Classification of Acoustic Scenes and Events (DCASE) baseline, which achieved an accuracy of 72.5% and
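The evaluation metrics named in the abstract (per-label precision, recall, and F-score averaged over labels, plus an equal error rate) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `per_label_prf`, `macro_average`, and `equal_error_rate` are hypothetical, the toy labels and scores are made up for illustration, and the threshold sweep is only one simple way to approximate an equal error rate.

```python
from typing import List, Sequence, Tuple

Metrics = Tuple[float, float, float]  # (precision, recall, F-score)


def per_label_prf(y_true: Sequence[Sequence[int]],
                  y_pred: Sequence[Sequence[int]]) -> List[Metrics]:
    """Precision, recall and F-score for each label (column) of two
    binary matrices with one row per clip and one column per tag."""
    results: List[Metrics] = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f_score))
    return results


def macro_average(per_label: Sequence[Metrics]) -> Metrics:
    """Unweighted mean of each metric across labels."""
    n = len(per_label)
    return (sum(m[0] for m in per_label) / n,
            sum(m[1] for m in per_label) / n,
            sum(m[2] for m in per_label) / n)


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER for one label: sweep every observed score as a
    threshold and return the mean of the false-acceptance and
    false-rejection rates where the two are closest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(scores)):
        far = sum(1 for s, l in zip(scores, labels)
                  if l == 0 and s >= thr) / negatives
        frr = sum(1 for s, l in zip(scores, labels)
                  if l == 1 and s < thr) / positives
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy data (assumed): 4 clips, 3 tags.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
print(macro_average(per_label_prf(y_true, y_pred)))
print(equal_error_rate([0.9, 0.8, 0.3, 0.6, 0.2, 0.1], [1, 1, 1, 0, 0, 0]))
```

Macro averaging weights every tag equally, which is one plausible reading of the "average" figures reported for the seven labels; the abstract does not say which averaging scheme was used.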


Source journal

Journal of Advances in Information Technology (Computer Science: Information Systems)

CiteScore: 4.20

Self-citation rate: 20.00%
