Time frequency domain deep CNN for automatic background classification in speech signals

JCR quartile: Q1 (Arts and Humanities)
Rakesh Reddy Yakkati, Sreenivasa Reddy Yeduri, Rajesh Kumar Tripathy, Linga Reddy Cenkeramaddi
DOI: 10.1007/s10772-023-10042-z
Journal: International Journal of Speech Technology
Publication date: 2023-09-01
Publication type: Journal Article
Citations: 0

Abstract

Many application areas, such as background identification, predictive maintenance in industrial settings, smart home applications, assisting deaf people with their daily activities, and indexing and retrieval of content-based multimedia, use automatic background classification from speech signals. Accurately predicting the background environment from speech signal information is challenging. Thus, this paper proposes a novel synchrosqueezed wavelet transform (SWT)-based deep learning (DL) approach for automatically classifying background information embedded in speech signals. The SWT is used to obtain a time-frequency plot from each speech signal. These time-frequency representations are then fed to a deep convolutional neural network (DCNN) that classifies the embedded background information. The proposed DCNN model consists of three convolution layers, one batch-normalization layer, three max-pooling layers, one dropout layer, and one fully connected layer. The method is tested on various background signals embedded in speech, such as airport, airplane, drone, street, babble, car, helicopter, exhibition, station, restaurant, and train sounds. According to the results, the proposed SWT-based DCNN approach achieves an overall classification accuracy of 97.96 (± 0.53)% for background information embedded in speech signals. Finally, the performance of the proposed approach is compared with existing methods.
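The layer stack described in the abstract (three convolution layers, three max-pooling stages, one fully connected output layer) can be sketched as a shape-flow calculation. This is a minimal illustrative sketch: the input resolution, filter counts, kernel and pool sizes below are assumptions, not values from the paper; only the class count (the eleven background types listed) follows from the abstract.

```python
# Hedged sketch of the DCNN's dimensionality flow: 3 conv layers (assumed
# 3x3 kernels, "same" padding), each followed by 2x2 max pooling, then a
# flatten and one fully connected layer over 11 background classes.
# Input size (224x224x3) and filter counts (16, 32, 64) are assumptions.

def conv2d_shape(h, w, stride=1, padding="same"):
    """Spatial output size of a 2-D convolution with 'same' padding."""
    if padding == "same":
        return (h + stride - 1) // stride, (w + stride - 1) // stride
    raise ValueError("only 'same' padding is sketched here")

def maxpool_shape(h, w, pool=2):
    """Spatial output size of non-overlapping max pooling."""
    return h // pool, w // pool

def dcnn_shapes(h=224, w=224, filters=(16, 32, 64), n_classes=11):
    """Trace tensor shapes through the assumed conv/pool stack."""
    shapes = [("input", (h, w, 3))]
    for i, f in enumerate(filters, start=1):
        h, w = conv2d_shape(h, w)
        shapes.append((f"conv{i}", (h, w, f)))
        h, w = maxpool_shape(h, w)
        shapes.append((f"pool{i}", (h, w, f)))
    shapes.append(("flatten", (h * w * filters[-1],)))
    shapes.append(("fc", (n_classes,)))
    return shapes

for name, shape in dcnn_shapes():
    print(name, shape)
```

Under these assumptions, each pooling stage halves the spatial resolution of the SWT time-frequency image, so the fully connected layer sees a 28 × 28 × 64 feature map flattened to 50 176 values before mapping to the 11 classes.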
Source journal
International Journal of Speech Technology (Engineering, Electrical & Electronic)
CiteScore: 5.00
Self-citation rate: 0.00%
Articles published per year: 65
Journal description: The International Journal of Speech Technology is a research journal that focuses on speech technology and its applications. It promotes research and description on all aspects of speech input and output, including theory, experiment, testing, base technology, and applications. The journal is an international forum for the dissemination of research related to the applications of speech technology as well as to the technology itself as it relates to real-world applications. Articles describing original work in all aspects of speech technology are included. Sample topics include but are not limited to the following:

- applications employing digitized speech, synthesized speech or automatic speech recognition
- technological issues of speech input or output
- human factors, intelligent interfaces, robust applications
- integration of aspects of artificial intelligence and natural language processing
- international and local language implementations of speech synthesis and recognition
- development of new algorithms
- interface description techniques, tools and languages
- testing of intelligibility, naturalness and accuracy
- computational issues in speech technology
- software development tools
- speech-enabled robotics
- speech technology as a diagnostic tool for treating language disorders
- voice technology for managing serious laryngeal disabilities
- the use of speech in multimedia

This is the only journal which presents papers on both the base technology and theory as well as all varieties of applications. It encompasses all aspects of the three major technologies: text-to-speech synthesis, automatic speech recognition, and stored (digitized) speech.