Significance of relative phase features for shouted and normal speech classification

IF 1.9 | Zone 3 (Computer Science) | Q2 (Acoustics)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-01-06 DOI:10.1186/s13636-023-00324-4

Khomdet Phapatanaburi, Longbiao Wang, Meng Liu, Seiichi Nakagawa, Talit Jumphoo, Peerapong Uthansakul


Citation count: 0

Abstract

Shouted and normal speech classification plays an important role in many speech-related applications. The existing works are often based on magnitude-based features and ignore phase-based features, which are directly related to magnitude information. In this paper, the importance of phase-based features is explored for the detection of shouted speech. The novel contributions of this work are as follows. (1) Three phase-based features, namely, relative phase (RP), linear prediction analysis estimated speech-based RP (LPAES-RP) and linear prediction residual-based RP (LPR-RP) features, are explored for shouted and normal speech classification. (2) We propose a new RP feature, called the glottal source-based RP (GRP) feature. The main idea of the proposed GRP feature is to exploit the difference between RP and LPAES-RP features to detect shouted speech. (3) A score combination of phase- and magnitude-based features is also employed to further improve the classification performance. The proposed feature and combination are evaluated using the shouted normal electroglottograph speech (SNE-Speech) corpus. The experimental findings show that the RP, LPAES-RP, and LPR-RP features provide promising results for the detection of shouted speech. We also find that the proposed GRP feature can provide better results than those of the standard mel-frequency cepstral coefficient (MFCC) feature. Moreover, compared to using individual features, the score combination of the MFCC and RP/LPAES-RP/LPR-RP/GRP features yields an improved detection performance. Performance analysis under noisy environments shows that the score combination of the MFCC and the RP/LPAES-RP/LPR-RP features gives more robust classification. These outcomes show the importance of RP features in distinguishing shouted speech from normal speech.
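The score combination in contribution (3) can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted linear sum below is an assumption, as are the names `fuse_scores`, `classify`, the weight `alpha`, and the decision threshold; the input scores are taken to be per-utterance values (for example log-likelihood ratios) where larger means "more likely shouted".

```python
import numpy as np


def fuse_scores(mfcc_scores, phase_scores, alpha=0.5):
    """Combine two classifiers' per-utterance scores by a weighted sum.

    Assumption: linear fusion with weight `alpha` on the magnitude-based
    (MFCC) scores and `1 - alpha` on the phase-based (e.g. RP) scores.
    The paper's exact combination rule is not given in the abstract.
    """
    mfcc = np.asarray(mfcc_scores, dtype=float)
    phase = np.asarray(phase_scores, dtype=float)
    if mfcc.shape != phase.shape:
        raise ValueError("score arrays must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * mfcc + (1.0 - alpha) * phase


def classify(scores, threshold=0.0):
    """Label each fused score; the threshold value is hypothetical."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores > threshold, "shouted", "normal")


if __name__ == "__main__":
    # Made-up scores for two utterances, for illustration only.
    fused = fuse_scores([2.0, -1.0], [0.0, -3.0], alpha=0.5)
    print(fused, classify(fused))
```

In practice the weight would be tuned on held-out data, and the two score streams are usually normalized to a comparable range before they are summed.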


Source journal

Eurasip Journal on Audio Speech and Music Processing (ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 4.10

Self-citation rate: 4.20%

Review time: 12 months

Journal description: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.