Formant Tracking by Combining Deep Neural Network and Linear Prediction

IF 2.7 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Open Journal of Signal Processing, vol. 6, pp. 222-230. Pub Date: 2025-01-16. DOI: 10.1109/OJSP.2025.3530876

Sudarsana Reddy Kadiri;Kevin Huang;Christina Hagedorn;Dani Byrd;Paavo Alku;Shrikanth Narayanan


Citations: 0

Abstract

Formant tracking is an area of speech science that has recently undergone a technology shift from classical model-driven signal processing methods to modern data-driven deep learning methods. In this study, these two domains are combined in formant tracking by refining the formants estimated by a data-driven deep neural network (DNN) with formant estimates given by a model-driven linear prediction (LP) method. In the refinement process, the three lowest formants, initially estimated by the DNN-based method, are replaced frame by frame with local spectral peaks identified by the LP method. The LP-based refinement stage can be seamlessly integrated into the DNN without any training. As the LP method, the study advocates the use of quasi-closed phase forward-backward (QCP-FB) analysis. Three spectral representations are compared as DNN inputs: mel-frequency cepstral coefficients (MFCCs), the spectrogram, and the complex spectrogram. Formant tracking performance was evaluated by comparing the proposed refined DNN tracker with seven reference trackers, which included both signal-processing-based and deep-learning-based methods. As evaluation data, ground-truth formants of the Vocal Tract Resonance (VTR) corpus were used. The results demonstrate that the refined DNN trackers outperformed all conventional trackers. The best results were obtained by using the MFCC input for the DNN. The proposed MFCC refinement (MFCC-DNN_QCP-FB) reduced estimation errors by 0.8 Hz, 12.9 Hz, and 11.7 Hz for the first (F1), second (F2), and third (F3) formants, respectively, compared to the Deep Formants refinement (DeepF_QCP-FB). When compared to the model-driven KARMA tracking method, the proposed refinement reduced estimation errors by 2.3 Hz, 55.5 Hz, and 143.4 Hz for F1, F2, and F3, respectively. A detailed evaluation across various phonetic categories and gender groups showed that the proposed hybrid refinement approach improves formant tracking performance across most test conditions.
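
The frame-wise refinement described in the abstract can be illustrated with a minimal Python sketch. The abstract states only that the DNN estimates of F1 to F3 are replaced with local spectral peaks found by the LP method; the nearest-peak selection rule, the function name `refine_formants`, the array layout, and the fallback for frames without peaks are all assumptions made here for illustration, not the authors' implementation. The LP peak candidates are taken as given input, so QCP-FB analysis itself is not shown.

```python
import numpy as np

def refine_formants(dnn_formants, lp_peaks):
    """Replace each DNN formant estimate with the nearest LP spectral peak.

    dnn_formants: array-like of shape (n_frames, 3), F1-F3 in Hz from the DNN.
    lp_peaks: one 1-D array per frame holding LP peak frequencies in Hz.
    Nearest-peak matching is an assumed rule, not taken from the paper.
    """
    dnn_formants = np.asarray(dnn_formants, dtype=float)
    refined = dnn_formants.copy()
    for t, peaks in enumerate(lp_peaks):
        peaks = np.asarray(peaks, dtype=float)
        if peaks.size == 0:
            continue  # assumed fallback: keep the DNN estimate for this frame
        # Absolute distance from every DNN formant to every LP peak in frame t
        dist = np.abs(dnn_formants[t][:, None] - peaks[None, :])
        # For each formant, pick the LP peak with the smallest distance
        refined[t] = peaks[np.argmin(dist, axis=1)]
    return refined

# Toy data: two frames; the second frame has no LP peaks.
dnn = [[510.0, 1480.0, 2450.0], [300.0, 2200.0, 2900.0]]
lp = [np.array([495.0, 1510.0, 2500.0, 3400.0]), np.array([])]
refined = refine_formants(dnn, lp)
print(refined)
```

Because the replacement is a lookup applied after the network output, it needs no training, which is consistent with the abstract's statement that the LP stage can be attached to the DNN without retraining. This sketch does not prevent two formants from mapping to the same LP peak; a practical tracker would likely need an additional constraint for that case.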
