A Robust Speech Features Extractor & Reconstructor For Artificial Intelligence Frontends

{"title":"A Robust Speech Features Extractor & Reconstructor For Artificial Intelligence Frontends","authors":"","doi":"10.33140/jhss.05.03.06","DOIUrl":null,"url":null,"abstract":"Human speech consists mainly of three components: a glottal signal, a vocal tract response, and a harmonic shift. The three respectively correlate with the intonation (pitch), the formants (timbre), and the speech resolution (depth). Adding the intonation of the Fundamental Frequency (FF) to Automatic Speech Recognition (ASR) systems is necessary. First, the intonation conveys a primitive paralanguage. Second, its speaker-tuning reduces background noises to clarify acoustic observations. Third, extracting the speech features is more efficient when they are computed together at the same time. This work introduces a frequency-modulation model, a novel quefrency-based speech feature extraction that is named Speech Quefrency Transform (SQT), and its proper quefrency scaling and transformation function. The cepstrums, which are spectrums of spectrums, are suggested in time unit accelerations, whereby the discrete variable, the quefrency, is measured in Hertz-per-microsecond. The extracted features are comparable to Mel-Frequency Cepstral Coefficients (MFCC) integrated within a quefrency-based pitch tracker. The SQT transform directly expands time samples of stationary signals (i.e., speech) to a higher dimensional space, which can help generative Artificial Neural Networks (ANNs) in unsupervised Machine Learning and Natural Language Processing (NLP) tasks. The proposed methodologies, which are a scalable solution that is compatible with dynamic and parallel programming for refined speech and cepstral analysis, can robustly estimate the features after applying a matrix multiplication in less than a hundred sub-bands, preserving precious computational resources.","PeriodicalId":267360,"journal":{"name":"Journal of Humanities & Social Sciences","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Humanities & Social Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.33140/jhss.05.03.06","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Human speech consists mainly of three components: a glottal signal, a vocal tract response, and a harmonic shift. These correlate, respectively, with the intonation (pitch), the formants (timbre), and the speech resolution (depth). Adding the intonation of the Fundamental Frequency (FF) to Automatic Speech Recognition (ASR) systems is necessary for three reasons. First, intonation conveys a primitive paralanguage. Second, tuning to the speaker suppresses background noise and clarifies the acoustic observations. Third, feature extraction is more efficient when all the features are computed together at the same time. This work introduces a frequency-modulation model, a novel quefrency-based speech feature extraction named the Speech Quefrency Transform (SQT), and its proper quefrency scaling and transformation function. Cepstra, which are spectra of spectra, are expressed in units of time acceleration, so the discrete variable, the quefrency, is measured in Hertz per microsecond. The extracted features are comparable to Mel-Frequency Cepstral Coefficients (MFCC) integrated within a quefrency-based pitch tracker. The SQT directly expands time samples of stationary signals (e.g., speech) into a higher-dimensional space, which can help generative Artificial Neural Networks (ANNs) in unsupervised Machine Learning and Natural Language Processing (NLP) tasks. The proposed methodology, a scalable solution compatible with dynamic and parallel programming for refined speech and cepstral analysis, robustly estimates the features with a single matrix multiplication over fewer than one hundred sub-bands, preserving precious computational resources.
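The abstract does not give the SQT's transformation function, but the building blocks it names are classical and easy to sketch: a cepstrum obtained by transforming a log spectrum (a "spectrum of a spectrum"), feature pooling done as one matrix multiplication over fewer than one hundred sub-bands, and a quefrency-domain pitch tracker. The snippet below is a minimal NumPy illustration under assumed parameters (a 16 kHz sample rate, 25 ms frames, 64 linearly spaced triangular sub-bands, and a 50-400 Hz pitch search range); it is not the authors' SQT, and in particular it does not reproduce the paper's quefrency scaling.

```python
import numpy as np

FS = 16_000   # assumed sample rate in Hz
FRAME = 400   # 25 ms analysis frame

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum: the inverse transform of the log magnitude spectrum
    (a "spectrum of a spectrum")."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)  # +eps avoids log(0)
    return np.fft.irfft(log_mag, n=frame.size)

def pooling_matrix(n_quefrency: int, n_subbands: int = 64) -> np.ndarray:
    """Triangular sub-band pooling matrix: a single matrix multiplication maps
    the quefrency axis onto fewer than one hundred coefficients. The linear
    band layout is an illustrative stand-in for the paper's quefrency scaling."""
    edges = np.linspace(0, n_quefrency - 1, n_subbands + 2)
    q = np.arange(n_quefrency, dtype=float)
    W = np.zeros((n_subbands, n_quefrency))
    for i in range(n_subbands):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rise = (q - lo) / max(mid - lo, 1e-9)
        fall = (hi - q) / max(hi - mid, 1e-9)
        W[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    return W

def cepstral_pitch(cep: np.ndarray, fmin: float = 50.0, fmax: float = 400.0) -> float:
    """Quefrency-domain pitch tracking: the dominant peak's lag (in samples)
    is the pitch period, so F0 = FS / lag."""
    lo, hi = int(FS / fmax), int(FS / fmin)
    lag = lo + int(np.argmax(cep[lo:hi]))
    return FS / lag

# Usage: a 125 Hz impulse train (period = 128 samples at 16 kHz) stands in
# for one voiced speech frame.
frame = np.zeros(FRAME)
frame[::128] = 1.0
frame *= np.hanning(FRAME)
cep = real_cepstrum(frame)
features = pooling_matrix(FRAME // 2) @ cep[:FRAME // 2]  # one matrix multiply
print(features.shape, round(cepstral_pitch(cep), 1))      # (64,) 125.0
```

On the synthetic pulse train, the cepstral peak lands at a lag of 128 samples, so the tracker recovers 125.0 Hz; real speech frames would replace the pulse train, and the pooled 64-coefficient vector plays the role the abstract assigns to the MFCC-comparable features.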