A Robust Speech Features Extractor & Reconstructor For Artificial Intelligence Frontends

{"title":"A Robust Speech Features Extractor & Reconstructor For Artificial Intelligence Frontends","authors":"","doi":"10.33140/jhss.05.03.06","DOIUrl":null,"url":null,"abstract":"Human speech consists mainly of three components: a glottal signal, a vocal tract response, and a harmonic shift. The three respectively correlate with the intonation (pitch), the formants (timbre), and the speech resolution (depth). Adding the intonation of the Fundamental Frequency (FF) to Automatic Speech Recognition (ASR) systems is necessary. First, the intonation conveys a primitive paralanguage. Second, its speaker-tuning reduces background noises to clarify acoustic observations. Third, extracting the speech features is more efficient when they are computed together at the same time. This work introduces a frequency-modulation model, a novel quefrency-based speech feature extraction that is named Speech Quefrency Transform (SQT), and its proper quefrency scaling and transformation function. The cepstrums, which are spectrums of spectrums, are suggested in time unit accelerations, whereby the discrete variable, the quefrency, is measured in Hertz-per-microsecond. The extracted features are comparable to Mel-Frequency Cepstral Coefficients (MFCC) integrated within a quefrency-based pitch tracker. The SQT transform directly expands time samples of stationary signals (i.e., speech) to a higher dimensional space, which can help generative Artificial Neural Networks (ANNs) in unsupervised Machine Learning and Natural Language Processing (NLP) tasks. The proposed methodologies, which are a scalable solution that is compatible with dynamic and parallel programming for refined speech and cepstral analysis, can robustly estimate the features after applying a matrix multiplication in less than a hundred sub-bands, preserving precious computational resources.","PeriodicalId":267360,"journal":{"name":"Journal of Humanities & Social Sciences","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Humanities & Social Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.33140/jhss.05.03.06","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Human speech consists mainly of three components: a glottal signal, a vocal tract response, and a harmonic shift. These correlate, respectively, with the intonation (pitch), the formants (timbre), and the speech resolution (depth). Adding the intonation of the Fundamental Frequency (FF) to Automatic Speech Recognition (ASR) systems is necessary for three reasons. First, intonation conveys a primitive paralanguage. Second, tuning to the speaker suppresses background noise and clarifies the acoustic observations. Third, feature extraction is more efficient when all the features are computed together at the same time. This work introduces a frequency-modulation model, a novel quefrency-based speech feature extraction named the Speech Quefrency Transform (SQT), and its proper quefrency scaling and transformation function. Cepstra, which are spectra of spectra, are expressed in units of time acceleration, so the discrete variable, the quefrency, is measured in Hertz per microsecond. The extracted features are comparable to Mel-Frequency Cepstral Coefficients (MFCC) integrated within a quefrency-based pitch tracker. The SQT directly expands time samples of stationary signals (e.g., speech) into a higher-dimensional space, which can help generative Artificial Neural Networks (ANNs) in unsupervised Machine Learning and Natural Language Processing (NLP) tasks. The proposed methodology, a scalable solution compatible with dynamic and parallel programming for refined speech and cepstral analysis, robustly estimates the features with a single matrix multiplication over fewer than one hundred sub-bands, preserving precious computational resources.
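The abstract does not give the SQT's transformation function, but the building blocks it names are classical and easy to sketch: a cepstrum obtained by transforming a log spectrum (a "spectrum of a spectrum"), feature pooling done as one matrix multiplication over fewer than one hundred sub-bands, and a quefrency-domain pitch tracker. The snippet below is a minimal NumPy illustration under assumed parameters (a 16 kHz sample rate, 25 ms frames, 64 linearly spaced triangular sub-bands, and a 50-400 Hz pitch search range); it is not the authors' SQT, and in particular it does not reproduce the paper's quefrency scaling.

```python
import numpy as np

FS = 16_000   # assumed sample rate in Hz
FRAME = 400   # 25 ms analysis frame

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum: the inverse transform of the log magnitude spectrum
    (a "spectrum of a spectrum")."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)  # +eps avoids log(0)
    return np.fft.irfft(log_mag, n=frame.size)

def pooling_matrix(n_quefrency: int, n_subbands: int = 64) -> np.ndarray:
    """Triangular sub-band pooling matrix: a single matrix multiplication maps
    the quefrency axis onto fewer than one hundred coefficients. The linear
    band layout is an illustrative stand-in for the paper's quefrency scaling."""
    edges = np.linspace(0, n_quefrency - 1, n_subbands + 2)
    q = np.arange(n_quefrency, dtype=float)
    W = np.zeros((n_subbands, n_quefrency))
    for i in range(n_subbands):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rise = (q - lo) / max(mid - lo, 1e-9)
        fall = (hi - q) / max(hi - mid, 1e-9)
        W[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    return W

def cepstral_pitch(cep: np.ndarray, fmin: float = 50.0, fmax: float = 400.0) -> float:
    """Quefrency-domain pitch tracking: the dominant peak's lag (in samples)
    is the pitch period, so F0 = FS / lag."""
    lo, hi = int(FS / fmax), int(FS / fmin)
    lag = lo + int(np.argmax(cep[lo:hi]))
    return FS / lag

# Usage: a 125 Hz impulse train (period = 128 samples at 16 kHz) stands in
# for one voiced speech frame.
frame = np.zeros(FRAME)
frame[::128] = 1.0
frame *= np.hanning(FRAME)
cep = real_cepstrum(frame)
features = pooling_matrix(FRAME // 2) @ cep[:FRAME // 2]  # one matrix multiply
print(features.shape, round(cepstral_pitch(cep), 1))      # (64,) 125.0
```

On the synthetic pulse train, the cepstral peak lands at a lag of 128 samples, so the tracker recovers 125.0 Hz; real speech frames would replace the pulse train, and the pooled 64-coefficient vector plays the role the abstract assigns to the MFCC-comparable features.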