Phonetic information in the vowel spectrum: the meaning of mel-Frequency Cepstral Coefficients

IF 2.4 Zone 1 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Phonetics Pub Date : 2025-07-17 DOI:10.1016/j.wocn.2025.101434

Khalil Iskarous, Alessandro Vietti

Journal of Phonetics, volume 112, Article 101434. Full text: https://www.sciencedirect.com/science/article/pii/S0095447025000452

Citation count: 0

Abstract

There is still disagreement in the acoustic phonetics literature on how phonetic information is encoded in the vowel acoustic spectrum. The “formant hypothesis” holds that formant frequency locations are the primary encoding of phonetic information. But perceptual experiments have shown that listeners can identify vowels, to a certain extent, even when formant peaks are suppressed. This has given rise to the “whole-spectrum” hypothesis, which represents each vowel segment by a high-dimensional description of its entire spectrum. While the “whole-spectrum” hypothesis better predicts suppressed-formant vowel perception, one advantage of the “formant hypothesis” is that it parameterizes the vowel inventory of a language in terms of featural classes indexed by a few values of formant frequencies. These frequency scales serve to describe a language’s phonological organization and sound change. In this paper, we show that the mel-frequency cepstral coefficients (MFCCs), whole-spectrum parameterizations that have been used in speech technology from the 1970s until today, also have a phonetic interpretation leading to the same featural classes as the traditional description. This is despite the fact that for many decades they have been thought to be uninterpretable. Our arguments are based on analyses of all vowel data from the TIMIT database, which contains large amounts of speaker, context, prosodic, and dialectal variability, using information theory, effect-size statistics, and Fourier theory. Our goal is to show that MFCCs can be useful for further developments in the field of acoustic phonetics, because while they extract phonetically distinctive information from the entire spectrum, they can also further our understanding of the linguistic structure of vowel spaces.
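The abstract does not say how the MFCCs were computed, so the following is only a minimal sketch of the conventional pipeline (windowed power spectrum, triangular mel filterbank, logarithm, discrete cosine transform), not the authors' procedure. The function names, the 2595/700 mel formula, the Hamming window, and the choices of 26 filters and 13 coefficients are all assumptions made for illustration. The sketch may help show why Fourier theory is relevant: each coefficient is the projection of the log mel spectrum onto a cosine of increasing frequency, so low-order coefficients summarize the broad shape of the whole spectrum.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced in mel; shape (n_filters, n_fft // 2 + 1)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        # A triangle is the smaller of the two slopes, floored at zero.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II: c[k] = sum_n x[n] * cos(pi * k * (n + 0.5) / N)."""
    n = x.size
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return basis @ x

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    """MFCCs for a single frame of samples (hypothetical default settings)."""
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    mel_energy = mel_filterbank(n_filters, frame.size, sr) @ power
    log_mel = np.log(mel_energy + 1e-10)  # small floor avoids log(0)
    return dct_ii(log_mel, n_coeffs)

# Usage: a synthetic 25 ms frame at 16 kHz with two sinusoidal components.
sr = 16000
t = np.arange(400) / sr
frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
coeffs = mfcc_frame(frame, sr)
print(coeffs.shape)
```

With this unnormalized transform, coefficient 0 is simply the sum of the log mel energies (overall level), while coefficient 1 weights the low and high ends of the mel spectrum with opposite signs, which is a measure of spectral tilt.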

Source journal

Journal of Phonetics Multiple-

CiteScore

3.50

Self-citation rate

26.30%

Number of articles published

Journal description: The Journal of Phonetics publishes papers of an experimental or theoretical nature that deal with phonetic aspects of language and linguistic communication processes. Papers dealing with technological and/or pathological topics, or papers of an interdisciplinary nature are also suitable, provided that linguistic-phonetic principles underlie the work reported. Regular articles, review articles, and letters to the editor are published. Themed issues are also published, devoted entirely to a specific subject of interest within the field of phonetics.