Learning and consolidating the contextualized contour representations of tones from F0 sequences and durational variations via transformers.

IF 2.1 | Zone 2, Physics and Astrophysics | Q2 ACOUSTICS
Yi-Fen Liu, Xiang-Li Lu
Journal of the Acoustical Society of America, vol. 156, no. 5, pp. 3353-3372. Published 2024-11-01. Journal Article.
DOI: 10.1121/10.0034359 (https://doi.org/10.1121/10.0034359)

Abstract

Many speech characteristics, including conventional acoustic features such as mel-frequency cepstral coefficients and mel-spectrograms, as well as pre-trained contextualized acoustic representations such as wav2vec2.0, have been used in deep neural networks, or successfully fine-tuned with connectionist temporal classification, for Mandarin tone classification. In this study, the authors propose a transformer-based tone classification architecture, TNet-Full, which uses estimated fundamental frequency (F0) values and aligned boundary information on syllables and words. The key components of the model framework are the contour encoder, the rhythm encoder, and the interaction encoder, in which cross-attention between contours and rhythms is established. By using contextual tonal contours as a reference, together with rhythmic information derived from duration variations to further consolidate the contour representations for tone recognition, TNet-Full achieves absolute performance improvements of 24.4% for read speech (from 71.4% to 95.8%) and 6.3% for conversational speech (from 52.1% to 58.4%) over a naive, simple baseline transformer, TNet-base. The corresponding relative improvements are 34.2% and 12.1%. As humans perceive tones, contour abstractions of tones can only be derived from F0 sequences, and tone recognition would be improved if syllable temporal organization were stable and predictable instead of fluctuating, as seen in conversations.
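
The abstract quotes both absolute and relative improvements, and the two are easy to confuse. The following is a minimal sketch of how the figures relate, assuming the relative improvement is the absolute gain divided by the baseline accuracy. The helper name `improvements` is hypothetical and is not taken from the paper. The input accuracies are the ones reported in the abstract.

```python
from typing import Tuple


def improvements(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute, relative) improvement for accuracies given in percent.

    Assumption: relative improvement = absolute gain / baseline accuracy.
    """
    absolute = improved - baseline            # gain in percentage points
    relative = absolute / baseline * 100.0    # gain as a percentage of the baseline
    return absolute, relative


# Accuracies (in %) reported in the abstract: (TNet-base, TNet-Full).
reported = {
    "read speech": (71.4, 95.8),
    "conversational speech": (52.1, 58.4),
}

for condition, (base, full) in reported.items():
    absolute, relative = improvements(base, full)
    print(f"{condition}: +{absolute:.1f} points absolute, +{relative:.1f}% relative")
```

Under that assumption the computed values round to 24.4 and 34.2 for read speech and to 6.3 and 12.1 for conversational speech, which agrees with the numbers stated in the abstract.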

Source journal
CiteScore: 4.60
Self-citation rate: 16.70%
Articles published: 1433
Review time: 4.7 months
Journal description: Since 1929 The Journal of the Acoustical Society of America has been the leading source of theoretical and experimental research results in the broad interdisciplinary study of sound. Subject coverage includes: linear and nonlinear acoustics; aeroacoustics, underwater sound and acoustical oceanography; ultrasonics and quantum acoustics; architectural and structural acoustics and vibration; speech, music and noise; psychology and physiology of hearing; engineering acoustics, transduction; bioacoustics, animal bioacoustics.