Automatic GRBAS Scoring of Pathological Voices using Deep Learning and a Small Set of Labeled Voice Data

Impact factor: 2.5 · Zone 4 (Medicine) · JCR Q1, Audiology & Speech-Language Pathology
Shunsuke Hidaka, Yogaku Lee, Moe Nakanishi, Kohei Wakamiya, Takashi Nakagawa, Tokihiko Kaburagi
Journal of Voice, Volume 39, Issue 3, Pages 846.e1-846.e23. Published 2025-05-01. DOI: 10.1016/j.jvoice.2022.10.020
https://www.sciencedirect.com/science/article/pii/S0892199722003472

Abstract
Automatic GRBAS Scoring of Pathological Voices using Deep Learning and a Small Set of Labeled Voice Data

Objectives

Auditory-perceptual evaluation frameworks, such as the grade-roughness-breathiness-asthenia-strain (GRBAS) scale, are the gold standard for the quantitative evaluation of pathological voice quality. However, the evaluation is subjective; thus, the ratings lack reproducibility due to inter- and intra-rater variation. Prior researchers have proposed deep-learning-based automatic GRBAS score estimation to address this problem. However, these methods require large amounts of labeled voice data. Therefore, this study investigates the potential of automatic GRBAS estimation using deep learning with smaller amounts of data.

Methods

A dataset consisting of 300 pathological sustained /a/ vowel samples was created and rated by eight experts (200 for training, 50 for validation, and 50 for testing). A neural network model that predicts the probability distribution of GRBAS scores from an onset-to-offset waveform was proposed. Random speed perturbation, random crop, and frequency masking were investigated as data augmentation techniques, and power, instantaneous frequency, and group delay were investigated as time-frequency representations.
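The three augmentation techniques named above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names, the perturbation range of 0.9 to 1.1, the linear-interpolation resampling, and the single-band mask are all assumptions chosen for clarity. The abstract does not state the actual parameters.

```python
import random


def speed_perturb(wave, low=0.9, high=1.1, rng=random):
    """Resample a non-empty waveform by a random factor (assumed range).

    A factor above 1 shortens the signal (faster playback); a factor
    below 1 lengthens it. Linear interpolation is used for simplicity.
    """
    factor = rng.uniform(low, high)
    n_out = max(1, int(len(wave) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor                      # fractional read position
        lo = int(pos)
        hi = min(lo + 1, len(wave) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * wave[lo] + frac * wave[hi])
    return out


def random_crop(wave, crop_len, rng=random):
    """Return a random contiguous segment of crop_len samples."""
    if len(wave) <= crop_len:
        return list(wave)
    start = rng.randrange(len(wave) - crop_len + 1)
    return list(wave[start:start + crop_len])


def frequency_mask(spec, max_width, rng=random, fill=0.0):
    """Overwrite one random band of frequency bins in every frame.

    spec is a list of frames, each a list of frequency-bin values.
    The same band is masked across all frames.
    """
    n_bins = len(spec[0])
    width = rng.randint(0, min(max_width, n_bins))
    start = rng.randrange(n_bins - width + 1)
    return [
        [fill if start <= k < start + width else v for k, v in enumerate(frame)]
        for frame in spec
    ]
```

Speed perturbation and random crop act on the waveform, whereas frequency masking acts on a time-frequency representation, so in a pipeline of this kind the first two would be applied before the transform and the third after it. Passing a seeded `random.Random` instance as `rng` makes the augmentations reproducible.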

Results

Five-fold cross-validation was conducted, and the automatic scoring performance was evaluated using the quadratic weighted Cohen's kappa. The results showed that the kappa values of the automatic scoring performance were comparable to those of the inter-rater reliability of experts for all GRBAS items and the intra-rater reliability of experts for items G, B, A, and S. Random speed perturbation was the most effective data augmentation technique overall. When data augmentation was applied, power was the most effective for items G, R, A, and S; for Item B, combining group delay and power yielded additional performance gains.
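For readers unfamiliar with the metric, the quadratic weighted Cohen's kappa can be computed as in the sketch below. It follows the standard textbook definition, kappa = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij) with weights w_ij = (i - j)^2 / (K - 1)^2, and assumes integer GRBAS scores from 0 to 3 (K = 4). The function name and the handling of the degenerate case are illustrative choices; the abstract does not describe the authors' evaluation code.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, n_classes=4):
    """Quadratic weighted Cohen's kappa for two equal-length score lists.

    Scores must be integers in range(n_classes). Returns NaN when the
    expected disagreement is zero (both raters used one identical class),
    because kappa is undefined in that case.
    """
    n = len(ratings_a)
    # Observed confusion matrix: observed[i][j] counts (a == i, b == j).
    observed = [[0] * n_classes for _ in range(n_classes)]
    for x, y in zip(ratings_a, ratings_b):
        observed[x][y] += 1

    # Marginal histograms of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_classes))
              for j in range(n_classes)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n

    if den == 0.0:
        return float("nan")
    return 1.0 - num / den
```

Because the weights grow with the squared distance between scores, a prediction that is off by two grades is penalised four times as heavily as one that is off by a single grade, which suits an ordinal scale such as GRBAS. A value of 1 means perfect agreement and 0 means agreement no better than chance.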

Conclusion

The automatic GRBAS scoring achieved by the proposed model using scant labeled data was comparable to that of experts. This suggests that the challenges resulting from insufficient data can be alleviated. The findings of this study can also contribute to performance improvements in other tasks such as automatic voice disorder detection.
Source journal
Journal of Voice (Medicine: Otorhinolaryngology)
CiteScore: 4.00
Self-citation rate: 13.60%
Articles published: 395
Review time: 59 days
About the journal: The Journal of Voice is widely regarded as the world's premier journal for voice medicine and research. This peer-reviewed publication is listed in Index Medicus and is indexed by the Institute for Scientific Information. The journal contains articles written by experts throughout the world on all topics in voice sciences, voice medicine and surgery, and speech-language pathologists' management of voice-related problems. The journal includes clinical articles, clinical research, and laboratory research. Members of the Foundation receive the journal as a benefit of membership.