Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks.

IF 2.5 · CAS Tier 4 (Medicine) · JCR Q1, AUDIOLOGY & SPEECH-LANGUAGE PATHOLOGY
Subin Erattakulangara, Karthika Kelat, Katie Burnham, Rachel Balbi, Sarah E Gerard, David Meyer, Sajan Goud Lingala
{"title":"Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks.","authors":"Subin Erattakulangara, Karthika Kelat, Katie Burnham, Rachel Balbi, Sarah E Gerard, David Meyer, Sajan Goud Lingala","doi":"10.1016/j.jvoice.2025.02.026","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>Accurate segmentation of the vocal tract from MRI data is essential for various voice, speech, and singing applications. Manual segmentation is time-intensive and susceptible to errors. This study aimed to evaluate the efficacy of deep learning algorithms for automatic vocal tract segmentation from 3D MRI.</p><p><strong>Study design: </strong>This study employed a comparative design, evaluating four deep learning architectures for vocal tract segmentation using an open-source dataset of 3D-MRI scans of French speakers.</p><p><strong>Methods: </strong>Fifty-three vocal tract volumes from 10 French speakers were manually annotated by an expert vocologist, assisted by two graduate students in voice science. These included 21 unique French phonemes and three unique voiceless tasks. Four state-of-the-art deep learning segmentation algorithms were evaluated: 2D slice-by-slice U-Net, 3D U-Net, 3D U-Net with transfer learning (pre-trained on lung CT), and 3D transformer U-Net (3D U-NetR). The STAPLE algorithm, which combines segmentations from multiple annotators to generate a probabilistic estimate of the true segmentation, was used to create reference segmentations for evaluation. Model performance was assessed using the Dice coefficient, Hausdorff distance, and structural similarity index measure.</p><p><strong>Results: </strong>The 3D U-Net and 3D U-Net with transfer learning models achieved the highest Dice coefficients (0.896±0.05 and 0.896±0.04, respectively). The 3D U-Net with transfer learning performed comparably to the 3D U-Net while using less than half the training data. It, along with the 2D slice-by-slice U-Net models, demonstrated lower variability in HD distance compared to the 3D U-Net and 3D U-NetR models. All models exhibited challenges in segmenting certain sounds, particularly /kõn/. Qualitative assessment by a voice expert revealed anatomically correct segmentations in the oropharyngeal and laryngopharyngeal spaces for all models, except the 2D slice-by-slice U-NET, and frequent errors with all models near bony regions (eg, teeth).</p><p><strong>Conclusions: </strong>This study emphasizes the effectiveness of 3D convolutional networks, especially with transfer learning, for automatic vocal tract segmentation from 3D MRI. Future research should focus on improving the segmentation of challenging vocal tract configurations and refining boundary delineations.</p>","PeriodicalId":49954,"journal":{"name":"Journal of Voice","volume":" ","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Voice","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.jvoice.2025.02.026","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUDIOLOGY & SPEECH-LANGUAGE PATHOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Objectives: Accurate segmentation of the vocal tract from MRI data is essential for various voice, speech, and singing applications. Manual segmentation is time-intensive and susceptible to errors. This study aimed to evaluate the efficacy of deep learning algorithms for automatic vocal tract segmentation from 3D MRI.

Study design: This study employed a comparative design, evaluating four deep learning architectures for vocal tract segmentation using an open-source dataset of 3D MRI scans of French speakers.

Methods: Fifty-three vocal tract volumes from 10 French speakers were manually annotated by an expert vocologist, assisted by two graduate students in voice science. These covered 21 unique French phonemes and three voiceless tasks. Four state-of-the-art deep learning segmentation architectures were evaluated: a 2D slice-by-slice U-Net, a 3D U-Net, a 3D U-Net with transfer learning (pre-trained on lung CT), and a 3D transformer U-Net (3D U-NetR). The STAPLE algorithm, which combines segmentations from multiple annotators into a probabilistic estimate of the true segmentation, was used to create reference segmentations for evaluation. Model performance was assessed using the Dice coefficient, Hausdorff distance (HD), and structural similarity index measure (SSIM).
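
The paper does not name its implementation framework; as an illustration of the 2D slice-by-slice versus 3D volumetric distinction, here is a minimal sketch using MONAI's generic U-Net, where the channel counts, strides, and input sizes are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (MONAI is an assumed framework choice; all
# hyperparameters here are illustrative, not the authors' settings).
import torch
from monai.networks.nets import UNet

# 2D slice-by-slice U-Net: segments one MRI slice at a time.
unet_2d = UNet(
    spatial_dims=2,
    in_channels=1,            # single-channel MRI
    out_channels=2,           # background vs. vocal tract airway
    channels=(16, 32, 64, 128),
    strides=(2, 2, 2),
    num_res_units=2,
)

# 3D U-Net: segments the whole volume at once, so its receptive field
# spans slices and can exploit the 3D continuity of the airway.
unet_3d = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    channels=(16, 32, 64, 128),
    strides=(2, 2, 2),
    num_res_units=2,
)

slices = torch.randn(4, 1, 256, 256)       # (N, C, H, W)
volume = torch.randn(1, 1, 64, 256, 256)   # (N, C, D, H, W)
print(unet_2d(slices).shape)   # torch.Size([4, 2, 256, 256])
print(unet_3d(volume).shape)   # torch.Size([1, 2, 64, 256, 256])
```

The trade-off the sketch makes visible: the 3D variant sees inter-slice context at the cost of far more memory and training data per sample, which is consistent with the value the study finds in transfer learning from lung CT.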

Results: The 3D U-Net and the 3D U-Net with transfer learning achieved the highest Dice coefficients (0.896±0.05 and 0.896±0.04, respectively). The 3D U-Net with transfer learning performed comparably to the 3D U-Net while using less than half the training data. It, along with the 2D slice-by-slice U-Net, also showed lower variability in HD than the 3D U-Net and 3D U-NetR models. All models struggled to segment certain sounds, particularly /kõn/. Qualitative assessment by a voice expert revealed anatomically correct segmentations in the oropharyngeal and laryngopharyngeal spaces for all models except the 2D slice-by-slice U-Net, and frequent errors near bony regions (eg, teeth) for all models.
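
To make the reported numbers concrete, the following is a minimal NumPy/SciPy sketch of the two headline metrics; the helper functions and masks are illustrative placeholders, not the study's evaluation code or data.

```python
# Minimal sketch of the Dice coefficient and Hausdorff distance on
# binary masks; the masks below are synthetic, not the study's data.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred, ref):
    """Dice coefficient: 2|A∩B| / (|A| + |B|); 1.0 means perfect overlap."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    return 2.0 * inter / (pred.sum() + ref.sum())

def hausdorff(pred, ref):
    """Symmetric Hausdorff distance between the two voxel sets (in voxels)."""
    p, r = np.argwhere(pred), np.argwhere(ref)
    return max(directed_hausdorff(p, r)[0], directed_hausdorff(r, p)[0])

# Synthetic 3D example: a cube reference and a prediction shifted 2 voxels.
ref = np.zeros((32, 32, 32), dtype=np.uint8)
ref[8:24, 8:24, 8:24] = 1
pred = np.roll(ref, shift=2, axis=0)
print(f"Dice: {dice(pred, ref):.3f}, HD: {hausdorff(pred, ref):.1f} voxels")
# -> Dice: 0.875, HD: 2.0 voxels
```

Dice rewards volume overlap while HD penalizes the single worst boundary error, which is why the abstract reports both: a model can score a high Dice yet still leave outliers near bony regions that HD exposes.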

Conclusions: This study emphasizes the effectiveness of 3D convolutional networks, especially with transfer learning, for automatic vocal tract segmentation from 3D MRI. Future research should focus on improving the segmentation of challenging vocal tract configurations and refining boundary delineations.

Source journal
Journal of Voice (Medicine - Otorhinolaryngology)
CiteScore: 4.00
Self-citation rate: 13.60%
Articles per year: 395
Review time: 59 days
Journal description: The Journal of Voice is widely regarded as the world's premier journal for voice medicine and research. This peer-reviewed publication is listed in Index Medicus and is indexed by the Institute for Scientific Information. The journal contains articles written by experts throughout the world on all topics in voice sciences, voice medicine and surgery, and speech-language pathologists' management of voice-related problems. The journal includes clinical articles, clinical research, and laboratory research. Members of the Foundation receive the journal as a benefit of membership.