A Deep-Learning Approach for Vocal Fold Pose Estimation in Videoendoscopy.

Journal of Imaging Informatics in Medicine. Pub Date: 2025-02-12. DOI: 10.1007/s10278-025-01431-8

Francesca Pia Villani, Maria Chiara Fiorentino, Lorenzo Federici, Cesare Piazza, Emanuele Frontoni, Alberto Paderno, Sara Moccia


Citations: 0

Abstract


A Deep-Learning Approach for Vocal Fold Pose Estimation in Videoendoscopy.

Accurate vocal fold (VF) pose estimation is crucial for diagnosing laryngeal diseases that can eventually lead to VF paralysis. Videoendoscopic examination is used to assess VF motility, usually by estimating the change in the anterior glottic angle (AGA). This is a subjective and time-consuming procedure that requires extensive expertise. This research proposes a deep learning framework to estimate VF pose from laryngoscopy frames acquired in actual clinical practice. The framework performs heatmap regression on three anatomically relevant keypoints, which serve as a prior for AGA computation: the AGA is estimated from the coordinates of the predicted points. The framework is assessed on a newly collected dataset of 471 laryngoscopy frames from 124 patients, 28 of whom had cancer. It was tested in various configurations and compared with other state-of-the-art approaches (direct keypoint regression and glottal segmentation) for both pose estimation and AGA evaluation. Among all the models tested for VF pose estimation, the proposed framework obtained the lowest root mean square error (RMSE) on all the keypoints (5.09, 6.56, and 6.40 pixels, respectively). For AGA evaluation, heatmap regression also reached the lowest mean absolute error (MAE) ($5.87^{\circ}$). The results show that keypoint heatmap regression enables VF pose estimation with a small error and overcomes drawbacks of state-of-the-art algorithms, especially on challenging images such as those from pathologic subjects or those affected by noise and occlusion.
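The abstract describes the pipeline only at a high level: heatmaps are regressed for three keypoints, the AGA is computed from the predicted coordinates, and the results are scored with RMSE and MAE. The sketch below illustrates that chain in plain Python and is not the authors' implementation. It rests on several assumptions not stated in the abstract: keypoints are decoded by taking the heatmap maximum, the AGA is the angle at one keypoint (assumed here to be the anterior vertex) between the rays to the other two, and RMSE is taken over Euclidean point distances. All function names are hypothetical.

```python
import math


def decode_keypoint(heatmap):
    """Return (x, y) of the highest-valued cell in a 2D heatmap (list of rows).

    Assumption: a simple argmax decode; the paper may use a different scheme.
    """
    best_val, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_val, best_xy = val, (x, y)
    return best_xy


def angle_at_vertex(vertex, p_left, p_right):
    """Angle in degrees at `vertex` between the rays to p_left and p_right.

    Assumption: the AGA is measured at one of the three keypoints (the vertex)
    with the other two keypoints defining the two sides of the angle.
    """
    ax, ay = p_left[0] - vertex[0], p_left[1] - vertex[1]
    bx, by = p_right[0] - vertex[0], p_right[1] - vertex[1]
    dot = ax * bx + ay * by
    cross = ax * by - ay * bx
    # atan2 on (|cross|, dot) is numerically stable and yields a value in [0, 180].
    return math.degrees(math.atan2(abs(cross), dot))


def rmse(pred_points, true_points):
    """Root mean square of Euclidean distances between matched points (pixels)."""
    squared = [
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred_points, true_points)
    ]
    return math.sqrt(sum(squared) / len(squared))


def mae(pred_values, true_values):
    """Mean absolute error between matched scalar values (e.g. angles in degrees)."""
    errors = [abs(p - t) for p, t in zip(pred_values, true_values)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy 3x3 heatmap whose peak sits at column 2, row 1.
    toy_heatmap = [[0.0, 0.0, 0.0], [0.0, 0.2, 0.9], [0.0, 0.1, 0.0]]
    print("decoded keypoint:", decode_keypoint(toy_heatmap))

    # Toy keypoints in pixel coordinates (made up for illustration).
    vertex, left, right = (50, 10), (40, 60), (62, 58)
    print("angle (deg):", angle_at_vertex(vertex, left, right))
```

In this sketch, per-keypoint RMSE would be obtained by calling `rmse` once per keypoint across all test frames, and the AGA MAE by calling `mae` on the predicted and annotated angles.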

