Multimodal laryngoscopic video analysis for assisted diagnosis of vocal fold paralysis

IF 3.4 · Zone 3 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Computer Speech and Language · Pub Date: 2025-10-06 · DOI: 10.1016/j.csl.2025.101891

Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Juan Liu, Faya Liang, Ming Li

Computer Speech and Language, Volume 96, Article 101891. Full text: https://www.sciencedirect.com/science/article/pii/S0885230825001160

Citations: 0

Abstract

This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond extracting key video segments from the raw laryngeal videos, MLVAS generates effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders encode the patient's voice to obtain the audio features. Visual features are generated by measuring, on the segmented glottis masks, the angle deviation of the left and right vocal folds from the estimated glottal midline. To obtain better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modality in the proposed MLVAS. Experimental results on a public segmentation dataset show the effectiveness of the proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinical dataset demonstrate MLVAS's ability to provide reliable and objective metrics as well as visualization for assisted clinical diagnosis.
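The visual feature described in the abstract (angle deviation of each vocal fold from the glottal midline, measured on a binary glottis mask) can be illustrated with a minimal sketch. The abstract does not say how the midline or the fold edges are estimated, so everything below is an assumption for illustration only: the midline is taken as the principal axis of all mask pixels, each fold edge is approximated by a line fitted to the leftmost or rightmost mask pixel of every row, and the function names `principal_axis`, `angle_between`, and `fold_angle_deviation` are hypothetical rather than part of MLVAS.

```python
import math


def principal_axis(points):
    """Return the orientation (radians) of the largest-variance axis of 2-D points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Closed-form orientation of the dominant eigenvector of the 2x2 scatter matrix.
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def angle_between(theta_a, theta_b):
    """Smallest angle in degrees between two undirected lines."""
    d = abs(theta_a - theta_b) % math.pi
    return math.degrees(min(d, math.pi - d))


def fold_angle_deviation(mask):
    """Return (left, right) edge deviations in degrees from the assumed midline.

    mask is a list of rows, each a list of 0/1 values (1 = glottis pixel).
    """
    rows = []
    for y, row in enumerate(mask):
        xs = [x for x, v in enumerate(row) if v]
        if xs:
            rows.append((y, min(xs), max(xs)))
    if len(rows) < 2:
        raise ValueError("mask needs at least two non-empty rows")

    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    midline = principal_axis(pixels)                     # assumed midline estimate
    left = principal_axis([(x0, y) for y, x0, _ in rows])   # leftmost pixel per row
    right = principal_axis([(x1, y) for y, _, x1 in rows])  # rightmost pixel per row
    return angle_between(left, midline), angle_between(right, midline)


if __name__ == "__main__":
    # Synthetic symmetric wedge: apex at the top, widening by one pixel every four rows.
    demo = [[1 if abs(x - 15) <= y // 4 else 0 for x in range(31)] for y in range(41)]
    print(fold_angle_deviation(demo))
```

On a symmetric mask the two deviations should be nearly equal. A principal-axis midline tends to follow an asymmetric glottal opening, so a different midline estimator may be needed before a left/right difference can be read as a unilateral paralysis cue.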




Source journal

Computer Speech and Language (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore

11.30

Self-citation rate

4.70%

Articles published

Review time

22.9 weeks

Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.