Recognizing of Vocal Fold Disorders From High Speed Video: Use of Spatio-Temporal Deep Neural Networks
Dhouha Attia, Amel Benazza-Benyahia
International Journal of Imaging Systems and Technology, vol. 35, no. 5, published 2025-08-04
DOI: 10.1002/ima.70170 (https://onlinelibrary.wiley.com/doi/10.1002/ima.70170)
Citations: 0
Abstract
This work aims to design advanced computer-aided diagnosis systems that leverage deep learning to identify vocal fold (VF) disorders from high-speed videos of the laryngeal area. The challenges lie in the high dimensionality of video data, the need for precise temporal resolution to capture rapid glottal dynamics, and the inherent variability of VF motion across individuals. Moreover, distinguishing pathological patterns from normal variations remains difficult because disorder characteristics are subtle and overlapping. The primary objective of this research is to demonstrate the improvement in classification performance achieved when both temporal and spatial information are incorporated into the analysis. Temporal information plays a particularly crucial role when combined with spatial data, as it provides a more comprehensive picture of dynamic vocal fold behavior. To this end, we highlight the importance of designing inputs for the deep neural network that capture the temporal dynamics of the glottal cycle, ensuring that the cycle's inherent temporal variability is appropriately represented in the input data. A key innovation of this work is the exploration and evaluation of several spatio-temporal deep learning architectures, which are systematically compared with traditional architectures that rely solely on spatial information. The comparative analysis aims to determine to what extent incorporating temporal information improves diagnostic accuracy. Among the tested models, the transformer-based architectures ViViT and TimeSformer achieve the best objective performance in terms of F1-score (around 0.93), with ViViT being the lighter-weight of the two. In summary, this paper underscores the importance of exploiting spatio-temporal information from the region of interest for more effective identification of VF disorders.
Using both 3D deep learning models and transformer-based architectures, our approach offers a robust solution for diagnosing vocal fold pathologies, paving the way for future advances in computer-aided medical diagnostics.
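The abstract emphasizes building network inputs that represent the temporal dynamics of the glottal cycle despite its variable duration. The paper does not publish its preprocessing code, but a common way to feed variable-length cycles into clip-based models such as ViViT or TimeSformer is uniform temporal sampling to a fixed number of frames. The sketch below illustrates that idea only; the function name, frame count, and frame size are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def sample_clip(frames: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample a fixed-length clip from a variable-length sequence.

    Illustrative sketch: `frames` is a (T, H, W) stack of grayscale crops
    around the glottis for one glottal cycle; the output is a
    (num_frames, H, W) clip with constant temporal size, so every cycle
    maps to the same input shape regardless of its original length T.
    """
    t = frames.shape[0]
    # Evenly spaced (possibly repeated) frame indices covering the cycle.
    idx = np.linspace(0, t - 1, num_frames).round().astype(int)
    return frames[idx]

# Toy example: a 50-frame "cycle" of 64x64 crops.
cycle = np.random.rand(50, 64, 64).astype(np.float32)
clip = sample_clip(cycle, num_frames=16)
print(clip.shape)  # (16, 64, 64)
```

Because `np.linspace` covers the full index range, the sampled clip preserves the opening and closing phases of the cycle even when the source sequence is shorter or longer than the target length (short cycles simply repeat frames).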
About the journal:
The International Journal of Imaging Systems and Technology (IMA) is a forum for the exchange of ideas and results relevant to imaging systems, including imaging physics and informatics. The journal covers all imaging modalities in humans and animals.
IMA accepts technically sound and scientifically rigorous research in the interdisciplinary field of imaging, including relevant algorithmic research and hardware and software development, and their applications relevant to medical research. The journal provides a platform to publish original research in structural and functional imaging.
The journal is also open to imaging studies of the human body and of animals that describe novel diagnostic imaging and analysis methods. Technical, theoretical, and clinical research in both normal and clinical populations is encouraged. Submissions describing methods, software, databases, replication studies, and negative results are also considered.
The scope of the journal includes, but is not limited to, the following in the context of biomedical research:
Imaging and neuro-imaging modalities: structural MRI, functional MRI, PET, SPECT, CT, ultrasound, EEG, MEG, NIRS etc.;
Neuromodulation and brain stimulation techniques such as TMS and tDCS;
Software and hardware for imaging, especially related to human and animal health;
Image segmentation in normal and clinical populations;
Pattern analysis and classification using machine learning techniques;
Computational modeling and analysis;
Brain connectivity and connectomics;
Systems-level characterization of brain function;
Neural networks and neurorobotics;
Computer vision, based on human/animal physiology;
Brain-computer interface (BCI) technology;
Big data, databasing and data mining.