Recognizing of Vocal Fold Disorders From High Speed Video: Use of Spatio-Temporal Deep Neural Networks
Dhouha Attia, Amel Benazza-Benyahia
International Journal of Imaging Systems and Technology, vol. 35, no. 5, published 2025-08-04
DOI: 10.1002/ima.70170 (https://onlinelibrary.wiley.com/doi/10.1002/ima.70170)
Citations: 0
Abstract
This work aims to design advanced computer-aided diagnosis systems that leverage deep learning to identify vocal fold (VF) disorders from high-speed videos of the laryngeal area. The challenges lie in the high dimensionality of video data, the need for precise temporal resolution to capture rapid glottal dynamics, and the inherent variability of VF motion across individuals. Moreover, distinguishing pathological patterns from normal variations remains difficult because disorder characteristics are subtle and overlapping. The primary objective of this research is to demonstrate the improvement in classification performance achieved when both temporal and spatial information are incorporated into the analysis. Temporal information plays a particularly crucial role when combined with spatial data, as it provides a more comprehensive picture of dynamic vocal fold behavior. To this end, we highlight the importance of designing inputs for the deep neural network that capture the temporal dynamics of the glottal cycle, ensuring that the cycle's inherent temporal variability is appropriately represented in the input data. A key innovation of this work is the exploration and evaluation of several spatio-temporal deep learning architectures, which are systematically compared with traditional architectures that rely solely on spatial information. The comparative analysis aims to determine to what extent incorporating temporal information improves diagnostic accuracy. Among the tested models, the transformer-based architectures ViViT and TimeSformer achieve the best objective performance in terms of F1-score (around 0.93), with ViViT being the lighter-weight of the two. In summary, this paper underscores the importance of exploiting spatio-temporal information from the region of interest for more effective identification of VF disorders.
Using both 3D deep learning models and transformer-based architectures, our approach offers a robust solution for diagnosing vocal fold pathologies, paving the way for future advances in computer-aided medical diagnostics.
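The abstract emphasizes building network inputs that represent the temporal dynamics of the glottal cycle despite its variable duration. The paper does not publish its preprocessing code, but a common way to feed variable-length cycles into clip-based models such as ViViT or TimeSformer is uniform temporal sampling to a fixed number of frames. The sketch below illustrates that idea only; the function name, frame count, and frame size are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def sample_clip(frames: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample a fixed-length clip from a variable-length sequence.

    Illustrative sketch: `frames` is a (T, H, W) stack of grayscale crops
    around the glottis for one glottal cycle; the output is a
    (num_frames, H, W) clip with constant temporal size, so every cycle
    maps to the same input shape regardless of its original length T.
    """
    t = frames.shape[0]
    # Evenly spaced (possibly repeated) frame indices covering the cycle.
    idx = np.linspace(0, t - 1, num_frames).round().astype(int)
    return frames[idx]

# Toy example: a 50-frame "cycle" of 64x64 crops.
cycle = np.random.rand(50, 64, 64).astype(np.float32)
clip = sample_clip(cycle, num_frames=16)
print(clip.shape)  # (16, 64, 64)
```

Because `np.linspace` covers the full index range, the sampled clip preserves the opening and closing phases of the cycle even when the source sequence is shorter or longer than the target length (short cycles simply repeat frames).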
About the journal:
The International Journal of Imaging Systems and Technology (IMA) is a forum for the exchange of ideas and results relevant to imaging systems, including imaging physics and informatics. The journal covers all imaging modalities in humans and animals.
IMA accepts technically sound and scientifically rigorous research in the interdisciplinary field of imaging, including relevant algorithmic research and hardware and software development, and their applications relevant to medical research. The journal provides a platform to publish original research in structural and functional imaging.
The journal is also open to imaging studies of the human body and of animals that describe novel diagnostic imaging and analysis methods. Technical, theoretical, and clinical research in both normal and clinical populations is encouraged. Submissions describing methods, software, databases, replication studies, and negative results are also considered.
The scope of the journal includes, but is not limited to, the following in the context of biomedical research:
Imaging and neuro-imaging modalities: structural MRI, functional MRI, PET, SPECT, CT, ultrasound, EEG, MEG, NIRS etc.;
Neuromodulation and brain stimulation techniques such as TMS and tDCS;
Software and hardware for imaging, especially related to human and animal health;
Image segmentation in normal and clinical populations;
Pattern analysis and classification using machine learning techniques;
Computational modeling and analysis;
Brain connectivity and connectomics;
Systems-level characterization of brain function;
Neural networks and neurorobotics;
Computer vision, based on human/animal physiology;
Brain-computer interface (BCI) technology;
Big data, databasing and data mining.