Title: Developing a smart system for binary classification of disordered voices using machine learning
Authors: Yat Chun Au, Manwa L. Ng
Journal: American Journal of Otolaryngology, volume 46, issue 4, Article 104672 (JCR Q2, Otorhinolaryngology; impact factor 1.8)
Published: 2025-05-08
DOI: 10.1016/j.amjoto.2025.104672
URL: https://www.sciencedirect.com/science/article/pii/S0196070925000754
Citations: 0
Abstract
Objectives
Voice disorders are characterized by disruptions in voice quality caused by problems with vocal fold vibration during phonation. This study explored the use of machine learning, based on Random Forest (RF) and Decision Tree (DT) models, to classify normophonic and disordered voices from acoustic features. The RF and DT classifiers were compared, and the diagnostic utility of individual acoustic parameters was evaluated across multilingual databases, with an emphasis on Cantonese voice samples.
Methods
Sustained vowel /a/ recordings were extracted from the Saarbruecken Voice Database, the Perceptual Voice Qualities Database, and a local Cantonese clinical repository. A total of 1986 samples were used for training and testing. Twenty-nine acoustic features were extracted using Parselmouth, a Python interface to Praat. The RF and DT models were trained on overseas data and validated on local Cantonese recordings, then compared on classification accuracy, sensitivity, specificity, and F1-score. Feature importance was assessed using Mean Decrease in Impurity (MDI) and Mean Decrease in Accuracy (MDA). Receiver Operating Characteristic (ROC) analysis was performed to evaluate the discriminative ability of each acoustic parameter by sex and dataset origin.
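The comparison metrics named above can all be derived from a binary confusion matrix. The following is a minimal sketch in plain Python, assuming labels are coded 1 for disordered and 0 for normophonic. The function name `binary_metrics`, the `BinaryMetrics` container, and the example labels are illustrative assumptions, not the authors' code or data.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float  # true positive rate (recall)
    specificity: float  # true negative rate
    precision: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compute metrics from 0/1 labels, where 1 = disordered voice (assumed coding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return BinaryMetrics(
        accuracy=ratio(tp + tn, tp + tn + fp + fn),
        sensitivity=sensitivity,
        specificity=ratio(tn, tn + fp),
        precision=precision,
        f1=ratio(2 * precision * sensitivity, precision + sensitivity),
    )


# Made-up labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because accuracy alone can hide a high false negative rate.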
Results
The RF model outperformed the DT model: RF achieved an accuracy of 89%, a precision of 79%, and an F1 score of 77%, compared with 78% accuracy and a 61% F1 score for DT. RF showed higher true positive and true negative rates and a lower false negative rate, making it more suitable for clinical applications. Acoustic feature analysis identified age, CSID, and shimmer and jitter measures as key contributors to classification performance. ROC analyses revealed that CSID and stdevF0Hz were reliable discriminators for male voices, while CSID, localabsoluteJitter, apq11Shimmer, and localdbShimmer demonstrated strong classification performance in female voices across all datasets. However, threshold variability between local and overseas datasets highlights the need for population-specific calibration.
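To illustrate why a cut-off tuned on one population may not transfer to another, here is a minimal single-feature ROC sketch in plain Python. It assumes that higher feature values indicate a disordered voice and that the cut-off is chosen by maximizing Youden's J (sensitivity + specificity - 1). The abstract does not state how thresholds were selected, so that criterion, the function names, and both toy datasets are assumptions for illustration only.

```python
def roc_points(scores, labels):
    """Return (threshold, tpr, fpr) for each candidate cut-off, highest first.

    A sample is predicted disordered (1) when score >= threshold.
    Requires at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, tp / pos, fp / neg))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve, starting from (fpr=0, tpr=0)."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for _, tpr, fpr in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area


def youden_threshold(points):
    """Cut-off maximizing Youden's J = tpr - fpr (an assumed criterion)."""
    return max(points, key=lambda p: p[1] - p[2])[0]


# Toy feature values (made up): the second set is the first shifted by +1.5,
# mimicking a population whose baseline values differ.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
overseas_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
local_scores = [s + 1.5 for s in overseas_scores]

overseas_roc = roc_points(overseas_scores, labels)
local_roc = roc_points(local_scores, labels)
print("AUC:", auc(overseas_roc), auc(local_roc))
print("Thresholds:", youden_threshold(overseas_roc), youden_threshold(local_roc))
```

In this toy case both datasets are equally separable, so the AUC is the same, yet the best cut-off differs. That is the kind of situation in which a threshold would need recalibrating for the target population.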
Conclusion
This study underscores the potential of machine learning, particularly the RF algorithm, to enhance the accuracy of voice disorder diagnosis by automating acoustic feature analysis. Integrating such models into clinical practice could offer more reliable, non-invasive methods for early detection and management of voice disorders, thus improving patient outcomes. Future research should focus on expanding dataset diversity and on further validation to enhance the generalizability and clinical applicability of these findings.
About the journal:
Be fully informed about developments in otology, neurotology, audiology, rhinology, allergy, laryngology, speech science, bronchoesophagology, facial plastic surgery, and head and neck surgery. Featured sections include original contributions, grand rounds, current reviews, case reports and socioeconomics.