{"title":"Multimodal deep learning methods for speech and language rehabilitation: a cross-sectional observational study.","authors":"Xinqiao Cen","doi":"10.1080/17483107.2025.2551708","DOIUrl":null,"url":null,"abstract":"<p><p>The speech and language rehabilitation are essential to people who have disorders of communication that may occur due to the condition of neurological disorder, developmental delays, or bodily disabilities. With the advent of deep learning, we introduce an improved multimodal rehabilitation pipeline that incorporates audio, video, and text information in order to provide patient-tailored therapy that adapts to the patient. The technique uses a cross-attention fusion multimodal hierarchical transformer architectural model that allows it to jointly design speech acoustics as well as the facial dynamics, lip articulation, and linguistic context. We adopt the strategy of self-supervised pretraining on large-scale unlabelled corpora and domain-adaptive fine-tuning with data augmentation in order to overcome the problem of cohort size and interpatient variability. A low latency inference architecture will provide real-time feedback and individualised changes to therapy. Clinical and synthetic test results show our method trained and verified on clinical and synthetic data fare better than uni-modal and conventional fusion baselines in terms of accuracy, patient engagement, and measurable therapeutic benefit. Such findings point out opportunities of using intelligent, multimodal deep learning systems to reinvent future of speech and language rehabilitation.</p>","PeriodicalId":47806,"journal":{"name":"Disability and Rehabilitation-Assistive Technology","volume":" ","pages":"1-13"},"PeriodicalIF":2.2000,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Disability and Rehabilitation-Assistive Technology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/17483107.2025.2551708","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"REHABILITATION","Score":null,"Total":0}
Abstract
Speech and language rehabilitation is essential for people with communication disorders arising from neurological conditions, developmental delays, or physical disabilities. Leveraging advances in deep learning, we introduce an improved multimodal rehabilitation pipeline that integrates audio, video, and text information to deliver therapy tailored to the individual patient. The approach uses a multimodal hierarchical transformer architecture with cross-attention fusion, allowing it to jointly model speech acoustics, facial dynamics, lip articulation, and linguistic context. We adopt self-supervised pretraining on large-scale unlabelled corpora and domain-adaptive fine-tuning with data augmentation to overcome limited cohort sizes and inter-patient variability. A low-latency inference architecture provides real-time feedback and individualised adjustments to therapy. Results on clinical and synthetic test data show that our method, trained and validated on both, outperforms unimodal and conventional fusion baselines in accuracy, patient engagement, and measurable therapeutic benefit. These findings highlight the potential of intelligent, multimodal deep learning systems to reshape the future of speech and language rehabilitation.
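To illustrate the kind of cross-attention fusion the abstract describes, the sketch below shows one minimal way to let an audio stream attend to video (facial/lip) and text (linguistic) streams inside a transformer-style block. It is an assumption-laden illustration using standard PyTorch components, not the authors' implementation; the module name, dimensions, and fusion order are all hypothetical.

```python
# Minimal sketch of cross-attention fusion over audio, video, and text streams.
# All names, dimensions, and the fusion order are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse an audio stream with video and text context via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio tokens act as queries; video and text tokens provide keys/values.
        self.attn_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio, video, text):
        # audio: (B, Ta, dim), video: (B, Tv, dim), text: (B, Tt, dim)
        a, _ = self.attn_video(audio, video, video)  # attend to facial/lip dynamics
        t, _ = self.attn_text(audio, text, text)     # attend to linguistic context
        fused = self.norm(audio + a + t)             # residual fusion of both streams
        return self.norm(fused + self.ffn(fused))    # position-wise feed-forward


# Example: fuse 100 audio frames with 50 video frames and 20 text tokens.
fusion = CrossAttentionFusion()
audio = torch.randn(2, 100, 256)
video = torch.randn(2, 50, 256)
text = torch.randn(2, 20, 256)
print(fusion(audio, video, text).shape)  # torch.Size([2, 100, 256])
```

In a hierarchical architecture such blocks would typically be stacked, with lower layers fusing frame-level features and higher layers operating on utterance-level summaries; the abstract does not specify those details, so they are omitted here.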