In-the-Wild End-to-End Detection of Speech Affecting Diseases
Authors: Joana Correia, I. Trancoso, B. Raj
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Published: 2019-12-01
DOI: 10.1109/ASRU46091.2019.9003754 (https://doi.org/10.1109/ASRU46091.2019.9003754)
Speech is a complex bio-signal with the potential to provide a rich bio-marker for health. It enables non-invasive routes to the early diagnosis and monitoring of speech-affecting diseases, such as the ones studied in this work: depression and Parkinson's disease. However, the major limitation of current speech-based diagnosis and monitoring tools is the lack of large and diverse datasets. Existing datasets are small and collected under highly controlled conditions. This places an upper bound on the complexity of the models that can be trained on them, and it limits their applicability in real-life scenarios, where channel and noise conditions, among other factors, are impossible to control. In this work, we show that datasets collected from in-the-wild sources, such as collections of vlogs, can help improve the performance of diagnosis tools in both controlled and in-the-wild conditions, even though the data are noisier. Moreover, we show that it is possible to move away from hand-crafted features (i.e., features computed by predefined algorithms that are based on human expertise) and adopt end-to-end modeling paradigms, such as CNN-LSTMs, that extract data-driven features from the raw spectrograms of the speech signal and capture its temporal information.