{"title":"In-the-Wild End-to-End Detection of Speech Affecting Diseases","authors":"Joana Correia, I. Trancoso, B. Raj","doi":"10.1109/ASRU46091.2019.9003754","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003754","url":null,"abstract":"Speech is a complex bio-signal that has the potential to provide a rich bio-marker for health. It enables the development of non-invasive routes to early diagnosis and monitoring of speech affecting diseases, such as the ones studied in this work: Depression, and Parkinson's Disease. However, the major limitation of current speech based diagnosis and monitoring tools is the lack of large and diverse datasets. Existing datasets are small, and collected under very controlled conditions. As such, there is an upper bound in the complexity of the models that can be trained using these datasets. There is also limited applicability in real life scenarios where the channel and noise conditions, among others, are impossible to control. In this work, we show that datasets collected from in-the-wild sources, such as collections of vlogs, can contribute to improve the performance of diagnosis tools both in controlled and in-the-wild conditions, even though the data are noisier. Moreover, we show that it is possible to successfully move away from hand-crafted features (i.e. features that are computed based on predefined algorithms, that based on human expertise) and adopt end-to-end modeling paradigms, such as CNN-LSTMs, that extract data driven features from the raw spectrograms of the speech signal, and capture temporal information from the speech signals.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132895179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Unified Endpointer Using Multitask and Multidomain Training","authors":"Shuo-yiin Chang, Bo Li, Gabor Simko","doi":"10.1109/ASRU46091.2019.9003787","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003787","url":null,"abstract":"In speech recognition systems, we generally differentiate the role of endpointers between long-form speech and voice queries, where they are responsible for speech detection and query endpoint detection respectively. Detection of speech is useful for segmentation and pre-filtering in long-form speech processing. On the other hand, query endpoint detection predicts when to stop listening and send audio received so far for actions. It thus determines system latency and is an essential component for interactive voice systems. For both tasks, endpointer needs to be robust in challenging environments, including noisy conditions, reverberant environments and environments with background speech, and it has to generalize well to different domains with different speaking styles and rhythms. This work investigates building a unified endpointer by folding the separate speech detection and query endpoint detection tasks into a single neural network model through multitask learning. A categorical domain representation is further incorporated into the model to encourage learning domain specific information. The final unified model achieves around 100 ms (18% relatively) latency improvement for near-field voice queries and 150 ms (21% relatively) for far-field voice queries over simply pooling all the data together and 7% relative frame error rate reduction for long-form speech compared to a standalone speech detection model. The proposed approach also shows good robustness to noisy environments and yields 180 ms latency improvement on voice queries from an unseen domain.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131674373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simplified LSTMS for Speech Recognition","authors":"G. Saon, Zoltán Tüske, Kartik Audhkhasi, Brian Kingsbury, M. Picheny, Samuel Thomas","doi":"10.1109/ASRU46091.2019.9003898","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003898","url":null,"abstract":"In this paper we explore new variants of Long Short-Term Memory (LSTM) networks for sequential modeling of acoustic features. In particular, we show that: (i) removing the output gate, (ii) replacing the hyperbolic tangent nonlinearity at the cell output with hard tanh, and (iii) collapsing the cell and hidden state vectors leads to a model that is conceptually simpler than and comparable in effectiveness to a regular LSTM for speech recognition. The proposed model has 25% fewer parameters than an LSTM with the same number of cells, trains faster because it has larger gradients leading to larger steps in weight space, and reaches a better optimum because there are fewer nonlinearities to traverse across layers. We report experimental results for both hybrid and CTC acoustic models on three publicly available English datasets: Switchboard 300 hours telephone conversations, 400 hours broadcast news transcription, and the MALACH 176 hours corpus of Holocaust survivor testimonies. In all cases the proposed models achieve similar or better accuracy than regular LSTMs while being conceptually simpler.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125237468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi Purpose and Large Scale Speech Corpus in Persian and English for Speaker and Speech Recognition: The Deepmine Database","authors":"Hossein Zeinali, L. Burget, J. Černocký","doi":"10.1109/ASRU46091.2019.9003882","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003882","url":null,"abstract":"DeepMine is a speech database in Persian and English designed to build and evaluate text-dependent, text-prompted, and text-independent speaker verification, as well as Persian speech recognition systems. It contains more than 1850 speakers and 540 thousand recordings overall, more than 480 hours of speech are transcribed. It is the first public large-scale speaker verification database in Persian, the largest public text-dependent and text-prompted speaker verification database in English, and the largest public evaluation dataset for text-independent speaker verification. It has a good coverage of age, gender, and accents. We provide several evaluation protocols for each part of the database to allow for research on different aspects of speaker verification. We also provide the results of several experiments that can be considered as baselines: HMM-based i-vectors for text-dependent speaker verification, and HMM-based as well as state-of-the-art deep neural network based ASR. We demonstrate that the database can serve for training robust ASR models.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"332 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122834744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The MGB-5 Challenge: Recognition and Dialect Identification of Dialectal Arabic Speech","authors":"Ahmed M. Ali, Suwon Shon, Younes Samih, Hamdy Mubarak, Ahmed Abdelali, James R. Glass, S. Renals, K. Choukri","doi":"10.1109/ASRU46091.2019.9003960","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003960","url":null,"abstract":"This paper describes the fifth edition of the Multi-Genre Broadcast Challenge (MGB-5), an evaluation focused on Arabic speech recognition and dialect identification. MGB-5 extends the previous MGB-3 challenge in two ways: first it focuses on Moroccan Arabic speech recognition; second the granularity of the Arabic dialect identification task is increased from 5 dialect classes to 17, by collecting data from 17 Arabic speaking countries. Both tasks use YouTube recordings to provide a multi-genre multi-dialectal challenge in the wild. Moroccan speech transcription used about 13 hours of transcribed speech data, split across training, development, and test sets, covering 7-genres: comedy, cooking, family/kids, fashion, drama, sports, and science (TEDx). The fine-grained Arabic dialect identification data was collected from known YouTube channels from 17 Arabic countries. 3,000 hours of this data was released for training, and 57 hours for development and testing. The dialect identification data was divided into three sub-categories based on the segment duration: short (under 5 s), medium (5–20 s), and long (>20 s). Overall, 25 teams registered for the challenge, and 9 teams submitted systems for the two tasks. We outline the approaches adopted in each system and summarize the evaluation results.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131258427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Fundamental Frequency Generation in EMG-to-Speech Conversion Using a Quantization Approach","authors":"Lorenz Diener, Tejas Umesh, Tanja Schultz","doi":"10.1109/ASRU46091.2019.9003804","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003804","url":null,"abstract":"We present a novel approach to generating fundamental frequency (intonation and voicing) trajectories in an EMG-to-Speech conversion Silent Speech Interface, based on quantizing the EMG-to-F0 mappings target values and thus turning a regression problem into a recognition problem. We present this method and evaluate its performance with regard to the accuracy of the voicing information obtained as well as the performance in generating plausible intonation trajectories within voiced sections of the signal. To this end, we also present a new measure for overall F0 trajectory plausibility, the trajectory-label accuracy (TLAcc), and compare it with human evaluations. Our new F0 generation method achieves a significantly better performance than a baseline approach in terms of voicing accuracy, correlation of voiced sections, trajectory-label accuracy and, most importantly, human evaluations.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131524973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Investigation of LSTM-CTC based Joint Acoustic Model for Indian Language Identification","authors":"Tirusha Mandava, R. Vuddagiri, Hari Krishna Vydana, A. Vuppala","doi":"10.1109/ASRU46091.2019.9003784","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003784","url":null,"abstract":"In this paper, phonetic features derived from the joint acoustic model (JAM) of a multilingual end to end automatic speech recognition system are proposed for Indian language identification (LID). These features utilize contextual information learned by the JAM through long short-term memory-connectionist temporal classification (LSTM-CTC) framework. Hence, these features are referred to as CTC features. A multi-head self-attention network is trained using these features, which aggregates the frame-level features by selecting prominent frames through a parametrized attention layer. The proposed features have been tested on IIITH-ILSC database that consists of 22 official Indian languages and Indian English. Experimental results demonstrate that CTC features outperformed i-vector and phonetic temporal neural LID systems and produced an 8.70% equal error rate. The fusion of shifted delta cepstral and CTC feature-based LID systems at the model level and feature level further improved the performance.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132756123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Native Language Identification from Raw Waveforms Using Deep Convolutional Neural Networks with Attentive Pooling","authors":"Rutuja Ubale, Vikram Ramanarayanan, Yao Qian, Keelan Evanini, C. W. Leong, Chong Min Lee","doi":"10.1109/ASRU46091.2019.9003872","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003872","url":null,"abstract":"Automatic detection of an individual's native language (L1) based on speech data from their second language (L2) can be useful for informing a variety of speech applications such as automatic speech recognition (ASR), speaker recognition, voice biometrics, and computer assisted language learning (CALL). Previously proposed systems for native language identification from L2 acoustic signals rely on traditional feature extraction pipelines to extract relevant features such as mel-filterbanks, cepstral coefficients, i-vectors, etc. In this paper, we present a fully convolutional neural network approach that is trained end-to-end to predict the native language of the speaker directly from the raw waveforms, thereby removing the feature extraction step altogether. Experimental results using this approach on a database of 11 different L1s suggest that the learnable convolutional layers of our proposed attention-based end-to-end model extract meaningful features from raw waveforms. Further, the attentive pooling mechanism in our proposed network enables our model to focus on the most discriminative features leading to improvements over the conventional baseline.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130367200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Time-Domain Speaker Extraction Network","authors":"Chenglin Xu, Wei Rao, Chng Eng Siong, Haizhou Li","doi":"10.1109/ASRU46091.2019.9004016","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004016","url":null,"abstract":"Speaker extraction is to extract a target speaker's voice from multi-talker speech. It simulates humans' cocktail party effect or the selective listening ability. The prior work mostly performs speaker extraction in frequency domain, then reconstructs the signal with some phase approximation. The inaccuracy of phase estimation is inherent to the frequency domain processing, that affects the quality of signal reconstruction. In this paper, we propose a time-domain speaker extraction network (TseNet) that doesn't decompose the speech signal into magnitude and phase spectrums, therefore, doesn't require phase estimation. The TseNet consists of a stack of dilated depthwise separable convolutional networks, that capture the long-range dependency of the speech signal with a manageable number of parameters. It is also conditioned on a reference voice from the target speaker, that is characterized by speaker i-vector, to perform the selective listening to the target speaker. Experiments show that the proposed TseNet achieves 16.3% and 7.0% relative improvements over the baseline in terms of signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) under open evaluation condition.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114287367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge Distillation from Bert in Pre-Training and Fine-Tuning for Polyphone Disambiguation","authors":"Hao Sun, Xu Tan, Jun-Wei Gan, Sheng Zhao, Dongxu Han, Hongzhi Liu, Tao Qin, Tie-Yan Liu","doi":"10.1109/ASRU46091.2019.9003918","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003918","url":null,"abstract":"Polyphone disambiguation aims to select the correct pronunciation for a polyphonic word from several candidates, which is important for text-to-speech synthesis. Since the pronunciation of a polyphonic word is usually decided by its context, polyphone disambiguation can be regarded as a language understanding task. Inspired by the success of BERT for language understanding, we propose to leverage pre-trained BERT models for polyphone disambiguation. However, BERT models are usually too heavy to be served online, in terms of both memory cost and inference speed. In this work, we focus on efficient model for polyphone disambiguation and propose a two-stage knowledge distillation method that transfers the knowledge from a heavy BERT model in both pre-training and fine-tuning stages to a lightweight BERT model, in order to reduce online serving cost. Experiments on Chinese and English polyphone disambiguation datasets demonstrate that our method reduces model parameters by a factor of 5 and improves inference speed by 7 times, while nearly matches the classification accuracy (95.4% on Chinese and 98.1% on English) to the original BERT model.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124855465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}